Analytics Intelligence Everywhere

Dispatches from GenAI and what AI could mean for data teams

Feb 19, 2023

Pop quiz!

Did you know that the European Southern Observatory took the first ever picture of an exoplanet (aka a planet outside our solar system) in the year 2004?
Do you think an incorrect answer to (1) is worth 100 billion dollars?

If by any chance, you are Google’s much anticipated new chatbot Bard, then no - you did not know this, falsely claiming in a recent demonstration that the James Webb Space Telescope was the first to snag a photo of an elusive exoplanet.

This is not the first picture of an exoplanet. It is not a star. It is in fact, Chorizo

The answer to (2), in the peculiar circumstance where you are Google demoing the Chatbot that may or may not be the future of search - then yes, missing a factual question can cost you the big bucks.

But why did one innocent little mistake by a chatbot cause so much friction? And why open an article on the future of data teams with obscure trivia about exoplanets and chatbots?1

The markets had a strong response to Bard’s miss for two reasons:

Unless you’ve been living under a rock (and a very large rock at that) you’ve seen the hype waves from ChatGPT hitting every industry thought leader, thinkfluencer and prognosticator. The hype is real - but so is the fact that ChatGPT has proven to every business leader that we need to be taking the advances in artificial intelligence very seriously. This is doubly true for the tech giants, who are seeing the sands shift beneath them in ways that they haven’t had to contend with since the rise of smartphone.
Despite their undeniable promise, Large Language Models (LLMs) are weird. They can, as in the case of Bard, hallucinate2 false answers to questions. Or they can instantly turn from helpful search engines to something far stranger.

Whomst among us has not been tempted to respond to a request to “just pull a number real quick” with something along these lines

The fallout from Bard’s exoplanet mishap wasn’t just about that singular error - it raised questions about how we’ll be able to integrate these models into practical applications and business critical systems. It’s a dream to have a system that intelligently feeds you the correct answers to questions with instant knowledge and context over internet-sized surface areas, but it’s a nightmare that these systems might hallucinate3 a fake answer as easily as give you a real one.

This is where we start to see what the LLM revolution could mean for data teams - balancing the promise of instantaneous knowledge with the need to ensure guardrails and reliability. We’ve got great reason to think this is doable - but before we get there, we have to take one more journey - to see how the most advanced operational implementors of LLMs are already laying the groundwork for where we all might be going.

GenAI and the unmistakeable feeling of the future coming at you fast

Last week I had the chance4 to attend the first conference on Generative AI hosted by Jasper. Jasper is a pioneer in the area of enterprise deployment of LLM technology with a focus around using AI to generate marketing content. Their user base are some of the first people to be putting into practice a new set of LLM enabled workflows.

The overwhelming mood at the conference was giddy excitement mixed with stunned disbelief.5 The talks were an interesting juxtaposition between more traditional roadmap and feature updates for Jasper alongside wide ranging discussions about the future of AI. Everyone there felt something was coming, something with the potential to be big on a scale we’ve rarely witnessed before.

It’s one thing to hear that the future comes at you fast. It’s another to see it first hand.

The most interesting conversations I had weren’t with starry eyed futurists talking about scaling laws and the rise of AGI. They were with pragmatic marketing leaders who were there not because they were AI enthusiasts, but because they had already integrated Jasper into their content creation workflows and are seeing real value right now.

I talked to an HR marketing professional that told me her whole team is using Jasper to accelerate their content creation efforts. The outputs aren’t perfect, but they are more than good enough to improve the speed and efficiency of her team, particularly on more rote or repetitive assignments.

Without doubt the most common complaint was one that Jasper themselves brought to light with their biggest product launch of the day - that outputs tend to be generic and lack organizational context.

Their solution: Jasper Brand Voice which lets you fill in details of your companies history and positioning and include them in the working memory of your model. To get models that know how to create marketing materials which feel authentic and real, you need to bake some foundational knowledge into them.

Now Jason, you might be thinking, that’s all well and good for content marketing but I run a data team. I can’t type up a list of bullet points that contain everything an AI assistant would need to know about my data. I’d need a system that integrated together all of the detailed knowledge I’ve accumulated about data in my organization and how it is connected. I’d need some sort of … layer.

Self serve for real this time - stable interfaces and dynamic engagement

It’s not really theoretical at this point how all of this plays out for data teams - Michael Irvine and Delphi have shown us what this looks like. Because data practitioners have a new tool in their toolbox - one that fits in with dynamic chat perfectly and whose strengths are tailor made to help compensate for areas where LLMs and chatbots fall short.

An interface that allows for the creation and curation of business metrics in a way that enables you to query them and trust that you’ll get the same answers every time - no hallucinations. That allows your data team to build a robust and stable interface for your data model on top of a mature transformation framework. The dbt Semantic Layer.

I know, I know, take it to #vendor-content. But watch this:

I’ve dreamt of a slackbot that can answer data questions ever since I tried and failed to get my old team using Lookerbot - great idea that wasn’t ready for primetime when I tried it.6 But the combination of the semantic layer and LLM powered analytics bots shows real promise for being the solution I went looking for years ago.

With the rollout of improved LLM models on which to build services like Delphi and upcoming increases in functionality for the Semantic Layer, it will soon be the case7 that many ad-hoc and ongoing data requests will be handled by improved tooling. Which is great, because that lets us focus on the hard stuff moving forward.

The two hard problems for data teams

So if we’re not spending our days answering ad-hoc questions about what revenue was in EMEA last week - what are we doing?

We’re focusing on what have always been the two hard problems for data teams to solve:

Creating reliable, accurate and timely datasets that are based upon the business logic of the organization and reflect the real life problems that need to be solved
Building trust and engaging with stakeholders to make data informed decisions

These two problems represent two deep and powerful skillsets and ones that are going to remain distinctly human for some time to come.8

The work of creating reliable and accurate datasets is not just technical and complex - it relies upon the fundamentally creative act of determining what should be reported and why. This is absolutely work where things like GitHub CoPilot, Hex Magic or other LLM assisted tooling could be integrated into workflow - but any analytics engineer will tell you that rarely is the actual writing of SQL or Jinja the most difficult part of their job. It’s working with the business to figure out what needs to be reported on and why.

Critics of LLMs will often point out that their outputs are less and less “artificial intelligence” than repackaged and remixed variants of human intelligence. In the case of operationalizing LLMs in your organization - this is far more a feature than a bug, allowing you to have your human intelligence encoded in the semantic layer then dynamically queried by an LLM.

But it’s not just creating the datasets - even in a world where self-serve is a fully solved problem we will always come back to the beloved, nebulous work of “building trust and engaging with stakeholders”. The best data teams know that we aren’t just here to deliver spreadsheets or even to “drive insights”. We’re here to be trusted partners of organizational decision makers, helping guide strategic and tactical decisions via our analytics expertise and our unique place in the org.

It all comes down to trust. It was true when Barry said it and it’s going to be just as true moving forward. Your data team’s currency, as ephemeral and difficult to quantify as this may be, is trust.

Soon, we’re going to have new tools to help us build and maintain trust. Our analytics won’t just live in our heads or our business systems or our BI tools. It will be available, intelligently, across systems and backed by standards and stable interfaces. Once we get that working, well then it’s time to focus on the hard stuff.

From around the web:

Metricalypse Now by David Jayatillake

David puts together some great analysis on dbt Labs’ acquisition of Transform and what it means for data practitioners. Good companion piece to the segement above.

This is the first time we’ve had a semantic layer that wasn’t tied into a BI tool, of high quality, that could be used by tens of thousands of organisations by the end of this decade. If we as a community choose to adopt it, and now there are many reasons to do so and many less not to, we could end up with a semantic layer as universal as dbt-core is for transformation today.
Data folks, when you deal with your data consumption vendors this year, ask them about how soon they will be integrating with the new dbt semantic layer. Don’t ask them “if”, ask them “when” and how deep the integration will be.
This is as much an opportunity for us as it is for dbt Labs: we can have a ubiquitous semantic layer, independent of any cloud or BI tool, that will integrate with our tools.

Brainly’s Data Mesh Journey by Pedro Mir

The data mesh, as a concept combining technical and social aspects of managing large and complex data products has proven remarkably resilient.

I really enjoyed hearing what worked and what was challenging in the implementation. It’s clear that giving teams within an organization autonomy over the data they control and are subject matter experts in is a huge win - and that coordinating that across different areas of the business can prove challenging.

If you have strong opinions on this topic I implore you to read Jeremy’s proposal for implementing multi-project constructs in dbt Core.

dbt is jQuery, not Terraform by Kshitij Aranke

People give me weird looks sometimes when I tell them that dbt is fun. But Kshitij who joined the dbt Core team recently and is a fellow Bret Victor enthusiast knows the truth.

I can enter a flow state in dbt ... I can feel my dbt models coming to life thanks to the tight local feedback loop. I don’t need to flit between reading documentation and writing code, since dbt fits in my working memory. I’m free to play in a dbt sandbox without worrying about bankrupting my employer.
That last feeling – play – makes working in dbt a deeply personal experience

The Turbine Viewpoint by Emilie Schario

One of the true joys of the past few years has been watching data leaders turn their attention to new problems and start new organizations. Longtime Community member Emilie Schario (who gave one of the all time great Coalesce talks) has delivered the Viewpoint for her new company Turbine - and it’s ambitious.

Check this out - if for no other reason than because it managed to do the impossible and get me excited about the concept of an ERP.

When ERPs- enterprise resource planning software- were first introduced, they made lofty promises to companies of automation, improvement, and a new brain. ERPs aimed to hold all of your business’s operational data. They were the brain of a company, with all the information you needed to run an efficient and effective operation in one place.

That’s it for the week - thanks for reading! And if you’re building something that fits in with themes discussed above, shoot me a message on the dbt Slack.

Look for my forthcoming science fiction short story collection “obscure trivia about exoplanets and chatbots” coming your way soon

Yes, this is actually the technical term for it

Or, if you prefer, bullshit

Thanks to Cameron Afzal from our team for generously donating me his ticket after he got pulled into an offsite

Except for the Web 3 expats, who of course saw this coming the whole time

My one success here was naming the Lookerbot instance on our company slack “Ganzbot”, a name by which our old exec team calls me to this day

A good rule of thumb right now is not to take anyone making confident predictions about the future of AI seriously, so please take this with a good deal of humility

Seriously - remember what I said above about people making predictions.

The Analytics Engineering Roundup

Discussion about this post