Discover more from The Analytics Engineering Roundup
LLM Implications on Analytics (and Analysts!)
The capabilities are impressive, but will this actually change your job?
A new episode of the Analytics Engineering Podcast just dropped! And it’s a doozy. In it, Julia and I speak with longtime collaborators Mike Stonebraker and Andy Palmer. If you’ve been in the data space for a minute, you’ll know Stonebraker in particular—he has been behind some of the most iconic database systems of all time.
This was one of the most fun episodes we’ve ever done. Well worth a listen.
Enjoy the issue! :D
It’s been a minute since I saw a Jeff Dean post make the rounds, and this one is…impressive. Industry gossip would have one believe that ChatGPT has Google “on the back foot” in the “AI war”. I have a hard time buying this narrative, but I wouldn’t be surprised if Dean and Google specifically published this brag-fest to attempt to put that narrative to rest.
There’s honestly too much in the post to even attempt to comment on it except to say that you will definitely learn something from it even if you’re a close observer of the space. There’s just a lot of very cool stuff in there.
On the narrative…specifically, is it possible that an interface like ChatGPT could take real market share from the cash cow of traditional web search? Do people want machine-generated direct answers to their questions more than they want a list of ten blue links? Maybe. But I don’t think that’s a very interesting question. Google search results already include a ton of just-answer-the-question-for-me interfaces; there’s no particular technical reason why they couldn’t continue to evolve their interface to surface more direct answers, including conversational ones, if that’s actually what users turn out to want for a given type of question.
As usual, I think the more interesting question is a business model one. Google monetizes by selling placements in its search results; do more conversational interfaces lend themselves naturally to product placements in the same way? Honestly I think it would be pretty creepy if ChatGPT started, in conversation, pitching me on a product 😬 In order for this interface to take off, it needs to figure out a magical monetization scheme.
For my money, I’m betting that that looks a lot more like selling API access to developers instead of directly productizing. There are just really a lot of use cases being demoed right now. Here’s a fun one.
I’m so impressed with this. There are a million small things that can be done, which is exactly what discovery of a new generalized technology is supposed to feel like.
More relevant to our space is the automation of the question-and-answer process that all data teams engage in constantly. I’ve covered LLMs’ abilities to write SQL before, but this is a particularly in-depth and useful teardown. The author’s conclusion was that the model got a lot of things, but by no means everything, right. This feels much more in the category of analyst-productivity-tool than analyst-replacement-tool; you still need the human who can reason symbolically and evaluate the veracity of the results.
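To make the "human evaluates the results" step a bit more concrete, here is a minimal sketch of one way an analyst might sanity-check a model-written query: run it against a tiny fixture table whose correct answer can be worked out by hand. Everything in it is an assumption for illustration only. The `orders` table, the `matches_expected` helper, and the `generated_sql` string (a stand-in for model output) are hypothetical and do not come from the teardown.

```python
import sqlite3

# Hypothetical fixture: a tiny orders table small enough to verify by hand.
FIXTURE_ROWS = [
    ("2023-01-01", "alice", 30.0),
    ("2023-01-01", "bob", 20.0),
    ("2023-01-02", "alice", 50.0),
]


def run_query(sql):
    """Run a query against a fresh in-memory copy of the fixture table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE orders (order_date TEXT, customer TEXT, amount REAL)"
        )
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", FIXTURE_ROWS)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def matches_expected(sql, expected):
    """Return True only if the query runs and returns the hand-computed rows."""
    try:
        return run_query(sql) == expected
    except sqlite3.Error:
        # A query that fails to parse or execute counts as a failed check.
        return False


# Stand-in for SQL written by a language model (assumed, not real output).
generated_sql = (
    "SELECT order_date, SUM(amount) FROM orders "
    "GROUP BY order_date ORDER BY order_date"
)

# Daily revenue computed by hand from FIXTURE_ROWS.
expected = [("2023-01-01", 50.0), ("2023-01-02", 50.0)]

print(matches_expected(generated_sql, expected))
```

A check like this only catches mistakes that the fixture happens to exercise, so it supports the analyst's judgment rather than replacing it.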
Seek is working on the same problem—data exploration, facilitated via language interfaces—and is particularly focused on the “making sure to say true things” angle. From Seek’s CEO, Sarah Nagy, in the article:
I predict that, as the generative AI hype cycle plays out, more conversations will be had about the flaws in the quality of AI-generated content, and how users can protect themselves from any inaccuracies.
Here’s another fascinating use case: invoking GPT-3 from inside of a dbt pipeline 👀
Today that feels…very non-performant and expensive. But if you read that post, you will get a sense of how incredibly powerful this could be. This is kind of breaking my brain a little bit so I’m going to put that thought on a shelf and come back to it later.
To wrap up this section: my strong belief is that we should anticipate meaningful innovation in analytical user experiences over the coming five years as a result of LLMs. There is work to be done, but the promise is too real and there is just too much value for it not to happen.
WOW I’m blown away by this work. Christophe and the team at Teads extended the dbt BigQuery adapter to support partition copying—improving the performance on large incremental models by (wait for it) 250x!??! The article is fantastic, and if you use dbt-bigquery, this functionality is already available to you on v1.4+. Thanks for writing this up, Christophe!
Very solid, very pragmatic piece on implementing data contracts in the data warehouse by Daniel Dicker. As the conversation around contracts has progressed, it has become both more detailed and more pragmatic—love both of these trends. We’re continuing to push on the topic of contracts (and, relatedly, distributed ownership of data domains) internally. The hype cycle has slowed (which is good IMO), but these topics are not going to stop being important.
Is “analytics engineer” a real job or are we all just drinking the kool-aid? Good news for all of us: it sounds like we are real.
In all seriousness, it often isn’t clear to folks coming from different entry points why there is a need to differentiate “analytics engineer” from “data engineer” or “BI developer” or other roles. This video does a solid job of talking through it.
Arpit is on a roll talking about customer data—or, as he would prefer to call it, “audience data”. I’m a fan of this change in thinking and I’ve experienced this shift myself as relationships between humans and organizations have changed dramatically since I started doing digital analytics in 2009. It used to be that we only had data on our customers—think Kmart in the 1990s—but that just couldn’t be less true today. Most of the people you know about likely haven’t paid you anything.
Arpit is writing a whole series on this broader topic and so far has published three posts:
Strangely enough, we all spend so much time talking about “data” in the abstract that we rarely talk about it in more specific terms. Data-about-people is definitely the single most important type of data (controversial?), and how we think and talk about it matters.