Discover more from The Analytics Engineering Roundup
This post is not about ChatGPT
This week was a bit of a breakthrough moment for AI models, and ChatGPT is… everywhere.
I think ChatGPT is genuinely exciting because it’s one of the most accessible examples of what’s possible in AI today. The revolution didn’t start this week, though. It started back in 2021, with things like the GitHub Copilot beta giving you surprisingly helpful code suggestions. At the time, it seemed too early and too uncanny. But ChatGPT just proved to anyone who can interact with a chatbot that there’s a there there:
Having said that, a fair bit has already been written about ChatGPT that I don't need to repeat. So instead, I’m going to point you to an article by Greg Meyer:
Actually in this issue:
A new Mode and where it fits into your modern data stack mental model
Modal by Erik B.
A 10/10 data meme by Nate Sooter
Ready? Let’s go!
The Goal of Data Teams
I feel incredibly called out by this Tweet. Katie — you have me 100% pegged and I love it. I lean on the crutch of this phrase a lot, and it’s just not good.
Organizations have been doing just fine at decision making until we data people came along (with a couple of notable exceptions — looking at you 2008 👀).
It’s also a little self-centered to say our jobs as data professionals are to help organizations and their leaders make decisions with data. It implies that without us, decision making will happen without data. Of course decision makers will use data; maybe it won’t be in a fancy notebook, but in a detailed spreadsheet instead. (Unless it’s willful ignorance, and then no amount of data will help you make a rational argument to the contrary.)
That doesn’t mean we all need to pack up and go home though:
To me this means the following:
Constantly asking: “Are we measuring and paying attention to the very best things that represent the drivers of our business?” Seek to constantly improve what’s possible. Repeat.
Helping to make sure that when there is an important change in one of the business drivers, someone in the organization who is empowered to react to this change has the information they need to be successful.
Helping folks across the business understand how they contribute to those drivers, and how to measure their own success in terms the rest of the business understands and finds valuable.
I agree with Katie that we can start with really small and easy steps:
Yes! This. I would also add to the bottom of that email three sentences of context and framing about why this matters to the business (i.e., why the recipient should care) and a recommendation for action if I have one.
Not every data insight needs to end with a recommendation for action because that implies the data team knows everything about, well... everything. But if you’re sending an email to someone with high influence but limited attention, you should 100% take the opportunity to share your hypotheses and recommendations and invite conversation.
Elsewhere on the internet…
Pulling stories out of data
Some time earlier this year, we talked about how everyone explains “how to Analytics” differently, the same way every new person trying to teach you to get up on a wakeboard will do it differently too.
Randy’s post is another flavor of “how to Analytics” that I recommend reading in full. It resonated with me because he touched on an aspect of analytics that’s a little messy and can be uncomfortable to talk about: there is always less certainty than we’d like when it comes to pulling stories out of data.
…data analysts are relying on their domain knowledge to generate a set of hypotheses that fit the data that is available. Sorta like in a mathematical induction proof, an analyst must come up with a formula that fits the pattern of the n observations of data in front of them. When new data stays consistent with the formula, then they have increasing confidence that their story is on the right track. BUT there can always be a single counter-example that pops up 50,000 entries later that disproves the story. There are no guarantees to truth.
This is a difficult space to occupy as a data leader. On the one hand, you want to tell a compelling and clear narrative: “We found X pattern, and believe it’s the result of Y change in the business, and here’s Z thing we should do about it”. On the other, that compelling and clear narrative often has more caveats to it than can be easily expressed in a quick one pager.
“If you want certainty, run an experiment” is the age-old data adage. But that requires tooling funding, buy-in from your engineering team about integrating a new platform into the critical path of your business’s application… and a volume of users and traffic large enough to hit power on all those A/B tests, a volume that many businesses haven’t actually reached yet. And it’s the businesses that are scaling rapidly that most look to anchor on the “certainty of hard data”.
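To put a rough number on that traffic constraint, here is a minimal sketch of a sample-size estimate for a two-variant conversion test. It uses the standard normal-approximation formula for comparing two proportions. The function name `users_per_arm`, the 5% baseline conversion rate, and the 10% relative lift are all illustrative assumptions of mine, not figures from any of the posts discussed here.

```python
from math import ceil, sqrt
from statistics import NormalDist


def users_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed in EACH variant of a two-sided A/B test.

    Normal-approximation formula for a two-proportion z-test.
    All inputs are hypothetical planning numbers, not measured values.
    """
    p1 = baseline_rate                        # control conversion rate
    p2 = baseline_rate * (1 + relative_lift)  # treatment rate we hope to detect
    p_bar = (p1 + p2) / 2                     # pooled rate under the null

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)          # desired statistical power

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical scenario: 5% baseline conversion, hoping to detect a 10% relative lift.
needed = users_per_arm(baseline_rate=0.05, relative_lift=0.10)
print(f"Users needed per variant: {needed:,}")
```

Under these assumed inputs the estimate lands in the tens of thousands of users per variant, which is exactly the kind of volume an early-stage product may not see for months. Halving the detectable lift roughly quadruples the requirement.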
My spicy take (for a data industry newsletter) on getting yourself out of this conundrum: invest in a user researcher (or two!) who knows SQL and possibly how to run a survey. Invest in humans who can navigate quantitative data but also know how to speak with your users and customers. Who know how to not only pull stories out of existing data, but tell stories with new data they create for you.
As of this week, Mode has a facelift, a dbt semantic layer integration (but if you were at Coalesce this year, you already knew that) and a really intriguing new feature called Datasets.
I immediately thought about Power BI’s and Tableau’s similar primitives when watching the Datasets announcement, and it sounds like Mode’s Datasets will be playing in the same space. Here's a great tl;dr about how Datasets and the dbt Semantic Layer integration sit together within Mode:
I agree with Benn on the need to solve the problem of data distribution. Just as it is important to separate the data layer and your control layer in your transformation step, it's also important for us to reason about what the best user experience is for distributing that dataset, and who will find it most valuable. I think dbt Semantic Layer + Mode Datasets is an example of this kind of focus on user experience. I appreciate that we make iterations as a data industry that optimize for reducing user friction first, even at the cost of some short-term logical complexity.
What I have been working on: Modal
I’ve been excited to hear about what Erik’s been cooking up!
if I had to condense how to make engineers productive into one thing, it would be something like: make the feedback loops fast […]
Data is sort of weird because you have to run things on production data to have these sort of feedback loops. Whether you're running SQL or doing ML, it's often pointless to do that on non-production data. This violates a holy wall for a lot of software engineers: the strict separation of local and prod.
This. I agree with Erik that you can’t take software engineering tools, platforms and practices and apply them 1 to 1 to data team workflows. We talk about adopting software engineering best practices a lot in this newsletter, but we also think very hard about how they need to evolve for data workflows for this exact reason.
Debugging data workflows is so much harder if you are trying to reproduce them locally on your machine. You end up spending more time figuring out how to match your environments than debugging the underlying problem. The issue is that production datasets are mutable by design and you have to reproduce both the state of the system (your control plane) and your business logic layer (your data plane) to accurately reproduce what's going wrong. It is far simpler to focus on debugging just one, your business logic layer, while keeping your control plane constant.
Modal is enabling this by letting you write code against infrastructure and a compiler in the cloud. dbt Cloud does this by letting you write transformation code that defines your business logic against the same infrastructure and the same version of production data. Just imagine a world where more tools in the modern stack become equally aware of and intentional about making this distinction!
What’s most interesting to me about Modal is that it’s not built on an existing Docker/Kubernetes ecosystem. Instead, the team chose to start from scratch to be properly data cloud native. I think this is smart (and hard) because Docker/Kubernetes are primitives that emerged in a world that aimed to simply abstract your physical server machine, not tear its architecture down entirely.
Watching this one with a lot of interest!
And finally, since you made it all the way to the end, some Ops real talk from Nate:
That’s it for this week!