Why the modern data stack matters in the AI age
I'm at the Modern Data Stack. I'm at the Intelligence Explosion. I'm at the combination Modern Data Stack Intelligence Explosion.
There’s a mystery that’s been rattling around in my head.
In 2021 through early 2023, the data space and specifically “The Modern Data Stack” was arguably the highest energy, most dynamic area of the tech sector, and certainly dominated the discourse.
In late 2022, the ChatGPT moment happened and all of the oxygen was, immediately, sucked up by AI.
The mystery on my mind has been - what is the connection here? Is it just an accident of history that right before we got AI systems that actually work at scale, we were focusing on centralizing and modeling our data?
It’s completely possible that the answer is yes. It’s felt that way at times, but over the past year and particularly over the past few months, as AI systems move outside of narrow chat windows and become more integrated into our workflows, two things are becoming clear:
Complex AI workflows are going to draw many of the learnings from data engineering
LLMs will need access to the data produced by your analytics workflows in order to be truly useful for many use cases
Both of these facts feel relatively obvious now and immediately valuable in the near term, compared to other ideas that felt more … grasp-at-strawsy like “data teams will be the keepers of your organizational data to custom fine tune models for your business”.
So what changed? Why do these feel practical and useful now compared to even a year ago?
Because we’re actually starting to roll LLM systems out in the real world - and quickly. Even in their nascent state it’s clear that this is not hype, that there’s real value here, today. But it’s also becoming clear that the problems and lessons that brought us to the modern data stack haven’t gone away in this brave new world - although the ways those problems are solved and the systems they are being solved within may change dramatically.
At this point, you are probably familiar with the frustrating experience of going to a new website and being presented with some sort of chatbot interface and not being entirely certain what questions you can ask it.
The chatbot probably works extremely well for the set of context it has access to. But you probably don’t know what exactly it has access to, and that underlying guessing game means that what should be (and often are!) incredibly useful interfaces end up feeling slapped on and piecemeal.
The problem is largely not the models, which have gotten extremely good at most questions you might ask of them. The problem is that they often literally don’t have the right information to give you the correct answer.
It does not matter how smart a model is, or how good it is at in-context learning, if the only answer, or a path to the answer, can’t be added to its context because it’s locked away in a single Reddit post from last week, a proprietary document, or your data warehouse. The good news is that we’re quickly moving towards a world where that context isn’t locked up anymore and there are protocols and standards for accessing it.
But what context should be provided? How do you know it’s right? Is it accurate? Who has access to it?
These are all questions we’re going to have to learn to answer in our AI systems. And it’s gonna be a doozy.
It is extremely non-trivial to feed the right context to LLMs at the right time
Probably the most interesting thing about adding new knowledge sources to LLM workflows is the way that LLMs magnify both the best and worst aspects of working with a particular system, almost to a fault. The time to answer is quicker, you can pull threads together - you’re always feeling movement. But any cracks in the system you’re using to feed the model context become immediately obvious.
We talk about “context” like it’s a monolith, but the underlying context we’re feeding is ultimately going to look something like “all of the mechanisms that humans have created for storing and conveying information”.
Let’s look at how this is going in practice - the good and the challenges:
LLMs + Internet search:
What it is: Integrate public web data into LLM queries
Why it’s great: This massively broadens the ability of LLMs to pull in information. This is really useful for when you require specific, granular information pulled from the real world (I like to use this for finding restaurants).
The challenges: SEO-bait works extremely well on LLMs. Under the hood, it’s still performing some sort of traditional web search - the same tools and tactics that people use to climb to the top of Google searches work on an LLM. Try an experiment right now - go do a Google search for “Best Pillow 2025”. Do you have any reasonable way of breaking down the answers and getting to ground truth?1 If it’s hard for you, it’s going to be hard for the LLM.
LLMs + Internal company documentation:
What it is: Search over internal, unstructured data via tools like NotionAI and Slack
Why it’s great: I want to be incredibly clear - I love NotionAI. It is the single closest I’ve ever felt to being able to fully wrap my head around a complex organization of hundreds of people and be able to learn what other teams are working on. It fundamentally broadens my aperture in knowing what is going on at dbt Labs and why - from answering quick policy questions to making sure I can keep track of the latest company objectives.
The challenges: Unless you have incredibly strong document hygiene, you’re going to find messy and conflicting information. A simple question like “when is Coalesce 2025” can sometimes end up with 3 results - maybe we were initially thinking a different date and that date still lives in a document somewhere. Maybe someone just accidentally typed in the wrong date and left it there. The models need, as part of their context, not just all of your documentation, but signals as to which documentation is up to date, correct, and organizationally approved.
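One lightweight way to convey those signals is to attach freshness and approval metadata to each document and use it to decide what gets retrieved into the model’s context at all. This is just a sketch - the `Doc` shape, the 180-day cutoff, and the idea of an “approved” flag are all hypothetical illustrations, not any particular product’s API:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Doc:
    title: str
    body: str
    updated_at: datetime
    approved: bool  # hypothetical flag: marked canonical by the owning team


def rank_for_context(docs, now, max_age_days=180):
    """Prefer approved, recently updated docs when assembling LLM context.

    Stale, unapproved docs are dropped entirely rather than risk feeding
    the model a date someone typed wrong two years ago.
    """
    fresh = [
        d for d in docs
        if d.approved or (now - d.updated_at) <= timedelta(days=max_age_days)
    ]
    # Approved docs sort first; within each group, freshest first.
    return sorted(
        fresh,
        key=lambda d: (d.approved, -(now - d.updated_at).days),
        reverse=True,
    )
```

The point isn’t this particular ranking function - it’s that “which documentation should the model trust” becomes an explicit, inspectable policy rather than whatever the retriever happened to surface.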
LLMs + Metrics and structured data:
What it is: All of the data that lives in your data platforms, your key metrics, customer information, business entities
Why it’s great: I don’t want to be too dramatic here but being able to analyze a complex dataset using conversational analytics on top of a trustworthy interface feels like magic. Especially for data that you know well, it honestly feels like you are getting superpowers, with the answer to any question you might have available at the tip of your fingers.
The challenges: What could possibly go wrong giving an LLM access to your data warehouse? Text-to-SQL is good and getting better, and I have no doubt that we will get to the point where, given a well-structured problem and sufficient context about the underlying data, LLMs are going to be very successful at getting an answer that is reasonable and correct. But will it:
Be consistent across an organization? Will it be a single source of truth, based on vetted and well-understood business concepts? Or will it be a very clever, vibe-coded 1,200-line SQL script that no one is realistically ever going to read?
Can you put it in the hands of your executives and have them trust the output? Not just as interesting anecdata - but as something that they can make decisions and take action off of?
Is it going to be governed? Will it know who the end user behind the query is, what data they should have access to and what they shouldn’t?
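One common guardrail for all three questions is to route natural-language questions through a registry of vetted metric definitions, with a per-user access check, rather than letting the model write free-form SQL against raw tables. A minimal sketch - the registry contents, role names, and function names here are invented for illustration:

```python
# Hypothetical governed metric registry: the LLM selects a vetted metric
# by name instead of generating arbitrary SQL, and access is checked
# against the requesting user's roles before anything runs.
VETTED_METRICS = {
    "revenue": {
        "sql": (
            "SELECT date_trunc('month', order_date) AS month, "
            "SUM(amount) AS revenue FROM orders GROUP BY 1"
        ),
        "allowed_roles": {"finance", "executive"},
    },
}


def compile_metric_query(metric_name: str, user_roles: set) -> str:
    """Return the single, organization-approved SQL for a metric, or refuse."""
    metric = VETTED_METRICS.get(metric_name)
    if metric is None:
        # No vetted definition: better to refuse than to vibe-code one.
        raise KeyError(f"No vetted metric named {metric_name!r}")
    if not (user_roles & metric["allowed_roles"]):
        raise PermissionError(f"User may not query {metric_name!r}")
    return metric["sql"]
```

Every consumer of “revenue” now gets the same definition, and the governance question - who asked, and what are they allowed to see - is answered before the query ever reaches the warehouse.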
Each of these information sources adds impressive and interesting capabilities to the underlying power of LLMs. When combined in a single interface, there is a combinatorial explosion of usefulness2 as the different information sources each unlock new capabilities.
Imagine you are the CEO of a restaurant chain planning whether to expand into a new territory. With access to these three information sources, you could prepare a deep research style report that:
Understands the business context and strategy for a potential expansion based on your internal company documentation
Has the ability to query and understand the actual data and financial metrics available to you via your data platform
Searches the web for macro-level data and benchmarks about the location you’re planning to expand into
Sounds incredible, right? This is totally doable, today - although it requires a bit of duct-taping systems together to make it work.
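At a high level, the duct-taping amounts to giving one agent all three information sources as tools and assembling their outputs into the report’s context. The sketch below stubs out each tool - the function names and return values are made up for illustration; a real system would wire them to a docs search, a warehouse or semantic layer, and a web search API:

```python
def search_docs(query: str) -> str:
    """Stub for internal documentation search."""
    return "internal strategy notes on territory expansion"


def query_warehouse(metric: str) -> float:
    """Stub for a governed warehouse/metric query."""
    return 1_250_000.0  # e.g. trailing-12-month revenue


def search_web(query: str) -> str:
    """Stub for public web search."""
    return "macro benchmarks for the target territory"


def expansion_report(territory: str) -> dict:
    """Gather context from all three sources before prompting the model."""
    return {
        "strategy": search_docs(f"expansion strategy {territory}"),
        "revenue": query_warehouse("revenue"),
        "benchmarks": search_web(f"restaurant market benchmarks {territory}"),
    }
```

The interesting engineering isn’t in this dispatch - it’s in everything each stub hides: the freshness signals, the governance checks, and the provenance of what comes back.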
But it also sounds … daunting. Because each of the failure modes described above can cascade throughout this system. What if the model YOLOs a revenue definition that’s reasonable, but based on information from an out-of-date company document? What if it pulls benchmarked financial data from the web that turns out to be wrong? The real world contains a whole lot of complexity, and our systems need to be designed to capture the benefits here while providing the appropriate guardrails to establish some sort of ground truth.
So why now
I want to return now to the question that I posed at the start of this - why were we all convinced that building systems for centralizing and managing your company data at scale was the right problem to solve directly before LLMs started to soar?
The answer lies in recognizing that what seemed like an accident of timing was actually a foundation being laid. The Modern Data Stack is not just about better dashboards - it’s about creating standardized and reliable workflows and interfaces across your entire data ecosystem that can power increasingly sophisticated use cases, at scale. It turns out this is just as necessary for AI as it is for humans.
We built the modern data stack to address fragmentation, improve data governance, and ensure consistent, reliable data. What we didn't fully realize at the time was that this was an essential piece of context for LLM applications as well.3
We finally have both data foundations and the AI models to make genuinely useful, reliable AI-driven and data-enriched workflows possible. The explosion of interest in AI didn't displace the need for the modern data stack—it just took some time for these systems to begin to speak to each other.4
Moving forward, we’re going to increasingly see the role of the data practitioner as providing governed, trustworthy data for LLMs and, potentially, using many of those same systems to enable safe, reliable deployment of AI systems.
I feel huge opportunity in this area. I also feel a lot of responsibility for us, as a community, to get this right. These systems are being experimented with, and in some cases deployed today and there is real institutional heft behind their rollout. We have the tools and the (forgive the pun) agency to make an impact on how that happens.
There is a tremendous amount of brainpower that reads this newsletter - people orchestrating the most complex data flows at the largest organizations on the planet and building the tooling that will enable it. There are a lot of unknown unknowns in terms of how we build the bridge from LLMs to our structured data, but I believe we’ve got the right set of humans in place here to begin meaningfully answering this question.
Let’s get after it. Want to talk about any of this? You know where to find me.
1. Ok, for this one ground truth is “it’s a pillow, stop thinking about it so hard” but you get the point
2. And danger!
3. Actually we did realize it, it just took some time to connect the threads
4. After we ran our first experiment showing the value of a Semantic Layer in natural language questioning - Benn Stancil raised the question of whether we were likely to get “bitter lessoned” as models improved and cut out the need for intermediary interfaces. It’s an important question and one we’ll dive into more in the future. But even systems of arbitrary intelligence need tools! It doesn’t matter how smart an LLM is, if you want an LLM to unload your dishwasher, you’re going to have to put it in a robot. Will the same hold true for the three methods of gathering information listed above? Time will tell, but signs point to yes in the near term.