
How AI will Disrupt Data Engineering As We Know It
It will be hard to compare data engineering in 2024 and data engineering in 2028 and say “those are the same things.”
Last time I wrote I dove into a bunch of AI advancements that have happened over the past year. Reasoning models, chain of thought, inference-time compute, etc. And there’s more to explore there and I need to return to that series.
But for this week’s issue I want to pause on that and talk about AI from a different perspective. I want to think, as rationally as we can about an uncertain future, about how the job of the data engineer will change over the coming one to three years as a result of AI.
I am quite confident these changes will be massive. I think the word disrupt is not at all hyperbole—I think it will be hard to compare data engineering in 2024 and data engineering in 2028 and say “those are the same things.”
It just turns out that many of the tasks that data engineers do every day are tasks where AI can provide tremendous leverage. I don’t know what the efficiency gain will be (20%? 50%? 80%?), but I think it’s totally possible that it’s on the higher end of that range.
I think that will be both good for data engineers and good for the companies they work for. Data engineers will have more work to do than ever (Jevons paradox at work), but it will be more strategic, add more value to companies, and will likely see them get raises. Companies will get the higher-functioning, higher-ROI, more accessible data systems that have always seemed out of reach.
In this post I want to look at the specific tasks that data engineers spend their time on, and consider how addressable each of them is with AI.
Let’s dive in.
The Tasks of a Data Engineer
AI doesn’t replace jobs, it automates tasks. So let’s look at the tasks that someone leveled as a Senior Data Engineer most commonly spends time on today. Of course, it should go without saying, YMMV: there is no single canonical job description for a data engineer. But I think we can still get close enough to reason about.
What does a Senior Data Engineer spend their time on?
Create technical artifacts
Landing new data. Building and maintaining automated data ingestion pipelines.
Transforming raw data into bronze, then silver, then gold layers. Includes authoring brand new pipelines as well as refactoring existing pipelines to handle new business requirements.
Defining metrics on top of transformed data.
Writing tests and documentation.
Monitoring costs of data infrastructure and refactoring code to optimize performance characteristics.
Reviewing pull requests from peers.
Monitoring production jobs and declaring incidents related to either pipeline failures or observability / quality issues. Investigating and resolving those incidents.
Liaise with stakeholders and peers
Answering questions about currently-available data assets like “which data set should I use?” and “can I trust this?”
Collaboratively designing changes to existing data assets to accommodate new requirements. Conversations like “what are the edge cases I need to know about when calculating cost of goods sold?”.
Stakeholder enablement and education.
Designing the overall architecture of the DAG, including modularization, team boundaries and ownership, modeling best practices, etc.
I’m sure you could find some other things to put on these lists, but I feel like they’re pretty representative. Feel free to tell me what I’m forgetting.
The Role of Frameworks and Tooling in an AI-centric World
Many of the above tasks are already doable with AI. And I want to talk more about that. But before I get there, it’s important to talk about frameworks, and how important frameworks are to an AI-centric world.
Claude 3.7 will write you almost any kind of code you could want. You can absolutely build a pipeline from the ground up, building ingestion, transformation, testing, etc. in Lisp. In Assembly. In the style of Guido van Rossum. Whatever. You could even imagine a world in which you had 1,000 distinct pipelines and every one was written in a different language or framework or set of conventions. All reading from and writing to a shared corpus of tabular data.
But: just because it is now conceivable to create such a codebase, is it a good idea?
The answer is: no. Obviously not. Just as a team of humans would find it impossible to maintain such a Frankenstein, the heterogeneity would make it intractable for LLMs as well.
This intuition pump is helpful to get us to an important conclusion: AI will be more effective as an accelerant when:
a code base has fewer lines of code (less room for error)
a code base is more consistent rather than less: in languages, in coding conventions, in design
a code base uses consistent CI/CD and other developer tooling
a code base uses consistent and well-documented logging / observability
a code base uses well-documented best practices also employed by a large community of users.
In general: code bases that are more concise, more homogeneous, and use standard tools that are well-documented in the model training data (i.e. the public internet) will be more comprehensible by AI systems.
One of the best ways to make all of these things true at the same time is to use frameworks and open standards. Claude 3.7 knows how to reliably build Airbyte ingestion pipelines because the framework is well documented and there are a lot of published examples. It’s also fantastic at writing dbt code for the same reasons. If you’re able to give it an environment where it can test its own code and validate downstream models as a part of its CoT, code quality goes up even further. Standardized frameworks also emit well-understood error messages, which pushes code quality up further still.
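To make that concrete, here’s a minimal sketch of what that kind of self-correcting loop could look like: generate dbt SQL, let dbt build surface the errors, and feed those errors back. The ask_model() helper is a hypothetical stand-in for whatever LLM API you use; only the dbt invocation is real.

    import subprocess
    from pathlib import Path

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for your LLM API of choice."""
        raise NotImplementedError

    def generate_and_validate(task: str, model_path: str, max_attempts: int = 3) -> bool:
        """Ask a model for dbt SQL, then let it see its own build errors and retry."""
        model_name = Path(model_path).stem  # assumes a configured dbt project
        prompt = task
        for _ in range(max_attempts):
            Path(model_path).write_text(ask_model(prompt))

            # Build this model plus everything downstream of it, tests included.
            result = subprocess.run(
                ["dbt", "build", "--select", f"{model_name}+"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return True

            # Feed the failure back so the next attempt can correct itself.
            prompt = f"{task}\n\nYour last attempt failed:\n{result.stdout}\n\nFix it."
        return False

The point isn’t this particular loop; it’s that a well-documented framework gives the model a crisp, machine-checkable definition of “done.”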
In short: good frameworks, tooling, and standards are just as important for AI as they are for humans. And the wonderful thing about AI is: it is infinitely adaptable to whatever frameworks, tooling, and standards you want to use. No learning curves. Finally, the promise of a consistent code base.
How many of these tasks are already doable?
Got it, frameworks are powerful in an AI world. Now let’s look at the individual tasks that data engineers spend time on and try to figure out how tractable they are.
In answering this question I am not going to assume massive improvements in model capability. Even with modest improvements I believe all of this will become true. What is fundamentally needed is productization of currently-available models directed at the specific needs of data engineers, not the invention of new frontier tech.
Creating Technical Artifacts
Ingestion pipelines: With nothing but Cursor you can already vibe code your way to a working ingestion pipeline from basically any data source with a publicly-available API. You can already add pagination and solve edge cases and inject instrumentation. It’s unclear, though, if this is actually what is needed. I still fundamentally don’t think most data movement code should be written and maintained within the walls of an individual company. AI or no, I still want to hire a vendor or support a community project. Data engineers shouldn’t be spending a lot of time on this problem today and likely shouldn’t be in the future either. When a custom build is required, AI can already do it well; try it yourself in Cursor today.
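For a sense of how mechanical this work is, here’s a minimal sketch of the kind of paginated ingestion loop an assistant will produce on the first try. The endpoint, auth scheme, pagination parameters, and response shape below are all invented for illustration; landing the records in your warehouse is left to whatever stack you already run.

    import requests

    API_URL = "https://api.example.com/v1/orders"  # hypothetical endpoint
    API_KEY = "..."                                # supply real credentials

    def fetch_all_orders(page_size: int = 100):
        """Page through the (invented) API until an empty page comes back."""
        page = 1
        while True:
            resp = requests.get(
                API_URL,
                params={"page": page, "per_page": page_size},
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=30,
            )
            resp.raise_for_status()
            records = resp.json().get("results", [])  # assumed response shape
            if not records:
                break
            yield from records
            page += 1

    if __name__ == "__main__":
        for record in fetch_all_orders():
            print(record)  # in practice, write these to object storage or a warehouse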
Authoring new data transformation assets: If you’re using dbt, data transformation is about to become heavily AI-enabled. Whether you’re building models, writing documentation and tests, or defining metrics, this is coming to you very soon. We demoed some of these capabilities at Coalesce and will have more to share on Wednesday at our dbt Developer Day. While we are certainly still in the early stages of where we ultimately want to get to, dbt Copilot is already very good at all of these authoring tasks and there is a very clear path to getting even better. Nick Schrock, in one of his best posts ever, called dbt and tools like it medium-code frameworks. It turns out that medium-code frameworks are extremely well-suited for AI. Having personally used dbt Copilot, I anticipate that the time required for data engineers to author new transformation code will drop very significantly.
Multi-file refactoring: One thing that Cursor now does super-well is stage multi-file edits as a result of a single prompt. You could imagine a similar prompt in dbt: “refactor code in these two parts of the DAG to minimize duplication; combine models where appropriate.” Or: “A new field was added in this data source. Please pull that field all the way through the DAG into [X] final model.” These types of refactoring tasks are low-creativity but highly time-intensive. Implementing them is product work, not research. The opportunity to get a handle on tech debt with tooling like this makes me giddy.
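As a rough sketch of how you might scope that second prompt today: dbt ls can resolve the lineage for you, and the downstream model files become the context for a multi-file edit. ask_model() is again a hypothetical LLM wrapper; real tooling would also stage the diffs and run CI.

    import subprocess
    from pathlib import Path

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for your LLM API of choice."""
        raise NotImplementedError

    def propagate_field(source: str, field: str, target_model: str) -> str:
        """Collect every model downstream of a source and ask for staged edits."""
        # Run from the dbt project root so the listed relative paths resolve.
        listing = subprocess.run(
            ["dbt", "ls", "--resource-type", "model",
             "--select", f"source:{source}+", "--output", "path"],
            capture_output=True, text=True, check=True,
        )
        files = "\n\n".join(
            f"-- {path}\n{Path(path).read_text()}"
            for path in listing.stdout.splitlines() if path.strip()
        )
        prompt = (
            f"The column {field} was added to source {source}. "
            f"Propagate it through these models so it lands in {target_model}. "
            f"Return the full revised contents of each file you change.\n\n{files}"
        )
        return ask_model(prompt)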
Automated incident resolution: Imagine providing the entire log output of a dbt run and the associated project code into a context window and getting back a diagnosis and proposed resolution. While we haven’t productized this experience yet, it’s not hard to experiment with this yourself hackathon-style. Imagine a world in which, following a pipeline failure, a full PR was queued up and run through CI, with a full report waiting for you, ready for you to just hit the merge button. We should anticipate this type of experience for data engineers in the not-too-distant future. How much time are you currently spending on break/fix? Slash it significantly.
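The hackathon version really is just a few dozen lines: run dbt, and if it fails, hand the log and the project code to a model. ask_model() is the same hypothetical wrapper as in the earlier sketches, and the context packing here is deliberately naive.

    import subprocess
    from pathlib import Path

    def ask_model(prompt: str) -> str:
        """Hypothetical stand-in for your LLM API of choice."""
        raise NotImplementedError

    def diagnose_failed_run(project_dir: str) -> str:
        """Run dbt; on failure, ask a model for a diagnosis and a proposed fix."""
        result = subprocess.run(
            ["dbt", "run"], cwd=project_dir, capture_output=True, text=True
        )
        if result.returncode == 0:
            return "Run succeeded; nothing to diagnose."

        # Naive context packing: the full log plus every model file
        # (assumes the default models/ directory).
        models = sorted(Path(project_dir, "models").rglob("*.sql"))
        code = "\n\n".join(f"-- {p}\n{p.read_text()}" for p in models)
        prompt = (
            "This dbt run failed. Log output:\n"
            f"{result.stdout}\n\n"
            "Project code:\n"
            f"{code}\n\n"
            "Diagnose the root cause and propose a specific code change."
        )
        return ask_model(prompt)

Turning that diagnosis into a queued-up PR that has already passed CI is the productization step.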
I’m going to pause there because I’m at risk of boring you. Suffice it to say that I truly believe that a) much data engineering work has already been framework-ized, and b) AI will now make creation of, iteration on, and maintenance of these technical artifacts far more efficient. And for the aspects of data engineering that are not yet framework-ized (dbt or otherwise), there will be tremendous gravity towards pulling them into a framework because of the leverage that these types of high-quality AI experiences will provide.
Liaising with stakeholders and peers
There are countless people throughout the business who use data as a core part of their jobs, and data engineers are constantly fielding questions from them. I won’t re-list them all here, but if you’re a data engineer you know the drill. Forever, the hope of “self-service” has been that these data users would not need to lean on data engineers in this way; these interactions inject friction and slowness that neither side wants.
That fully actualized self-service has never materialized, and the status quo has been frustratingly persistent. But I’m optimistic that we have more of a path today than ever.
The easiest thing for any technology vendor to do at the very onset of the AI era was to take all of the domain-specific context it had and surface it to users in a chat interface. And we did the same thing. It was (and is) quite good: it does a great job of letting users ask business questions and answering them with semantic-layer-governed responses.
The problem with this approach is that users don’t actually want to interact with dozens of chat interfaces. They don’t want to remember to go to a given tool to get one type of answer and another tool for another type of answer. There will not be 30 chat experiences all with different context. There will be one…or maybe just a few. But likely a single dominant one.
This is how aggregators work. You likely don’t use a bunch of different search engines—you probably just use one, and it is probably Google. This is how chat will go as well.
The difference is that Google could scrape the web and respond to all queries based on that knowledge. ChatGPT cannot know all of the information you want to ask it about (at least, not yet). That lack of business context is the problem.
That’s where a context protocol comes in. A context protocol—a somewhat new topic in the public AI conversation—is a standardized way for services to provide additional context to models via an open protocol. The most promising one today is called MCP, but whether or not MCP wins, the awareness/excitement/support for this idea has developed a ton of momentum and I am fairly convicted that something like this will become real and widely-supported.
There will be a large number of context providers (every source of valuable enterprise context) and a large number of context consumers (different products with AI capabilities). There is no way to create point-to-point integrations to facilitate this. A protocol will be needed if we are going to see the right type of advancements, and I think it will happen.
Imagine that your license to ChatGPT enterprise or Claude Desktop or whatever already came with a connection to all of the metadata about every piece of structured data you had access to. What was there, how trustworthy it was, how suitable it was for the analysis you were describing, etc. I think that, very quickly, you would find yourself asking questions of your friendly AI rather than shoulder-tapping your colleague in data engineering.
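Here’s a toy sketch of what such a context provider could look like using the MCP Python SDK. The catalog is a hard-coded stand-in for whatever metadata store you actually run, and the tool names are invented; a real implementation would read from your dbt artifacts, catalog, or governance tooling.

    # Assumes the `mcp` Python SDK (pip install mcp); everything else is illustrative.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("data-catalog")

    # Hard-coded stand-in for a real metadata store.
    CATALOG = {
        "fct_orders": {
            "description": "One row per order; gold layer.",
            "owner": "data-engineering",
            "freshness": "hourly",
            "trusted": True,
        },
        "stg_payments": {
            "description": "Staging model over the payments source.",
            "owner": "data-engineering",
            "freshness": "daily",
            "trusted": False,
        },
    }

    @mcp.tool()
    def list_datasets() -> list[str]:
        """List the datasets available to the current user."""
        return sorted(CATALOG)

    @mcp.tool()
    def describe_dataset(name: str) -> dict:
        """Return ownership, freshness, and trust metadata for a dataset."""
        return CATALOG.get(name, {"error": f"unknown dataset: {name}"})

    if __name__ == "__main__":
        mcp.run()  # stdio transport; point an MCP-aware client at this server

Any MCP-aware client could then answer “which data set should I use?” and “can I trust this?” directly from that metadata.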
That’s not to say that the existing relationship would go away, but I do think that this would represent a true reset of the working relationship between data engineers and downstream business stakeholders—one that both sides would benefit from.
Where does that leave us?
Over the past two years, critical innovations have been made in foundational AI technology. Chain of thought, reasoning models, inference-time compute, agentic workflows. These are the ingredients needed to build the AI-enabled data engineering future, and they are now here.
And open frameworks—from dbt to Spark to Airbyte to others—have become widely deployed. This makes it possible to create great framework-specific AI tooling, both by the commercial stewards of those frameworks (including us) and by any other vendor.
The commercial incentive to innovate here is high, and there could not be more attention on delivering these types of benefits within companies of all sizes. This is going to happen, and data engineering as a profession is never going to be the same.
So what? Time to get a new job? Data engineers are obsolete?
Hardly. Data engineering, one of the hottest jobs of the last decade, will stay hot. But practitioners will be pushed in one of three directions: towards the business domain, towards automation, or towards the underlying data platform.
Data Platform Engineers will become ever-more-important. They don’t spend their time building pipelines, but rather building and running the infrastructure that pipelines are built on. They are responsible for performance, quality, governance, uptime.
Automation Engineers will sit side-by-side with data teams, taking the insights coming out of data and building business automations around them. As a data leader recently told me: “I’m no longer in the business of insights. I’m in the business of creating action.”
Data Engineers who are primarily obsessed with business outcomes will have ample opportunity to act as enablement and support for the insight-generation process, from owning and supporting datasets to liaising with stakeholders. The value to the business won’t change, but the way the job is done will.
You’ll hear a lot more from us on Wednesday about how we’re making this future a reality for dbt users. I’m excited to disrupt the decade-long status quo and build something better.
- Tristan
Tristan, it sounds like we’re dealing with two competing realities:
- AI handling ingestion end-to-end,
- versus paying a SaaS vendor just in case.
We love automated maintenance, which is why we already did it without AI. And while Claude 3.5 was pretty good at building dlt pipelines, the latest Claude can even use our incremental logic, parallelism, etc. to build efficient pipelines.
We even generate dbt scaffolds programmatically and evolve schemas, so really any maintenance would happen in dbt directly. https://hub.getdbt.com/dlt-hub/
So given that building and maintenance are already near-free, what do you think the value will be of letting a vendor run the code AI writes for you?
I saw MCP mentioned on the Changelog podcast recently…interesting to think about it as the framework to drive aggregation/integration
Also, kind of love the term “medium code”. Low-code tools just aren’t cutting it, and it’s so frustrating not to be able to have AI generate something I can copy+paste into them