A Dispatch from the Jagged Frontier of Analytics Engineering
What works today and what doesn't.
We’ve been very big picture over the last several roundups. Agents will be the primary consumers of data, it’s time to move up the stack, etc etc etc. That’s all still true, but I think it’s also important right now to stay extremely grounded in the reality of the current moment.
So today I’m going to go over, firsthand, the areas where coding agents currently succeed and fail in complex analytics engineering tasks. I’ll be walking through a hands-on case study, but if you’re interested in how we’re doing this from a benchmarking perspective I recommend reading Benn’s recent post on building a better data agent benchmark.
It all started when I got on a call with Benoit earlier this week to talk through some modeling work he’s been doing as part of a deep dive on how to improve the dbt MCP server. I asked Benoit to spend some time with our event data and work on sessionizing MCP usage as well as categorizing the different sessions, so that we could get a sense of what the patterns of engagement are.
This was a thorny task and it hit right on the edge of “a seasoned analytics engineer is sped up by the agents quite a bit, but the agents certainly could not do it out of the box and if done naively could have actually caused some real issues”.
It’s a great illustration of Ethan Mollick’s jagged frontier as applied to analytics engineering: the idea that LLMs have uneven distributions of how good they are at various tasks and that it’s nonobvious where the demarcation line is. The output looks confident and plausible on both sides of it, which means you can’t tell the model was on the wrong side of the line until you go check. That general shape is true for basically every field people are using LLMs in right now.
What is interesting for AE work is that the specific shape of the frontier looks different from what it looks like in, say, software engineering. The peaks sit in different places, the troughs sit in different places, and the ways you work around the gaps require data intuition.
The shortest version of what Benoit and I saw on our call: today’s models are very good at the standard nuts and bolts of analytics engineering work. Yet they break down when there are unknown unknowns in the source data, when the solution is outside their current frame of reference, and when performing complex operations across a large DAG.
What works great today
The strongest thing models do in AE work right now is the kind of modeling task that fills the middle of most of our weeks. Benoit’s current project is a good illustration.
A little context. Benoit has been digging into the data we have available to us from the dbt MCP server, trying to figure out how people are actually using it. What flows show up most often, which ones are long, where tool calls cluster, where users drop off. One of the first things you need for an analysis like that is sessionization: grouping the individual tool calls into units of related activity so you can reason about them as coherent sessions instead of as an undifferentiated stream of events. The raw data doesn’t ship with that grouping, so you have to derive it from the timing of the calls, setting a threshold on how much idle time counts as the break between one session and the next.
Sessionization is a very standard dbt modeling exercise. You order events per user, compute the time gap between consecutive events, flag a gap above your threshold as a session boundary, and then propagate a session ID down the ordered list. It needs window functions, some lag and lead logic, and a handful of CTEs stacked on each other. It isn’t complicated, but it is fiddly to write, and getting one CTE subtly wrong cascades into garbage for the rest of the model.
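To make that shape concrete, here’s a minimal sketch of the pattern. This is not Benoit’s actual model: the model name, columns, and the 30-minute idle threshold are all placeholders, and the syntax assumes a Snowflake-style warehouse.

```sql
with ordered_calls as (

    select
        user_id,
        called_at,
        -- minutes since this user's previous tool call (null for their first event)
        datediff(
            'minute',
            lag(called_at) over (partition by user_id order by called_at),
            called_at
        ) as minutes_since_last_call

    from {{ ref('stg_mcp_tool_calls') }}

),

flagged as (

    select
        *,
        -- a gap above the idle threshold (or a user's first event) starts a new session
        case
            when minutes_since_last_call is null
              or minutes_since_last_call > 30 then 1
            else 0
        end as is_new_session

    from ordered_calls

),

sessionized as (

    select
        *,
        -- a running count of boundaries propagates a session number down the ordered list
        sum(is_new_session) over (
            partition by user_id
            order by called_at
            rows between unbounded preceding and current row
        ) as session_number

    from flagged

)

select
    *,
    user_id || '-' || session_number as session_id
from sessionized
```

The specifics vary by warehouse and by what you decide counts as a session, but the skeleton, a lag, a boundary flag, and a running sum, is the whole trick.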
Benoit’s agent wrote the whole thing in one shot. In his words, the LLM “one-shotted it with the five CTEs needed to do the lead, the lag, add the session ID, and everything.” The SQL was clean, the logic held up, and the result was a usable model he could build on.
The pattern here goes well beyond sessionization. The same model that one-shots a sessionization build will also one-shot a slowly changing dimension table, a deduplication model, a rolling-window aggregation, or a reasonable first cut of staging models from a well-described source. The tasks that sit in the middle of an AE’s day, the ones where you already know what you want to build and the work is mostly getting through the fiddly SQL of actually building it, are, in April 2026, within range of the agent doing most of the keystroke work for you.
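To pick one of those off the list: on warehouses that support `qualify`, a deduplication model is often little more than the sketch below, with the relation and key columns invented purely for illustration.

```sql
-- keep only the most recent row per natural key; model and column names are illustrative
select *
from {{ ref('stg_orders') }}
qualify row_number() over (
    partition by order_id
    order by updated_at desc
) = 1
```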
What changes when that becomes true is surprisingly hard to appreciate until you’ve spent a week in it. For better or for worse, you start writing models you wouldn’t have bothered with before. You try three versions of a transformation instead of committing to the first one that comes to mind. The middle of your workday starts to feel exploratory in a way it hasn’t in a long time.
And the modeling speedup is only part of the shift. Once you have an agent that knows the shape of your dbt project, a bunch of adjacent work collapses in the same direction. Data profiling, the kind where you’d normally write a quick throwaway query to check nulls or look at a distribution, drops from a three-minute exercise to a ten-second one. Benoit: “I can just ask it, hey, this column looks weird, check how many nulls there is in this stuff, and if there is a date from which it started to get null or not. I would have written this query in three minutes. Well, it does it in ten seconds.”
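The query in question is nothing exotic. Something like the sketch below, where the model and column names are stand-ins, is roughly what the agent writes and runs in those ten seconds.

```sql
-- daily null rate for a suspicious column; model and column names are placeholders
select
    date_trunc('day', occurred_at) as event_date,
    count(*) as total_rows,
    count(*) - count(account_id) as null_rows,
    round(100.0 * (count(*) - count(account_id)) / count(*), 2) as pct_null
from {{ ref('stg_mcp_events') }}
group by 1
order by 1
```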
Schema exploration becomes a conversation rather than a query-writing exercise, and cross-referencing against other models in the project, if you ask, becomes something the agent just does. At one point on our call Benoit mentioned to the agent that a mystery ID might be a service token rather than a user ID, and it went and found the completely separate service tokens model in the project, confirmed the hypothesis, and traced it back to the account making the call. None of that is magic, but all of it changes the feel of doing the work.
Beyond the edges of the jagged frontier
Now the other side of the frontier.
The thing that tied together the areas where the agent struggled during Benoit’s work was that the task required something the agent couldn’t get from the code or the docs in front of it. Sometimes that was company-specific knowledge about how our systems are actually shaped, which doesn’t live in the dbt repo at all. Sometimes it was reconciling data across sources that had grown up with slightly different assumptions and nobody had ever forced into agreement. Sometimes it was the kind of judgment call about what a piece of data actually represents that a thoughtful human analyst would pause on, and that the agent, right now, doesn’t.
The example Benoit and I kept coming back to is a user ID story. We have an internal event pipeline that ingests events from tools like the dbt MCP server and lands them in our warehouse. This service emits events tagged with a user ID. Our analytics layer, on the other hand, derives a different user ID by hashing the raw ID with tenant context, because we run a multi-tenant deployment and the raw IDs aren’t globally unique across tenants. Two systems, same field name, different meaning.
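Here’s a made-up sketch of the shape of the problem (not our actual hashing scheme, and the names are invented): the raw events carry the provider’s ID as-is, while the analytics layer derives a tenant-scoped surrogate from it, so a naive join on `user_id` across the two quietly matches values that were never meant to match.

```sql
-- illustrative only: two "user_id" columns with the same name but different derivations
select
    tenant_id,
    user_id                          as raw_user_id,       -- what the event pipeline emits
    md5(tenant_id || ':' || user_id) as analytics_user_id  -- what the analytics layer keys on
from {{ ref('stg_raw_events') }}
```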
The agent didn’t know this, and it couldn’t have. Nothing in the code said so, and nothing in the docs said so either. The knowledge lived with engineers who had worked on the APIs, or with people on the account and support teams who had seen the collisions show up in production. Tacit knowledge, the kind that is invisible until it breaks.
The SQL the agent wrote was clean, the joins were wrong, and the tests all passed. The bug only surfaced when Benoit asked the agent to look at the distribution of user IDs in the data and a small number of IDs turned out to be appearing with wildly anomalous frequency.
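That kind of distribution check is cheap insurance and worth making reflexive. A minimal version, with names made up again, is just:

```sql
-- frequency of each user ID; a few IDs with wildly outsized counts is the tell
select
    user_id,
    count(*) as event_count
from {{ ref('stg_mcp_events') }}
group by 1
order by event_count desc
limit 20
```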
The lesson here goes well past the specific bug. Benoit put it cleanly on the call: “the LLM might take some assumptions that, if I had written the code myself, I would have thought about. Maybe I would have stopped and said, okay, I need to check this. But when the LLM wrote it, it looked fine.” The SQL being correct is just one part of the data being correct. Agents are extremely good at the first axis and variable at the second, and the gap between “the code runs” and “the data is right” is where data incidents tend to live. (Of course, if you want guaranteed deterministic answers from an LLM, there is a way to do that!)
Some other things to watch out for when doing data work with today’s models:
Cost awareness. Benoit’s agent ran a few LLM-powered model functions without asking him, burning a meaningful number of tokens in the process. It had no sense that those particular calls were expensive, and no instinct to check before running them. That’s small inside a single session and not small in aggregate across a team over a quarter.
Tool-setup friction. Benoit needed to check something in Datadog. If he’d had the Datadog MCP server connected, the agent could have done the search for him, but connecting it would have taken maybe ten minutes and doing the search manually took five, so he did it manually. The local cost-benefit math on proper infrastructure setup tilts against you every time, and six months later you realize you never built any of the connective tissue you were supposed to build. I think a lot of us are quietly making that trade right now.
Shifting left, shifting right
Ok so here’s where we land as of today:
Writing models is easier and faster, perhaps much faster (Benoit estimates a 2x to 3x speedup, although as always, repeat after me: “self-reported productivity speedups are often unreliable and need to be verified with other mechanisms”). But it’s hard to believe a speedup isn’t there.
So what does an enterprising AE do?
According to Benoit, you should anticipate shifting leftwards or rightwards in the DAG: “the left part of the DAG requires more knowledge of the company’s data systems. The right part of the DAG requires more understanding of the business, and what should ARR be, and how we should consider a user active or not.”
This is not to say that you abandon the core analytics engineering work! That’s incredibly valuable work that is getting easier to do and your first priority should be doing the things that are easy and high leverage right now.
It is to say that you should know where the likely challenges will land as you start to get the “easy” stuff squared away. And of course, the easy stuff still contains plenty of complexity that we’ll write about in the future (reducing model bloat, ensuring your agents follow AE best practices, spend management, etc.).
So what is Benoit looking for next to make complex analytics engineering work easier?
In six months he’d like to see agents that:
…check their own assumptions about data—or surface them to a human!—before acting on them.
…are aware of which of their tool calls are expensive and ask before making them.
…have lower friction connecting to all of your data across your stack, since the more information they have, the more useful they are.
A lot of things moving very quickly! Would love to hear how this compares to hands on reports from all of you. Thanks for reading.


