How to build a resilient DAG
Managing data complexity with orchestration that happens at the rhythm of your business
This week I’m thinking in public.
Our data team is preparing to invest in some much-needed complexity busting to get our SLI for key company metric freshness back to 🟢. There’s a 3-step framework I love that helps reason about DAG design in a way that supports both timely data SLAs and staying resilient to frequent changes in the business.
I was going to write it down in YAID (yet another internal document), but there’s no reason it can’t be public. In fact, it might even be better off as a blog post — it will force me to stay at a high level and avoid trying to fix the problem myself! ;)
Hope you find it useful!
But before we get into the big post, here are a few other cool things I read this week:
Why Pandas and Numpy return different values for standard deviation by Avi Chawla. This has tripped me up before when moving workflows from R to Python, and is a great reminder to read the docs to understand functions you’re using to develop insights!
Data team as a % of workforce by Mikkel Dengsøe is a very insightful look at the makeup of data teams at 100 companies across different regions and industries. The dbt Labs internal data team, for instance, sits well above the median for the US companies in the sample, which feels about right for a company that builds tools for data teams!
A really in-depth look at building a marketing attribution model from scratch with Snowplow Analytics by Randall Helms. Congrats on the new gig, Randall!
Part 2 of SeattleDataGuy’s State of Data Engineering is out, and dbt is in! In demand, that is, among the readership of the blog. There are some other cool insights in here about when organizations start adopting data catalogs, and the state of testing in the modern data stack.
How to build a resilient DAG
Managing data complexity with orchestration that happens at the rhythm of your business
The problem: the volume of data your DAG needs to process in a dbt job has grown to a point where your daily company metrics are no longer reliably landing in your warehouse on time for the business to use them.
The problem behind the problem: your DAG has grown organically alongside your business, and things that felt easy at a couple of dozen models start to feel challenging at hundreds of models.
Requirements:
continue to follow the best practice of using ref functions to allow dbt to automatically construct and re-run your DAG (i.e. no cheating with APIs and hard-coded orchestration steps! ideally, all of this works on a standard dbt Cloud instance); there’s a tiny example of what I mean right after this list
data freshness needs to be in step with the rhythm of your business. In other words, it should be available as soon as it needs to be used in a business decision, but no sooner (I’ll explain what this means in a second)
denormalization and other forms of data duplication are used sparingly, and only to maintain necessary historical context
no more than a day goes by before you’re alerted about business-critical data quality issues
when data gaps/mistakes/inaccuracies inevitably occur, the minimum amount of re-building is required to bring the entire knowledge graph back to 🟢.
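To be concrete about that first requirement: dependencies are declared inside each model with ref, and dbt works out the build order from there. The model names in this sketch are just placeholders:

```sql
-- models/fct_orders.sql (hypothetical model name)
-- Because of the ref() call below, dbt knows this model depends on
-- stg_orders and will always build it in the right order within the DAG.
select
    order_id,
    customer_id,
    ordered_at
from {{ ref('stg_orders') }}
```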
So, how do you build a resilient DAG? I’m going to help you do it in 3 steps:
Step #1: Identify the rhythm of your business
Before you tear apart your DAG, it’s helpful to do an exercise first that maps out how your business uses your data today. I promise it’s quicker than it sounds! There are a few guiding questions you can use to help structure your new knowledge map:
❓ What business scenarios require information delivered within minutes?
For us, this looks like:
tracking a usage metric during a new product or feature launch
end of quarter net new ARR reporting — there’s something very special about the entire company watching deals closing in near real time as the quarter comes to a close
watching the value of an uptime service level indicator when it is not meeting your service level objective.
All of these examples have one thing in common: they’re time-bound to a week or less. For instance, once a couple of days go by after a product launch or some other big milestone business event, the excitement usually dies down and reporting out daily data is often sufficient.
It turns out that most businesses don’t need to query highly granular low latency data (think up-to-the-minute insights) beyond the past 7 days or so. Why this is so useful: if your cloud warehouse charges you based on how much data you scan, you can safely make a certain number of days of highly granular event data available for direct querying and put the rest in cold storage (how you do this will vary by warehouse, but here’s an interesting example on a public dataset that illustrates this point well).
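As a rough sketch of that “hot slice” idea (the model and column names here are made up, and the date function is warehouse-specific), you could materialize only the trailing week of granular events for direct ad hoc querying while the full history stays in the base table:

```sql
-- models/events_recent.sql (hypothetical)
-- Only the trailing 7 days of granular events get materialized for
-- direct querying; the full history stays in the (cheaper) base table.
{{ config(materialized='table') }}

select *
from {{ ref('stg_events') }}
where event_timestamp >= dateadd('day', -7, current_date)
```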
❓ What business decisions can safely react to information that is less granular than a day?
In other words, when do you not need daily data at all? For us, this is usually after a business time period comes to a close. A couple of examples:
We use monthly business reviews as a key accountability mechanism for important company metrics. This means that after the close of a month, we generally want to “finalize” metrics related to the past month’s revenue, product usage, and the state of our product funnel. In these conversations, once a period passes, we reason in months and not days. And unless we got these metrics seriously wrong, we generally don’t need to compute them again and again.
Similarly, we’d want to lock down the values of important metrics for a past quarter in preparation for a board meeting, and once we reach the end of a fiscal year to allow us to visualize trends on a longer time horizon.
Why this is useful:
you’ll be able to reduce the frequency of computations on large sets of data to key business time periods: end of month, end of quarter etc.
users of your data model will see less data drift and trust your data more — if values of an older metric change, it’s because you’ve explicitly decided that it needed to be re-done. This is incredibly useful for anything that needs to be audited.
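One way to sketch the “compute it once at period close” idea in dbt (all names here are illustrative) is an incremental model that only ever adds fully completed months, so finalized values don’t get silently recomputed later:

```sql
-- models/monthly_metrics_closed.sql (hypothetical)
-- Each fully completed month is computed once and then left alone unless
-- you deliberately rebuild it.
{{ config(materialized='incremental', unique_key='metric_month') }}

select
    date_trunc('month', metric_date) as metric_month,
    sum(net_new_arr) as net_new_arr,
    sum(new_signups) as new_signups
from {{ ref('fct_daily_metrics') }}
-- never include the current, still-open month
where date_trunc('month', metric_date) < date_trunc('month', current_date)
{% if is_incremental() %}
  -- on scheduled runs, only append months we haven't finalized yet
  and date_trunc('month', metric_date) > (select max(metric_month) from {{ this }})
{% endif %}
group by 1
```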
Doing this exercise helps you identify the right strategy for modeling your data for a given use case before you write any code, and it makes the problem of building a more resilient DAG much more tractable.
Step #2: Allow your jobs to run independently of one another
If you have a massive DAG, odds are you aren’t running enough jobs or spending enough time thinking about how to parallelize your work. There are two parts of your existing DAG that probably need to be carved out into their own jobs if they aren’t already:
Event data is always going to be your largest and slowest job. It’s also the type of data that usually needs to land in the warehouse the quickest to power up-to-the-minute insights. Set up incremental models for event data as early as possible (don’t wait for this dataset to get big!), and run these jobs as frequently as you can (there’s a minimal sketch of this pattern just below). Ideally you never want to do a full refresh of your incremental model, so you’ll want to do only as much transformation and enrichment on this data as is absolutely necessary. At worst, you will want to re-run certain time intervals if you know the source data was problematic in some way. Since you aren’t going to query this type of high granularity, low latency data over very long time periods, it’s OK if it’s not perfect — just make sure you can join it to other useful things! ;) If you have multiple event sources, you might even consider setting up separate jobs. For example, if you have web sessions data and telemetry from your SaaS product, you can usually safely run them in parallel.
Snapshots are most useful when they are done consistently, and at roughly the same time interval. They will likely contain things like copies of the tables in your engineering database that hold user information, and will power the dimensions of your most important business entities. It matters less how frequently you snapshot — what you want to avoid are unpredictable delays. This is another reason to keep your snapshots as a separate job — that way they aren’t held back by a massive event materialization, for instance.
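To make the event piece concrete, here’s a minimal sketch of an incremental event model (the source and column names are made up): it appends only new rows, does the bare minimum of transformation, and can safely run very frequently.

```sql
-- models/stg_events.sql (hypothetical)
-- Lightly-transformed event data, appended incrementally so frequent runs
-- stay fast and a full refresh is (hopefully) never needed.
{{ config(materialized='incremental') }}

select
    event_id,
    entity_id,
    event_type,
    event_timestamp
from {{ source('product', 'raw_events') }}
{% if is_incremental() %}
  -- only pick up events newer than what's already in the table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```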
Design the rest of your DAG for idempotence and reserve most of your data testing for further downstream. For example, your daily jobs will likely pull together context to enrich your event data, and do what is needed to compute your daily metrics. This is the ideal place to add some sort of variance testing — how confident are you that you have a full day of data? If you aren’t, you likely want your daily job for that particular area of the business to fail while you investigate and correct the issue.
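In dbt, one way to express that check is a singular test, i.e. a query that returns rows when something is wrong. This is just a sketch (the model name and the 70% threshold are arbitrary) of a “did we actually get a full day?” check:

```sql
-- tests/assert_yesterday_has_a_full_day_of_events.sql (hypothetical)
-- Fails (returns a row) if yesterday's event volume is less than 70% of
-- the average daily volume over the prior week.
with daily_counts as (
    select
        cast(event_timestamp as date) as event_day,
        count(*) as row_count
    from {{ ref('stg_events') }}
    group by 1
)

select *
from daily_counts
where event_day = dateadd('day', -1, current_date)
  and row_count < 0.7 * (
      select avg(row_count)
      from daily_counts
      where event_day between dateadd('day', -8, current_date)
                          and dateadd('day', -2, current_date)
  )
```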
Remember that you don’t need your daily job to also run monthly, or other longer time interval calculations. This means you also want different tests on monthly and longer time interval jobs. Your testing rigor should be inversely related to how frequently your job is run — the fewer times you want to re-run a computation, the more rigorously you should be evaluating your data quality. At monthly intervals, for instance, you can start applying trends to help you understand if the data is within some expected variation (I personally like year over year comparisons for this) — if not, you will likely want to understand if this is a bug or a big change in the business.
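And at the monthly end of the spectrum, a sketch of the year-over-year idea might look like this (the model name and the 20% band are placeholders; pick a tolerance that matches how your business actually moves):

```sql
-- tests/assert_closed_month_within_yoy_band.sql (hypothetical)
-- Fails if the just-closed month moved more than 20% vs. the same month a
-- year earlier; a failure means "investigate: bug, or real business change?"
with monthly as (
    select metric_month, net_new_arr
    from {{ ref('monthly_metrics_closed') }}
)

select
    this_year.metric_month,
    this_year.net_new_arr,
    last_year.net_new_arr as net_new_arr_prior_year
from monthly as this_year
join monthly as last_year
  on last_year.metric_month = dateadd('year', -1, this_year.metric_month)
where this_year.metric_month = date_trunc('month', dateadd('month', -1, current_date))
  and abs(this_year.net_new_arr - last_year.net_new_arr) > 0.2 * abs(last_year.net_new_arr)
```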
Next, we’ll talk about how to design your tables to enable this sort of flexibility.
Step #3: Don’t pre-compute numbers before you need them. Seriously.
OK, this is not technically a step, but you will want to spend some time with this principle as you design your shiny, more resilient data model. This principle is all about avoiding “cubes”.
Outside of your event/snapshots jobs and your end-of-period business close jobs (monthly, quarterly etc.), what you want to achieve is a maximally flexible system that can be layered on in regular increments to power new and interesting combinations of metrics and exploration.
This is where strategically denormalized data incremented by something like a day really shines. Here’s a very simple example to illustrate what I mean:
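A minimal sketch of the kind of model I mean (the names, segments, and activity types are illustrative; in practice this would be built incrementally, a day at a time, as in Step #2):

```sql
-- models/daily_entity_activity.sql (hypothetical)
-- One row per entity, per activity type, per day, plus a few segment
-- columns we always want on hand (not the full event payload).
select
    cast(events.event_timestamp as date) as activity_date,
    events.entity_id,
    events.event_type as activity_type,   -- e.g. 'signed_up', 'viewed_site'
    entities.region,
    entities.plan_tier
from {{ ref('stg_events') }} as events
left join {{ ref('dim_entities') }} as entities
    on events.entity_id = entities.entity_id
group by 1, 2, 3, 4, 5
```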
In the above example, you can find the unique IDs of every business entity that did something on your platform, grouped by a few bits of context you always want to have (let’s say they are important segments for your company metrics). There’s one row per activity per entity that had activity on a given day. With this data structure, you can quickly count the distinct number of new signups over an arbitrary time period (say, monthly to help with our business reviews!), the number of users who viewed your site on a given day in a specific region, isolate the user activity of user #156432 who is having an issue that support is trying to debug, or several other permutations of useful metrics and insights.
This table is not designed to have all of the context behind an event (e.g. every single page view if a user has 100), but it should allow you to compute a big chunk of your key metrics with something like the dbt Semantic Layer with no additional joins. Most of the more in-depth insights you’ll want should be an easy join away.
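For instance, the monthly signup count from above is a single aggregate over that model, assuming the illustrative names from the sketch:

```sql
-- Distinct new signups per month, straight off the daily activity model.
select
    date_trunc('month', activity_date) as signup_month,
    count(distinct entity_id) as new_signups
from {{ ref('daily_entity_activity') }}
where activity_type = 'signed_up'
group by 1
```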
Assuming you have daily tests that you can count on to fail when a day’s worth of events looks funky, if you use this kind of lightly denormalized approach, you won’t have to re-run a lot of jobs or a large DAG to recover from an error. You really just need to replace the last couple of days where something went wrong.
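One hedged way to wire that up is to give the incremental model a short reprocessing window, so each run replaces the trailing few days (the three-day lookback and the delete+insert strategy are illustrative, and strategy support varies by warehouse adapter):

```sql
-- models/daily_entity_activity.sql (hypothetical), now with a rolling
-- lookback: every run rebuilds the trailing 3 days, so a bad day heals
-- on the next scheduled run without touching the rest of the DAG.
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='activity_date'
) }}

select
    cast(events.event_timestamp as date) as activity_date,
    events.entity_id,
    events.event_type as activity_type,
    entities.region,
    entities.plan_tier
from {{ ref('stg_events') }} as events
left join {{ ref('dim_entities') }} as entities
    on events.entity_id = entities.entity_id
{% if is_incremental() %}
  -- reprocess only the last few days on each run
  where cast(events.event_timestamp as date) >= dateadd('day', -3, current_date)
{% endif %}
group by 1, 2, 3, 4, 5
```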
The best part: the next time someone comes along and asks you to slice something in a different way on that dashboard, you’ll be able to say — yeah, no sweat, that’s easy!
There’s lots more to say here about how to handle failures and re-run this kind of architecture in a dbt-esque way, but that will fill up an entire other blog post. Perhaps you’ll read about this part of our DAG transformation (pun fully intended) from the folks who will be working on making these improvements on our internal team! ;)
Until then, take care and thanks for reading!