We are currently accepting proposals for qualified speakers to join us at Coalesce 2024 in Las Vegas! Coalesce is the single biggest stage for you to highlight the amazing ways that you and your data team are innovating today.
Submit your ideas before the 4/3 deadline closes!
Long before my first `dbt run`, before SQL had invaded my dreams, I was working my first job as a research assistant in a university’s public health department. One of our projects was a literature review of the mathematical models of tuberculosis progression.
Turns out, lots of people had used mathematical modeling to understand how people come in contact with, contract, and spread tuberculosis. In theory, at least, arming ourselves with knowledge of how disease spreads should keep populations healthier. The issue: with so many estimates, how do we know which model is right? Which one should policy makers trust?
My team’s job was to pore over these models, extract their assumptions, and collate them in the world’s greatest BI tool, Microsoft Excel. We reviewed over 300 of these models and modeled the models. That way, we could summarize what the assumptions in the papers suggested about tuberculosis in the real world. The results were, well, a bit disjointed!
> Predicted tuberculosis incidence varied … annual incidence varied by several orders of magnitude, and 20-year cumulative incidence ranged from close to 0% to 100% … modelled results were inconsistent with empirical evidence … 40% of modelled results were more than double or less than half the empirical estimates.
In other words: some studies indicated effectively everyone would have tuberculosis before too long, and others predicted that we’d be an effectively tuberculosis-free society pretty soon! (I, for one, am glad to be living in a world that’s a bit closer to the latter prediction.)
This is certainly not meant to knock the quality of individual studies—modeling disease is an incredibly difficult problem! Each study advanced our collective understanding of disease progression and prevention. However, not until we zoomed out a level and did an analysis of our analyses did a clearer picture of what we knew about the world emerge. “Going meta” made us realize we had a long way to go to truly reach the truth.
Know thy DAG
Analytics engineers are tasked with encoding their organizational knowledge into data models, and producing analyses that show the clearest possible picture of the world they operate in. It’s vital work! Undoubtedly, data teams deepen their organization’s understanding of themselves, and decision makers are better off with the knowledge data teams produce.
But how often do we as data practitioners take a moment to zoom out and interrogate our methods? Is the way we’re encoding knowledge consistent? Can we improve our knowledge system in any meaningful way? Do we have any knowledge about the knowledge we’re generating?
Turns out, if you’re using dbt, you’re generating metadata about your data constantly: every time you run a dbt command, dbt writes artifacts that contain a rich map of the knowledge you’ve built. There’s a lot of wisdom in there if you take the time to read the tea leaves. The metadata in these artifacts lets you judge:
- **Quality** - your dbt `manifest.json` is packed with metadata. Parsing and analyzing this artifact can reveal untested, undocumented, or fragmented areas of your DAG that you can target for improvement (or even removal!).
- **Performance** - analyzing your `run_results.json` artifacts can clue you in to test failure rates, long-running models, and the like, helping you identify and improve the stability of your data pipelines.
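To make the quality angle concrete, here is a minimal Python sketch that flags undocumented and untested models from a parsed `manifest.json`. The field names (`nodes`, `resource_type`, `description`, `depends_on`) follow dbt’s documented manifest layout, but verify them against the schema version your dbt emits; the toy manifest is invented for illustration.

```python
def audit_models(manifest: dict) -> dict:
    """Flag undocumented and untested models in a dbt manifest dict.

    Assumes dbt's manifest.json layout: "nodes" keyed by unique_id,
    each node carrying resource_type / description / depends_on.
    """
    nodes = manifest.get("nodes", {})
    models = {
        uid: node for uid, node in nodes.items()
        if node.get("resource_type") == "model"
    }
    # A model counts as tested if at least one test node depends on it.
    tested: set[str] = set()
    for node in nodes.values():
        if node.get("resource_type") == "test":
            tested.update(node.get("depends_on", {}).get("nodes", []))
    return {
        "undocumented": sorted(u for u, m in models.items() if not m.get("description")),
        "untested": sorted(u for u in models if u not in tested),
    }

# Toy manifest standing in for json.load(open("target/manifest.json")):
sample = {
    "nodes": {
        "model.jaffle.orders": {"resource_type": "model", "description": "All orders."},
        "model.jaffle.payments": {"resource_type": "model", "description": ""},
        "test.jaffle.not_null_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.jaffle.orders"]},
        },
    }
}
print(audit_models(sample))
```

Against a real project you would feed it `json.load(open("target/manifest.json"))` instead of the toy dict.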
Going meta is the only way to truly know the full shape of your organizational brain. Until you make use of the metadata at your disposal, you’re seeing only parts of the whole.
CSVs all the way down
The dbt package `dbt_project_evaluator` was born out of this very problem! My esteemed colleagues on the dbt Labs professional services team and I were frequently tasked with auditing customer projects and recommending how they might adopt our best practices for knowledge generation with dbt. Reader, it was a damn slog. We were no more powerful (arguably less so) than I was in 2016 with my Excel sheet: we were left to comb through dbt’s metadata by hand. While dbt itself had rapidly advanced the way data work gets done, our ability to analyze the way we were doing that work lagged far behind.
We built `dbt_project_evaluator` to automatically translate this metadata into the shared language spoken by all data practitioners: SQL, baby! For each recommendation we make to clients, a dbt model returns exactly which resources in your project need attention. The package groups its dbt models by the type of meta-knowledge you may want to explore:
- **Modeling** - Does your dbt DAG follow modeling best practices?
- **Testing** - Are your data products well-tested?
- **Documentation** - Are they well-documented?
- **Structure** - Is your dbt project’s file structure organized? Do you apply naming conventions consistently?
- **Performance** - Are your model materializations set up to support solid query performance?
- **Governance** - Are you using dbt’s governance features properly to support a dbt mesh?
This is hardly an exhaustive list—there are likely more rules and more categories to be added as we continue to use dbt metadata to deepen our understanding of analytics engineering.
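If you want to try the package yourself, installation follows the standard dbt package flow. A sketch of the `packages.yml` entry is below; the version range is illustrative, so check the package’s releases for a current pin.

```yaml
# packages.yml (version range is illustrative; pin to a current release)
packages:
  - package: dbt-labs/dbt_project_evaluator
    version: [">=0.8.0", "<1.0.0"]
```

Then `dbt deps` pulls the package, and `dbt build --select package:dbt_project_evaluator` runs its models and tests against your project’s metadata.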
dbt Cloud makes this process even easier: it is essentially a metadata engine, parsing all your metadata and surfacing it to you in accessible, digestible ways. dbt Explorer now surfaces many of the same recommendations as `dbt_project_evaluator`, as well as insights about the longitudinal performance of building your knowledge graph, drawn from your `run_results.json` artifacts. If a dbt model is an academic paper, dbt Cloud is your academic library and dbt Explorer is your reference librarian and literature reviewer, empowering you with the essential knowledge about your knowledge.
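If you’d rather read those tea leaves yourself, the `run_results.json` artifacts mentioned above can be mined with a few lines of Python. This sketch assumes the documented artifact shape (a top-level `results` list whose entries carry `unique_id` and `execution_time` in seconds); the toy artifact is invented for illustration.

```python
def slowest_nodes(run_results: dict, top_n: int = 3) -> list[tuple[str, float]]:
    """Rank executed nodes by wall-clock time.

    Assumes run_results.json's documented shape: a top-level "results"
    list whose entries carry "unique_id" and "execution_time" (seconds).
    """
    timings = [
        (r.get("unique_id", "?"), float(r.get("execution_time", 0.0)))
        for r in run_results.get("results", [])
    ]
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top_n]

# Toy artifact standing in for json.load(open("target/run_results.json")):
sample = {
    "results": [
        {"unique_id": "model.jaffle.orders", "execution_time": 12.4},
        {"unique_id": "model.jaffle.payments", "execution_time": 0.8},
        {"unique_id": "model.jaffle.customers", "execution_time": 3.1},
    ]
}
for uid, secs in slowest_nodes(sample):
    print(f"{secs:6.1f}s  {uid}")
```

Collecting these rankings across many runs is exactly the kind of longitudinal view a metadata engine gives you for free.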
“Metadata”, at least to me, can feel a bit vague. Much of the time we’re content to just treat it as exhaust from the tools we use. But turning our full attention to it, even momentarily, can lead us to a better understanding of the very thing that produced the metadata in the first place. “Metadata” is really just more data at our disposal; it’s turtles all the way down.
This newsletter is sponsored by dbt Labs. Discover why more than 30,000 companies use dbt to accelerate their data development.