Lessons from the warehouse
Structure versus Process
Decentralization has haunted the tech zeitgeist for quite a few years now, but with the advent of better distributed query engines and data formats, it has grown from a vibe in the ether into something fully manifested in the collective dreams of analytics.
The thing is, we’ve invested a lot of thought in how to do centralized warehousing — so how do we surf this cycle productively? How do we make of the ecosystem’s respiration, its bundling and unbundling, something that moves us forward? We need to dig into the heart of what we’ve learned over the past decade of the modern data stack and find the lessons to hold on to as our data glaciers melt into diverse, vibrant watersheds.
One aspect of centralization with a bad rap, the thing people most point to when extolling the virtues of decentralization, is process. Reed Hastings wrote an entire book about how much he dislikes it. Meta’s “move fast and break things” has ascended to a household cliche signaling Silicon Valley’s disdain for the stuff. It’s generally regarded as a fungus that feeds on centralization, spreading as it compounds.
There’s a lot to learn from looking at mycelium, though. Could there be value buried in this maligned outgrowth of centralization? I’ve been doing Internal Family Systems therapy for the past year, which has a motto of its own: “no bad parts”. The idea is that even our most counter-productive impulses are motivated by some kind of reasonable response to the conditions in which they were formed. They’re trying to serve us in some way, and if we can figure out what that is, we can point them in a more effective direction. Certainly people are drawn to creating process for a reason, so can we tease out the good within process and find something worthwhile to bring with us into a more decentralized future?
To do that, we first need to understand what exactly process is. I’ve thought about this a lot in trying to build better ways of working at dbt Labs and consulting for data teams, and the definition I’ve landed on is this:
Process is layers of structure with controlled gates to move from one layer to the next.
So, what’s so slow about it? If you work with data or software, you’ll probably spot it immediately — it’s the control mechanisms to move through the layers of hierarchy. We centralize power in a relatively small number of people or other processes which are invested with the authority to move work through the tiers of a process. Centralizing power in this way slows work down through the creation of bottlenecks.
Think about a dbt DAG, and for simplicity, let’s say we’re running 1 thread. We’ve got two unrelated models called customers and events. Now even though they could, in theory, run in parallel, in practice they can’t; one will have to wait until the other is built and tested before it starts running, because we can only operate on one thread at a time. When we create process, our empowered decision makers are like our threads.
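To make the thread analogy concrete, here is a minimal sketch of a greedy scheduler. It is not how dbt schedules work internally; the makespan function, the build durations, and the customer_events model are all hypothetical, and the graph is assumed to be acyclic with at least one thread available.

```python
import heapq

def makespan(durations, deps, threads):
    """Return the total time to build every model with a fixed number of threads.

    durations: model name -> build time (hypothetical, arbitrary units)
    deps: model name -> collection of upstream model names (assumed acyclic)
    threads: how many models may build at once (assumed >= 1)
    """
    remaining = {m: set(deps.get(m, ())) for m in durations}
    ready = sorted(m for m, d in remaining.items() if not d)
    running = []  # min-heap of (finish_time, model)
    clock = 0
    done = 0
    while done < len(durations):
        # Start as many ready models as there are free threads.
        while ready and len(running) < threads:
            model = ready.pop(0)
            heapq.heappush(running, (clock + durations[model], model))
        # Jump forward to the next model that finishes.
        clock, finished = heapq.heappop(running)
        done += 1
        # Anything that was only waiting on that model becomes ready.
        for m, d in remaining.items():
            if finished in d:
                d.remove(finished)
                if not d:
                    ready.append(m)
    return clock

durations = {"customers": 3, "events": 2}
print(makespan(durations, {}, threads=1))  # the two models queue up behind each other
print(makespan(durations, {}, threads=2))  # the two models build side by side
```

With one thread the unrelated models take the sum of their build times; with two, only the longer of the pair. Swap "threads" for "people allowed to approve work" and the bottleneck is the same.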
Now imagine we throw the thread count wide open to 128 threads, but we implement a set of automated tests, pre-commit hooks, linters, CI/CD, and a formatting tool that enforces our style guide. Now we’ve created more structure, but minimized process.
Structure, unlike process, is a positive force.
Structure is the actions and elements of a single layer of process, without layers of hierarchy or control mechanisms.
As we can see in the above example, it creates safety, speed, and clarity. I know the form my work should take, I have tools to quickly and automatically ensure it takes that form, and I can more easily cooperate because I know the same form will apply to the work of my teammates.
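As a rough illustration of structure without a gatekeeper, here is a sketch of an automated naming check. The rules are an assumed style guide (snake_case names with a layer prefix), and check_model_name and check_project are hypothetical helpers, not part of dbt or any particular linter.

```python
import re

# Hypothetical style guide: snake_case names that start with a known layer prefix.
ALLOWED_PREFIXES = ("stg_", "int_", "fct_", "dim_")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def check_model_name(name):
    """Return a list of style violations for one model name (empty means it passes)."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("name is not snake_case")
    if not name.startswith(ALLOWED_PREFIXES):
        problems.append("name lacks a layer prefix")
    return problems

def check_project(names):
    """Check every model name and return only the ones with violations."""
    return {n: p for n in names if (p := check_model_name(n))}

print(check_project(["stg_customers", "fct_events", "Events"]))
```

Nobody has to sign off on a name here. The same rules run for everyone, in a pre-commit hook or in CI, and the feedback arrives in seconds rather than in a review queue.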
These clear expectations also create a more equal distribution of power, which increases velocity. When I know how and where to build what I want to build, and am confident I can ship it, I move faster and with more purpose.
While the popular belief that process slows things down may hold true, we’re crucially missing that structure is not process, and structure helps more people go faster. If we can combine the clarity of integrated structures with the speed of distributing storage, compute, and control, we’ll get more out of the shift underway than we would by wantonly throwing out structure alongside process and centralization.
So how and where should we apply structure (not process!) to our ever-decentralizing data platforms? If we’ve learned anything from the past several years of the modern data stack, it’s that a single source of truth is a powerful north star vision. If we could keep one thing from this centralized data warehouse era, I would argue it should be this. Our efforts, then, should be pointed towards creating structures that more easily create unified meaning across diverse data sources and compute. Process-reducing, structure-creating tools like the semantic layer, declarative orchestration, and data contracts become even more valuable in this future. The work in front of us as analytics engineers is to build systems in which a single source of truth emerges cooperatively.
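To give a flavor of what a data contract looks like as structure, here is a deliberately tiny sketch. It is not dbt’s contract implementation; CUSTOMERS_CONTRACT and contract_violations are hypothetical, and rows are assumed to be plain Python dictionaries.

```python
# Hypothetical contract: the column names and Python types a producer promises.
CUSTOMERS_CONTRACT = {"customer_id": int, "email": str}

def contract_violations(rows, contract):
    """Return readable violations found in rows (an empty list means the contract holds)."""
    violations = []
    for i, row in enumerate(rows):
        # Columns the contract promises but the row does not provide.
        for col in sorted(set(contract) - set(row)):
            violations.append(f"row {i}: missing column {col}")
        # Columns that are present but hold the wrong type.
        for col, expected in contract.items():
            if col in row and not isinstance(row[col], expected):
                violations.append(f"row {i}: {col} is not {expected.__name__}")
    return violations

good = [{"customer_id": 7, "email": "a@example.com"}]
bad = [{"customer_id": "7"}]
print(contract_violations(good, CUSTOMERS_CONTRACT))
print(contract_violations(bad, CUSTOMERS_CONTRACT))
```

The producing team and the consuming team never need a meeting to agree on what “customers” means. The contract states it once, and any team on any compute can check its own data against it.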
From around the web
A brilliant piece from Stephen Bailey about bringing data, as a team and a type of resource, more fully into an organization. A goal like this, making data a more purposeful and impactful function that feels more intrinsically embodied in a company, is one of the most compelling reasons for exploring localizing data work more closely to where it’s produced and used.
In his article on transformation for AI, Benn Stancil discusses evolving new, more standardized and machine-friendly formats for LLMs to consume. Structural specs like Activity Schema could be one path towards such an approach. Teghan Nightengale has been excited enough about its potential to share an open source dbt package to make building on it easier.
A big one for both the centralized present most of us live in and the distributed future. If you’re going to exert less top-down control, you’d better have technical patterns in place that are efficient. Niall Woodward has put together the best analysis of CTEs in Snowflake written to date. If you’re running dbt on Snowflake, it’s a must-read.
dbt Community member Manish Ramrakhiani is getting ahead of the curve, and of the amazing work dbt Labber Sung Won Chung has been doing on multi-project dbt and data contracts, by exploring these concepts using the tools that exist in dbt today. A fantastic example of thinking about structure-driven decentralization.
Arrow has been in the headlines recently with the release of pandas 2.0, which can now run on a much faster and more interoperable Arrow backend. Great news for analytics engineers trying to choose between building dbt Python models with the venerable pandas or the newer Polars: now that both can be Arrow-based, why not both?
The blog from Tabular, a new Iceberg-based distributed storage platform, has also been great for getting excited about features of the format, like tags and branches.
Thank you for reading this week’s Analytics Engineering Roundup. Please feel free to share! I hope you’re having a wonderful present, and I look forward to seeing you in a more empowered and impactful future!