New Podcast Episode!
In Episode 5 of the Analytics Engineering Podcast, Erik Bernhardsson joined Julia and me to talk about the fundamental infrastructure challenges that drag down the productivity of data teams (and, of course, the Spotify recommendation system that he invented).
If you're enjoying the show, I’d be forever grateful if you’d leave a review on your platform of choice.
Enjoy the issue!
How Contracts Enable Distributed Ownership
Let’s start by wading into Data Mesh Twitter…I think Benn captured my own thinking on what data mesh even is better than anyone else I’ve read:
The original definition is a fairly complex set of architectural requirements. Its pillars are technical, and, as data mesh creator Zhamak Dehghani made clear, quite specific. The second definition—the descriptivist one the community is gravitating towards—is more vague, essentially claiming that any decentralized architecture in which teams are responsible for their own “data products” is a data mesh.
My interest here is in the second definition: a decentralized architecture in which teams are responsible for their own data products. The technical definition doesn’t resonate with me (although it’s not clear to me whether that’s more about me or the idea itself). But the distributed-ownership-of-data-products-by-distinct-teams thing… well, that’s important. In fact, I’m becoming increasingly convinced that distributed ownership of data products is critical to the future of our industry.
Why would this be true?
The corollary to “software is eating the world” is “software is eating the org chart.” Every company is a software company today, in a world where an ever-increasing percentage of customer interactions (of all types) are mediated by tech. This means that, in a 10,000-person organization, there isn’t (cannot be!) a single control point that monolithically drives all application development. Instead, there are many teams responsible for building technology platforms and user-facing products.
High-functioning technology organizations largely devolve ownership of decisions down to the team level, or as close to the team level as they can get, to ensure that the folks who understand the problems are empowered to create appropriate solutions for them. Teams can then identify, write code to address, and ship solutions to solve customer problems without getting approval for their decisions from some central authority that would inevitably make the process slower and worse.
It is generally desirable for these codebases to work together, creating unified customer experiences and reducing the number of software engineers doing duplicative work. For this to happen, they all have to interface with one another (in a technical sense)—they have to read/write each others’ APIs or otherwise share data and functionality.
All of these teams and their component humans are, in essence, jointly evolving a single collective codebase. Even if engineers are working across hundreds or thousands of code repositories, there are expectations that these repositories will interconnect with one another in reliable ways. They’re all part of a … dare I say it … mesh.
What we observe here is distributed production and governance of a shared resource: software code.
I share all of this because what we are actually looking for, yearning for, as a data industry right now is distributed production and governance of a shared resource: knowledge. There are both knowledge producers and knowledge consumers distributed throughout modern organizations. All centralized / top-down modes of organizational control over what constitutes knowledge have failed in predictable ways (velocity, responsiveness, general dysfunction). We currently lack good end-to-end distributed production / governance / consumption paradigms that would enable all stakeholders to jointly steward this resource.
I’ve been obsessed with this problem for a few years now. We (dbt Labs) started working with large organizations (10k+ people) several years ago and have been running into the limitations of the state of the art of distributed ownership in data ever since then. I’ve always wanted the answer to “how do I distribute ownership of my data models?” to be “just make multiple dbt projects that all work together” but today it just doesn’t work…as well as you’d hope. Amy Chen details why here. It’s become painfully clear to me that the solution is not extremely straightforward, which is why I initially started looking around for inspiration.
If we’re trying to achieve something that software engineering has already achieved, what can we learn from its model? And, what from its model will likely not apply?
The question is too big to answer in this newsletter…I really need to sit down and write a less stream-of-consciousness post about this. But let me riff for a minute on why I think it’s all about contracts.
Here are some of the problems software engineers have faced in distributing ownership of software code.
Problem #1: Things Breaking
This manifests in two ways. First, bugs inside Team A’s code can cause unpredictable (and hard-to-trace) downstream bugs in Team B’s code. Second, changes in Team A’s code (that are 100% intentional) can cause similarly unpredictable and hard-to-trace downstream bugs in Team B’s code.
Software engineers deal with these problems in a few ways.
Extensive testing and code coverage metrics. If you can’t show how battle-hardened your code is, others won’t trust it.
Semantic versioning, version-aware package managers, multiple cloud API versions, and upgrade / deprecation procedures. Team B doesn’t allow Team A to just change the functionality of their dependent code without making a proactive, well-considered decision to upgrade to the newest version.
Public / private interfaces. Downstream code is forced / encouraged to use specific integration points / APIs that are intended to be supported in a stable way. Libraries maintain internal logic and state that they can change without fear of breaking downstream dependencies.
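To make the semantic versioning point concrete, here is a minimal sketch of the kind of rule a version-aware package manager applies before letting Team B pick up a new release from Team A. The names (`Version`, `parse_version`, `is_safe_upgrade`) are hypothetical, and the rule is deliberately simplified: it only handles plain MAJOR.MINOR.PATCH strings and treats any major version bump as a breaking change that needs a deliberate decision.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A plain MAJOR.MINOR.PATCH version (hypothetical helper for illustration)."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse 'MAJOR.MINOR.PATCH'; pre-release and build tags are not handled."""
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_safe_upgrade(pinned: str, candidate: str) -> bool:
    """Return True when candidate is not older than pinned and keeps the same major version.

    Simplifying assumption: only a major version bump signals a breaking change.
    """
    old, new = parse_version(pinned), parse_version(candidate)
    return new.major == old.major and new >= old
```

Under this rule, Team B pinned to `1.4.2` would pick up `1.5.0` automatically, while `2.0.0` would wait until someone on Team B proactively chooses to upgrade.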
Problem #2: Infra and Deployment
If Team A maintains a codebase that Team B interfaces with, Team B not only needs to know how to build a test environment for their codebase…they also need to know how to build a test environment for Team A’s codebase. Often this requires Team B to know how to deploy Team A’s code!
This is one of the great benefits of the Docker + Kubernetes combo. In an idealized version of the “infrastructure as code” world, Team A actually ships their code along with a Docker image and Team B can incorporate that into their own build tooling.
Problem #3: Knowledge about the Code
If Team A’s codebase is used by Teams B through Z, and every time those teams have a question about it they are forced to ask Team A for help…well, Team A is never going to get any work done. Software engineers have done a ton of work to help downstream users of codebases understand how to correctly interact with their work:
Automatic documentation, from things like javadoc to swagger and automated API documentation products that create amazing developer experiences.
Conventions that emphasize writing programs as a mechanism to communicate about their functionality to other humans.
There are certainly many things I’ve left out of the above list, whether to keep it from turning into a book or because of my own ignorance. But the overall picture looks something like this:
Team A needs to make good on certain contracts in order for Team B to be able to reliably consume its code. In data, today, we are far more capable than we used to be of providing these contracts to our downstream consumers. But there are three big areas where we fail:
Semantic versioning. Generally, there is only one version of a given model deployed at any given point in time. Because of the size and therefore cost of building derived datasets, it is very unusual to keep multiple versions of, for example, a `dim_customers` table around to be referenced. This lack of semver prevents Team B from incorporating upstream changes on a predictable cadence and therefore leads to potentially very many breakages with every single release from Team A. IMO this is the biggest problem with the current state of the world.
Testing / test coverage. While dbt does include testing capabilities, these tests are designed to validate the underlying data. This vision of the world requires tests that validate code. Without first-class support for these types of tests and corresponding test coverage metrics, it is challenging for downstream consumers to know what is a reliable foundation to build upon.
Public / private interfaces. There is no great mechanism to expose only a subset of Team A’s functionality to Team B. While it is possible to imagine ways of hacking this, it is certainly not something that is straightforward or widely practiced today.
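To illustrate the distinction between tests that validate data and tests that validate code, here is a rough sketch of the latter. It is not how dbt works today; the transformation `build_customer_totals` and its column names are invented for the example. The idea is that the logic is run against a small fixed fixture with a known expected result, so the test says something about the code itself regardless of what happens to be in the warehouse.

```python
from collections import defaultdict


def build_customer_totals(orders):
    """Hypothetical transformation: sum completed order amounts per customer."""
    totals = defaultdict(int)
    for order in orders:
        # Only completed orders count toward a customer's total.
        if order["status"] == "completed":
            totals[order["customer_id"]] += order["amount_cents"]
    return dict(totals)


def test_ignores_non_completed_orders():
    """A code test: fixed input, known expected output, no warehouse involved."""
    fixture = [
        {"customer_id": 1, "status": "completed", "amount_cents": 500},
        {"customer_id": 1, "status": "refunded", "amount_cents": 700},
        {"customer_id": 2, "status": "completed", "amount_cents": 250},
    ]
    assert build_customer_totals(fixture) == {1: 500, 2: 250}


test_ignores_non_completed_orders()
```

A data test, by contrast, would check something like "customer_id is never null" against whatever rows were actually loaded. Both are useful, but only the first kind tells Team B whether Team A’s logic does what it claims.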
I believe these are solvable problems, although certainly non-trivial. And I believe that solving these problems would do more to enable distributed ownership over the collective resource of knowledge than anything else I can imagine. If everyone talking about data mesh wants to dive in and think about solutions, I think this is what we have to solve.
Imagine an organization with 100 distinct teams of 10 people each, all involved in curating knowledge in their own domain, all publishing it out to the rest of the organization using certain predictable conventions and tooling. No central authority governing the process. Highly reliable and responsive.
I think we could actually live in this reality.
There’s a lot more to be said about this…it’s where my brain seems to be hanging out at the moment. If you have thoughts, I’d very much love to mix them into the slowly-bubbling stew. Respond to this email, yell at me on Twitter, or DM me on dbt Slack.
From the rest of the Internet…
Pedram Navid on Data Mesh. Agree with almost all of this, although not with the “how do we get there” part, which is really what got the thinking started for my rant above.
👌 The obligatory Benn Stancil on the experience of using the modern data stack (disconnected and impenetrable).
🤔 Natty takes a data-driven look at the analytics engineer.
🖥️ I wrote a quickie on why IDEs matter in the analytics engineering workflow.
🤯 Mike Weinberg, as usual, blowing minds about the future of data systems.
💬 Fantastic conversation about who is responsible for the performance of SQL that runs inside of your data warehouse. I agree with the group’s conclusions: experienced AEs must have performance optimization expertise. There is just no one better positioned to do this work.
📈 Solid Hacker News conversation on the state of BI, centered around the launch of a new narrative-style viz tool called Evidence (which I like a lot!).