Interfaces and Breaking Stuff
The MDS (modern data stack) is too tightly coupled; we need cleaner interfaces and expectations between subsystems.
Barr Moses and Andrew Jones talk about GoCardless’ approach to implementing data contracts. I don’t know that I’m ready to take a strong stance on exactly the approach that the post outlines, but I 100% agree with the problem statement and am exceedingly happy for more attention on the overall topic of contracts within the modern data ecosystem.
My primary MO as a data product thinker is stealing best practices from the world of software engineering and porting them to the world of data. And one of the things that I think the data world is tremendously suffering from today is a lack of contracts—or, as I’d prefer to call them, interfaces.
In computing, an interface is a shared boundary across which two or more separate components of a computer system exchange information.
In a world without clearly-defined interfaces, your systems are tightly coupled: everything directly calls everything else, and there is no opportunity to insert rules or helper functionality between subsystems. This is how our world works today in the MDS: for example, Fivetran loads data into a table, and dbt reads data from that table. If Fivetran changes the schema of that table, it can easily break the dbt code reading from it. No contracts, no interfaces, no guarantees. Tight coupling.
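To make that boundary concrete: about the closest thing a dbt project has to a declared interface with the ingestion layer today is a source definition with tests attached, something roughly like this (the source, table, and column names here are made up):

```yaml
# models/staging/stripe/sources.yml -- illustrative only; names are made up
version: 2

sources:
  - name: stripe                  # schema loaded by Fivetran
    tables:
      - name: customers
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
          - name: customer_email  # if Fivetran renames or drops this column,
            tests:                # this test fails -- but only the next time
              - not_null          # it runs, after the change has already landed
```

And even that is a tripwire rather than a contract: the tests only run after the fact, and nothing stops the upstream change from happening in the first place.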
This works very differently in a well-architected software system. The interfaces between subsystems are well-defined, and when you change something that breaks something else, you see that error at compile time rather than at runtime. This is because the system itself knows what to expect from each of its components, and the minute your code fails to live up to those expectations, the compiler lets you know.
The importance of catching errors at compile time cannot be overstated—while you are writing the code is exactly when you need to know that something is broken! Finding out that something is broken once it’s been pushed to production causes havoc.
Today, we’re catching far too many error states in production. This is why there is so much focus on “observability” today—error states show up in production and we need tooling to understand/diagnose/fix them. But instead of investing in ever-more-robust monitoring of our tightly-coupled and often-fragile production pipelines, what if we made our pipelines more robust?!
Breakages almost always happen at the edges between systems—between ingestion and transformation, between transformation and BI/analytics, etc. But they can also happen at the boundaries between dbt projects when an organization structures its code as multiple projects. One project imports another and, without clearly defined interfaces, the two projects become tightly coupled. This creates all of the same problems as a mono-project, but all of a sudden the authors of the individual projects have less visibility into cross-project breakages introduced by their changes.
This is what prevents most users from adopting a “microservice architecture” (or data mesh?) built from a series of independent, scope-constrained dbt projects today. This situation is a problem, and it’s an area of product research for us that I’m very interested in. A few of us have been talking about this in a GitHub issue.
IMO what is required in order to create a world of greater modularity is twofold:
1. Private vs. public methods. Allow subsystems to expose only certain pieces of their functionality to other parts of the DAG. Allow a given dbt project to, for example, expose five models but keep the 80 upstream models private. This allows those 80 upstream models to change without breaking downstream code maintained by others; as long as the five public models remain unchanged, the project maintainers can change the others as required. Importantly, this needs to be governed inside of dbt (using the ref statement) and not in database permissions!

2. Versioned models. Allow multiple versions of a model to be consumed at the same time so that downstream models can implement an upgrade path. Publishers of an API give consumers of that API a fairly long period to upgrade to the next version—sometimes years! There needs to be similar functionality for creators of dbt models. You can’t just change something and then expect all downstream consumers of your work to update theirs within 24 hours when they get breakage notifications—this is just not a feasible way to build a data practice. The way we’re working here is almost designed to ship buggy code.
These are absolutely required features of software languages and projects, and as the complexity of our own work increases, we’re going to need them too.
I’m specifically thinking about these two areas because they seem like functionality that the language/framework should provide—they feel to me like something we need to be baking into dbt itself on some time horizon.
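To be concrete about what I’m imagining, here’s a purely hypothetical sketch of what that kind of project-level configuration could look like; none of this syntax exists in dbt today, and the model names are invented:

```yaml
# models/marts/core/_core__models.yml -- hypothetical syntax, not real dbt
version: 2

models:
  - name: dim_customers
    access: public        # part of the project's interface; other projects may ref() it
    versions:
      - v: 1              # existing consumers keep ref()'ing v1 while they migrate
      - v: 2              # new consumers adopt v2; v1 gets deprecated on a long timeline
    latest_version: 2

  - name: int_customers__deduped
    access: private       # internal implementation detail; changing or deleting it
                          # can never break another team's project
```

With something like this in place, a maintainer could refactor the 80 private models at will, and a breaking change to one of the five public models would happen via a deliberate, versioned deprecation rather than as a surprise.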
But I think there are other ways to think about addressing the problem of upstream (ingestion) and downstream (BI & analytics) interfaces. Let’s return for a second to the “Fivetran updates the schema of a dbt source which breaks downstream code” scenario.
What if there were an automated process that constantly watched for schema changes in tables registered as dbt sources? What if that process, every time it detected a schema change in one such table, automatically ran every downstream model and test and exposure in order to check for errors? What if that process, upon finding errors, could intelligently update the staging model built on top of that source for the majority of schema changes and queue this up as a PR?
This feels ugly still... What if we went a step further—what if the ingestion tool, before modifying a schema with a source attached to it, was forced to kick off that same CI process and validate the safety of this schema change prior to making it? If the change was unsafe it would be blocked, similarly to how a PR with failing CI checks can’t get pushed to prod.
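Sketching what that gate might look like: the trigger is the hypothetical part, since no ingestion tool today will pause a schema change and wait on your CI, but the check itself is just ordinary dbt running against the proposed schema in a staging environment. The webhook event, source name, and warehouse adapter below are all assumptions:

```yaml
# .github/workflows/schema-change-gate.yml -- hypothetical: assumes the
# ingestion tool fires a webhook *before* applying a schema change and
# blocks the change until this job passes
name: schema-change-gate

on:
  repository_dispatch:
    types: [proposed_schema_change]

jobs:
  validate-downstream:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and test everything downstream of the changed source
        run: |
          pip install dbt-snowflake   # warehouse adapter is an assumption
          # assumes the proposed schema has already been applied to the staging
          # schema that the "ci" target points at; build every model and test
          # downstream of the source and fail loudly on any breakage
          dbt build --select source:stripe+ --target ci
```

If the job fails, the ingestion tool declines to apply the change and surfaces the failure to whoever requested it.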
This feels much better—the ingestion layer probably should understand and respect that it is a part of a much larger ecosystem rather than simply making unilateral changes to schemas and expecting downstream tooling to clean up the mess. To quote from the original article:
data needed to become a first class citizen during upstream service updates, including proactive downstream communication
Emphasis mine…but I think we’re saying the same thing here!
The point of this stream of consciousness is not really to zoom in on this particular feature (you could do a very similar exercise for the BI layer!), but rather to suggest that, as our data systems get more complex, we need to quickly graduate through the levels of maturity that software engineers have already graduated through. As data practitioners, we now use code to express our logic, we write automated tests, and we use CI/CD and source control. We now need clearly-defined subsystems with clean interfaces! And each subsystem needs to understand and adhere to the set of expectations that the larger system has of it.
Probably, one of those expectations should be: don’t make changes that break downstream stuff.
Does that feel as painfully obvious to you as it does to me? Rather than building systems that detect and alert on breakages, build systems that don’t break.
One of the reasons that I don’t think this is the direction that the ecosystem has been pushing in of late is that it’s not something that can be solved by a single tool, a single team. The MDS ecosystem is made of lots and lots of tools made by lots and lots of teams, each of them a small component of the overall data processing system. The solution to this problem needs to span that entire socio-technical ecosystem.
Elsewhere on the internet…
HAH. Ethan Rosenthal writes a post that suggests that ML Ops should be “bundled into the database”:
I can confidently say that there’s no Modern ML Stack. This field is a clusterfuck that’s surely in need of some bundling, so I propose that we ride the dbt train and consolidate into the database.
I’ve talked in many different places about how the world of analytics and the world of ML are too far apart today, and that I think it is inevitable that they come closer together over the coming five years. The thing that this post points out is that there are many use cases for ML that are of sufficient business-criticality that their recommendations must be monitored in real-time. Ethan’s thesis is essentially that, if the modern data stack is able to push into more real-time territory, ML Ops platforms may not be as necessary.
I don’t have a strong opinion on exactly this, but I do 100% agree that real-time (and especially streaming SQL computations like Materialize) will go a long way towards bringing the currently batch-based world of the MDS closer towards the often-streaming world of ML.
—
Ok this is a topic that you probably wouldn’t expect me to care about, but I found this Dan Cahana piece on Cloudflare and edge infrastructure fascinating. Here’s a rather large chunk out of the middle that I think is the most important part of the piece:
And while “edge” has historically been associated with industries like IoT, VR, and AVs, I believe we’re at the beginning of a wholesale shift of internet infrastructure from centralized clouds to edge networks. Step One was widespread adoption of CDNs, which are a default requirement for new websites at this point. Step Two is the move of additional compute, and ultimately data, to the edge.
There are now compelling forces driving that transition. Speed differences were initially dismissed as not significant enough to merit moving workloads closer to users, even by Cloudflare’s own CEO. But it’s becoming clear that even small incremental slowdowns can lead to user drop-offs and lower Lighthouse scores, affecting SEO. Other factors come into play, including:
New data privacy and sovereignty laws, which are forcing developers to keep user data local—something that’s impossible in a single-region deployment.
Widespread adoption of WebAssembly (WASM), which is creating greater portability between environments at native speed and making it easier to spread the same application across centralized cloud and edge.
As Vercel* CEO Guillermo Rauch pointed out, for most developers today “cloud” means AWS us-east. The foundations Cloudflare has laid over the past 10 years are beginning to change that.
As usual, since data systems are software systems, data practitioners have a lot to learn from the larger trends in software development. And edge is fascinating. I think it is extremely likely that data privacy & sovereignty laws will, just as Dan points out, force a fundamental rearchitecture of the data lakehouse. Rather than a bunch of data dumped into an S3 bucket hosted in US-East, you’ll have a bunch of country-level datastores that know how to cooperate to respond to queries. Each of them will perform its own computation, but only emit anonymized statistics and never the raw data, unless that raw data is staying in-geo.
This is a massive change in architecture, but it is conceivable that it could be largely handled at the infra layer and be somewhat abstracted away from most of us. My guess is that this takes a decade to come to your day-to-day, but maybe that’s conservative. Regardless, edge is something that you should at least be aware of.
—
Long-time data engineer (and recent dbt Labs team member!) Neelesh Salian started a newsletter, and his inaugural post is a fantastic writeup of Data & AI Summit 2022.
I like that DataAISummit is still fewer suits and more practitioners (going by the attendees). If the trend reverses, you know the conference is headed for its demise. I hope this conference survives in the long run.
100% agree, and I know how much Ali Ghodsi cares about exactly this dynamic. The Databricks founding team has kept, and will hopefully continue to keep, the practitioner-led community front and center.
—
There’s been a lot of conversation about what makes analytics engineering a rewarding role. This short Linkedin post by Madison Schott nails it. 💜