Have you seen Randy Pitcher’s hilarious and amazing video series inviting you to register for Coalesce? You could spend $10m at the best ad agency in the world and not come away with anything even close. So many other folks have joined in too, although I’ll never be able to link to them all. Thanks to everyone who has joined the fun :D
(PS: you have registered…right?)
Stephen Bailey writes one of his best ever posts on the topic of data contracts. The thing I want to zoom in on is that “contract” has two distinct contextual meanings:
In a software engineering context a contract is a technical artifact that prevents one from breaking certain rules—you technically cannot proceed if the contract and its underlying rules are violated. Typically the enforcement mechanisms are either a) your compiler or b) your CI.
In a legal context a contract acts as a disincentive to break the rules. You can still break them, but you also have an expectation of consequences. Lawyers/judges/etc are required to enforce those consequences.
The essence of Stephen’s piece is analogizing between these contexts, but it is not clear to me that lawyers and judges are needed to navigate technical contracts in the same way that they are for legal contracts. In the appropriate context, the software contracts enforce themselves.
The magic words there are, of course, “in the appropriate context.” What exactly does that mean? This is where Stephen’s article really shines.
This is the context a data lawyer wants to work in: “If you break backwards compatibility, you do not get to cut that release.” Or, “If you add a new table to the warehouse, you must support it forever.” Or, “If you copy and paste data into a Google Sheet, you’re fired.”
The organization must actually be committed to consequences in order for contracts to be useful. If you have a test coverage requirement that doesn’t break your CI, you don’t have a contract, you have a suggestion. If your upstream schema changes and the deployment goes through anyway…well, you get it. In this version of the world, there are no de facto contracts, only contracts implemented in code. [1]
The beautiful thing about contracts in software engineering is that they can in fact enforce themselves if you architect them that way. You just have to be willing to deal with the consequences.
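To make "enforces itself" a bit more concrete, here is a minimal sketch of a schema check that a CI job could run before a deploy. Everything in it is hypothetical: the `CONTRACT` columns, the type names, and the function names are illustrative assumptions of mine, not something taken from Stephen's post or from any particular tool.

```python
import sys

# Hypothetical contract: the columns and types downstream consumers rely on.
CONTRACT = {"order_id": "integer", "amount": "numeric", "created_at": "timestamp"}


def contract_violations(contract, observed):
    """Return human-readable violations of the contract (empty list if none)."""
    violations = []
    for column, expected_type in contract.items():
        if column not in observed:
            violations.append(f"missing column: {column}")
        elif observed[column] != expected_type:
            violations.append(
                f"type changed: {column} is {observed[column]}, expected {expected_type}"
            )
    return violations


def exit_code(contract, observed):
    """Return 0 if the deploy may proceed, 1 if the contract is broken."""
    violations = contract_violations(contract, observed)
    for violation in violations:
        print(f"CONTRACT VIOLATION: {violation}", file=sys.stderr)
    return 1 if violations else 0


# Hypothetical upstream schema in which `amount` was renamed to `amount_usd`.
observed_schema = {
    "order_id": "integer",
    "amount_usd": "numeric",
    "created_at": "timestamp",
}

# A CI wrapper would pass this value to sys.exit(); non-zero blocks the release.
print(exit_code(CONTRACT, observed_schema))
```

The only design choice that matters here is the non-zero exit code. If the CI step is allowed to fail without blocking the release, the same check is back to being a suggestion. Note that this sketch deliberately ignores additive changes such as new columns; whether those count as breakage is exactly the kind of rule the contract owner has to decide.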
Benn dove into the conversation about contracts on Friday with a detailed take on how data practitioners could implement contracts without needing to leave our current tooling. This is a good read, although I felt like it maybe got a touch more tactical than where I’m at. I still don’t think we know what our goals are yet, so I’m not ready to talk about implementation.
I had a very brief email exchange with George Fraser on the topic and it was very educational for me. George’s main point was that there are three primary types of data sources, and the schema of each of these types is maintained by a different persona:
Custom database, schema controlled by product engineers.
Schema-flexible SaaS products, schema typically controlled by line-of-business users.
Schema-fixed SaaS products, schema controlled by SaaS product vendor.
I think this really gets to the critical questions that I have. Who is the right person to write a contract? Who should suffer the consequences of breakage? Essentially all decisions about how contracts get implemented in practice flow from who is responsible:
Is it your product engineering team? This seems somewhat unlikely to me, as organizational challenges will make it very hard for them to opt in to this responsibility.
Is it the owners of Salesforce and all of your other line-of-business applications? This seems quite unlikely, as these tools intentionally abstract away the underlying technical details.
Is it the SaaS providers? This seems unlikely given that it’s hard to imagine what business interest would be served here. Data teams are not the primary customer of the Stripe API.
I’m not suggesting that I wouldn’t be supportive of any of the above folks signing up. Some engineering orgs (often at data-driven digital native companies!) may well sign up for this responsibility, and that’s great! And maybe one day Salesforce will design around how schema changes impact downstream data pipelines.
But I think the two parties most likely to accept responsibility today are: a) data/analytics engineers, b) data pipeline vendors. And once you reframe the question as either “how would analytics engineers create and enforce data contracts?” or “how would pipeline vendors create and enforce data contracts?” you can start to picture what a solution might look like.
Most approaches I’ve seen so far diverge on exactly this question. I think it’s likely that we’ll know the answer over the coming 12-24 months based on who raises their hand to say “I’ll take responsibility! I’ll accept the consequences!”
Responsibility can suck, but sometimes it is the path to great power.
From elsewhere on the internet…
It’s time to learn about a new open source project! Substrait is “rethinking DBMS composability” and promises to usher us into a brand new world that is…wait for it…even more decomposed than the one we’re in today! Hah. There’s a lot in these recent slides from a talk that feels exciting, but it’s also bleeding-edge and outside my area of expertise. I add the links here for your consideration. If anyone has spent any time with these ideas or with the tech itself, please let me know. I’d love to learn from you.
--
I just read a very cool overview of SQLite. In case you didn’t know:
SQLite is the most widely deployed database engine (or likely even software of any type) in existence. It is found in nearly every smartphone (iOS and Android), computer, web browser, television, and automobile. There are likely over one trillion SQLite databases in active use.
For those of us who spend most of our time in cloud-based, scale-out data warehouses, the entire concept of an in-process, local database is a bit of a conceptual challenge. What does one use such a thing for? Why is it needed?
While most SaaS applications are backed by cloud-based databases (e.g. Aurora), which seems natural to us, there are tons of applications that need to store their data on-device and return it with zero network latency. Think mobile, think IoT. Energy-conserving, network-not-always-available, latency-sensitive. Small, local, instant data storage and processing. Cool!
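If "in-process" still feels abstract, here is a small sketch using the sqlite3 module that ships with Python. The table and the sensor readings are made up for illustration. The thing to notice is that there is no server, no connection string, and no network hop: the database lives inside the program that queries it.

```python
import sqlite3

# ":memory:" keeps the database inside this process; a file path such as
# "readings.db" would persist it on the device instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, celsius REAL)")

# Hypothetical on-device data, e.g. from an IoT thermostat.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("kitchen", 21.5), ("kitchen", 22.5), ("garage", 12.0)],
)
conn.commit()

# The query runs locally, with no network round trip.
rows = conn.execute(
    "SELECT sensor, AVG(celsius) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
print(rows)
conn.close()
```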
The reason I link to this article is that this database profile is also coming to the world of analytics via DuckDB. My brain is much too small to really understand what our world looks like with dramatically more DuckDB adoption, but I know that it’s one of the things that many people whom I respect are the most excited about right now. Here’s a really useful use/do-not-use from the product creators:
The big question: what will we do with it?! Will analytical applications cache more data and do more processing locally? Will it power demo environments? I honestly do not know, but I know it’s worth paying attention to.
--
Chad Sanderson wrote about prototyping vs. production pipelines. The concept of “pipeline maturity” is timely, as ever-more analytics engineers are building pipelines and are in the process of learning how to walk the maturity curve. The article is worth reading for this concept alone.
I think Chad is going to be on an upcoming episode of the Analytics Engineering Podcast, so I’m excited to have the opportunity to go a lot deeper on many of these topics. For now, I’ll just inject two thoughts:
Prototyping vs. Production is not a binary, it’s a continuum. As a pipeline becomes more mission-critical to an org, it makes sense to invest in adding ever-more maturity to it. This investment can happen incrementally as business-criticality grows. I could even imagine a quantifiable metric (similar to test coverage?) that grades the maturity of a given pipeline.
dbt is used to build pipelines of all levels of maturity. Supporting this iterative maturation process has been a core design principle from day one. It’s critical not to overburden users with formality when they’re exploring, but it’s also critical to allow them to add maturity in-place. I have always thought about this as the tagline from an old Othello game box: “Minutes to learn, a lifetime to master.”
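To show what I mean by a quantifiable maturity metric, here is a rough sketch. The check names and the equal weighting are purely my own assumptions for illustration; this is not a dbt feature and not something Chad proposes in his article.

```python
# Hypothetical maturity checks; the names and equal weighting are assumptions.
CHECKS = ("has_tests", "has_docs", "has_owner", "runs_in_ci", "has_alerting")


def maturity_score(pipeline):
    """Return the fraction of maturity checks a pipeline passes (0.0 to 1.0)."""
    passed = sum(1 for check in CHECKS if pipeline.get(check, False))
    return passed / len(CHECKS)


# An exploratory pipeline that only has a named owner so far.
prototype = {"has_tests": False, "has_owner": True}

# A mission-critical pipeline that passes every check.
production = {check: True for check in CHECKS}

print(maturity_score(prototype))
print(maturity_score(production))
```

Because the score is just a fraction of checks passed, a pipeline can climb the continuum one check at a time as its business-criticality grows, rather than flipping a single prototype/production switch.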
[1] I would call all of those other things “guarantees”. An availability SLA is not a contract, it’s a guarantee. It requires constant human vigilance to maintain, including responding to pages in the middle of the night. A technical contract is enforced by the system it participates in.