Ep 34: Why you’ll need data contracts (w/ Chad Sanderson + Prukalpa Sankar)
WARNING: This episode contains in-depth discussion of data contracts. Are they a solution to the collaboration challenges between producers and consumers that impact data quality?
The modern data stack introduces challenges in terms of collaboration between data producers and consumers. How might we solve them to ultimately build trust in data quality?
Chad Sanderson leads the data platform team at Convoy, a late-stage series-E freight technology startup. He manages everything from instrumentation and data ingestion to ETL, in addition to the metrics layer, experimentation software and ML.
Prukalpa Sankar is a co-founder of Atlan, where she develops products that enable improved collaboration between diverse users like businesses, analysts, and engineers, creating higher efficiency and agility in data projects.
Thanks for reading The Analytics Engineering Roundup! Subscribe for free to receive new posts and support my work.
Key points from Chad and Prukalpa in this episode:
A data contract is an agreement between the data producers and data consumers. According to Chad’s colleague Adrian Kreuziger, the term data contract is just another name for an API. Would you agree with that?
I think it is definitely a buzzword, 100%. But there's also an implication within the term contract that has resonated quite a bit with people. I think you're leading into this with your question, so I won't answer it fully, but when you're talking about a contract, the implication is that there has to be a conversation, and there has to be alignment and agreement between two sides.
And in many cases, when we're talking about APIs, it can be thought of almost like a one-way transaction, right? I'm a software engineer. I have an API. It is what it is. I produce that for my customers, and they can take advantage of that or they can't. But with data, it's a little bit different.
Oftentimes the software engineering team doesn't have any understanding of how their data is actually being used or what those downstream use cases are, so the importance around collaboration and the contract with an enforcement mechanism is pretty critical. But fundamentally, you are right.
It is just an API under a different name.
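The "API under a different name" idea can be sketched in code: a contract is a declared schema that producer and consumer have agreed on, and records can be checked against it mechanically. The field names below are hypothetical, not Convoy's actual schema.

```python
from datetime import date

# Hypothetical agreed-upon contract for a "shipments" dataset:
# a mapping of required field names to their expected types.
SHIPMENT_CONTRACT = {
    "shipment_id": str,
    "origin": str,
    "destination": str,
    "shipped_on": date,
}

def validate_record(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record (empty = valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors
```

The enforcement mechanism Chad mentions is exactly this kind of check, run wherever the producer hands data to the consumer.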
Why has the term data contract struck such a chord with data practitioners?
My two cents is that it's because of how emotional a problem this is. I don't think there is a data practitioner who hasn't faced the impact of not having some kind of agreement with a data producer: the producer inevitably changes something, and that breaks something you work with on a daily basis.
Everyone's had the 3:00 AM call happen to them, right? I remember one time, prior to Atlan, when we were a data team ourselves, I got a call from the Prime Minister's office at eight in the morning that basically said, "Look, a number on this dashboard doesn't look right."
And I opened up my laptop and there was a 2X spike that day in the data. It was clear something was wrong, and there was nothing I could do at that point. I called my project manager, who called my analyst, who called my data engineer, who pulled out all the logs, but he couldn't troubleshoot it. And this was like three modern data stacks ago, in the very early days.
So this was way harder to troubleshoot: we put eight people on it for four hours, trying to figure out what went wrong. We didn't know. And eventually, at the end of all that, we discovered that the system we were pulling data from had been sending us daily data, and then one day it randomly sent us cumulative data instead.
And so suddenly there's this random 2X spike. Obviously we lost agility, but worse, two years of hard-won trust with our stakeholders broke right there at that moment, because we didn't know how to answer the question. It wasn't even that something broke; it was that we didn't know why it broke.
That trust broke right then and there. And that's the story of almost any data team. We have a customer, a public company, that built out a flow chart of what happens in their public reporting. And they actually quantified it: literally 20 days of an analyst's time are spent fixing numbers that break.
And they quantified this down to what happens when something breaks: someone goes in and checks it, a calculation changes, then they message on Slack, then they look at tickets, then they send a message to somebody else. It's a lot of work. And I think Chad quantified this a little bit as well, right?
Like at some point your entire team is consumed with just fixing stuff.
What's the abstraction layer that you're proposing in your model of the world?
Yeah, so I think maybe "never" is a strong word. The way I see it, there needs to be a test or prototype environment for data, where you get a full copy of the production table in Snowflake. You can experiment with it, understand what the data means, try piping it through a model, try putting it into a dashboard. But at the point where it actually needs to become production, when we start surfacing it to a customer, or it fuels a machine learning model, or it goes into a dashboard that the board looks at, then there's another level of quality that needs to be put in place. But to answer your question directly: what do I think is the right layer of abstraction beyond CDC?
Like Prukalpa said, I do think it's a maturity curve, and the start of that maturity curve is basically still just CDC, right? It's people saying: there's some CDC stream with a schema or a property that's very meaningful to me, and I just want to ensure the schema doesn't break terribly. And if you are going to introduce a backward-incompatible change, can you please give me a heads-up that the change is coming? That's what I'd call a contract, and we have quite a few of those at Convoy. It's a pretty easy on-ramp for teams that are just getting started with this process. But very quickly those teams realize: okay, I now have five or six or seven teams all using that data, and they're all asking for contracts, and sometimes those contracts are contradictory. That's not good, so we want to provide a layer of abstraction.
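The "don't break my schema" level of contract can be sketched as a simple compatibility check: compare a proposed schema against the current one and flag the backward-incompatible changes a producer must warn consumers about. This is an illustrative sketch, not Convoy's tooling; the field names and type labels are hypothetical.

```python
def breaking_changes(current: dict, proposed: dict) -> list[str]:
    """List backward-incompatible changes between two schemas.

    Schemas are mappings of field name -> type name. Removing a field
    or changing its type breaks downstream consumers; adding a field
    is backward compatible, so it is not flagged.
    """
    problems = []
    for field, ftype in current.items():
        if field not in proposed:
            problems.append(f"removed field: {field}")
        elif proposed[field] != ftype:
            problems.append(f"type change on {field}: {ftype} -> {proposed[field]}")
    return problems
```

A producer's CI could run a check like this against the published contract and require a heads-up to consumers (or a version bump) whenever the list is non-empty.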
And what we use is stream processing. We use KSQL, but you could use Materialize or whatever you want. We essentially say: software engineers, you know what data you need to vend as part of the contract, and you've already agreed to do it, so you can decouple it from your production table.
And whatever contract data you're pushing to a consumer, you can write some little KSQL query that ensures all the data is in the right format, and that is the API you deliver to these consumers. All of the folks who were previously asking for totally separate contracts can now just take one.
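The KSQL step Chad describes can be sketched in plain Python as a small transform that reshapes raw change-capture records into the agreed contract shape, so every consumer reads one vetted stream instead of the raw production table. The raw and contract field names here are hypothetical.

```python
def to_contract(raw_rows):
    """Reshape raw CDC rows into the agreed contract shape.

    Only contract fields pass through; internal columns are dropped
    and types/formats are normalized, so the producer can change the
    production table without breaking consumers.
    """
    for row in raw_rows:
        yield {
            "shipment_id": str(row["id"]),     # normalize id to string
            "status": row["status"].lower(),   # normalize casing
            "updated_at": row["updated_at"],
        }
```

In a real deployment this logic would live in the KSQL (or Materialize) query that defines the contract stream; the point is that one transform serves every consumer.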
But I think there's one level beyond that, which is total decoupling. Total decoupling would basically be events, emitted from the production code. These are semantic objects: maybe what I care about is when a shipment was canceled, so I go into my service and implement that event, and I treat it as an API. Now that's completely separate from my production table, and I can operate the production table however I want. So that's the maturity curve of abstraction over time. At Convoy, we have people doing all of those things.
The most advanced teams are doing events, super high-quality events for very specific use cases. Some folks are doing the SQL abstraction, and some folks are just saying, don't break my CDC stuff.
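The events end of the maturity curve can be sketched as a versioned semantic event emitted from service code: the event, not the production table, is the API consumers depend on. The event name, fields, and `emit` helper are illustrative assumptions, not Convoy's actual implementation.

```python
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class ShipmentCanceled:
    """A semantic event: what happened, in business terms."""
    shipment_id: str
    canceled_at: str          # ISO-8601 timestamp
    reason: str
    schema_version: int = 1   # versioned so changes can be negotiated

def emit(event: ShipmentCanceled) -> str:
    """Serialize the event; a real system would publish it to a stream
    (Kafka, etc.) rather than return it."""
    return json.dumps({"event": "shipment_canceled", **asdict(event)})
```

Because the event schema is declared explicitly and versioned, the producer can refactor the production table freely as long as the emitted events keep honoring the contract.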
Are data contracts needed for production use cases?
I basically look at it as a spectrum and on one end of the spectrum, you have things that are very clearly prototypes. Like we just implemented some brand new feature. We know that feature's going to change. We need to get a really directional sense of how things are going and just understand what this data is.
But I'm not going to be showing this to my CEO anytime soon. I'm certainly not going to be pushing it to some dashboard embedded in the product or anything like that. That's definitely a prototype use case. And then on the other end, we have things that we know for sure are not prototype use cases.
We've got a financial reporting pipeline and we need to report those numbers for auditing and compliance reasons. Or we have a machine learning model that we know needs to be stable because it forms the backbone of our business. And then in between, there's a range, right?
There are some things that are definitely very important. It's critical for decision-making. Do we need 100% of it to be strictly under contract? Maybe not. Maybe there are 10 columns here. Eight of them we know for sure we need under contract and maybe the other two we're still trying to experiment with and figure out.
So I do think it's a spectrum, but the sort of reference that I've used a lot at Convoy is that if data quality contributes meaningfully to ROI, that's probably a good indicator that you need to start thinking about the contract.
Do you need strict data contracts from day 1?
No, definitely not.
In the ideal world, right? At least in my mind the way that I think about contracts and really just this whole sort of production-grade pipeline process is like DevOps, right? Do you need really finely tuned, highly structured, high governance DevOps from day one?
No. Do you need some DevOps, at least some governance, from day one? It's probably really helpful. It also allows you to scale that process really easily once you hit the point where you need it, right? It's way harder to implement a highly governed process if you have a software engineering team of 300 people that's never followed a process before.
So I think it follows a similar pattern here. Do you really need strict contracts from day one? No, you don't. But if you have a process of data producers and data consumers talking to each other, that makes it much easier to get there when you actually become a company of Convoy's size.
Yeah. My recommendation is usually: once the software engineering team becomes so large that it's difficult for the data engineers or data consumers to be in every meeting and understand, hey, we just launched a new feature and this is what the data looks like, that's when communication becomes really difficult, and something like a contract can be helpful for those high-quality use cases.
I think the only thing I would add is that the interesting thing we're starting to see is that the importance of this isn't governed just by the size of the company; it's about the type of use case. As we all talked about, when someone uses a dashboard to make a decision, there's still a human in that process, and a human can catch the error.
But one of the things we're starting to see a lot is, say, a 200-person company running a media monitoring platform, which means every change that happens is actually live, affecting decisions their customers are making.
And there's no human in the middle of this process.
I'm not talking about data activation. I'm talking about, for example, a data platform built on Snowflake at a company whose entire business model is based on the media monitoring they're doing downstream. Or I can give you another example.
This is a much larger company: their modern data stack covers a live customer experience platform, so it's triggering live emails that go to customers. There's no human in the middle of this process, right? This is actual live business, actual product functionality, being triggered downstream.
So when there are, in this case, hundreds or thousands of change events happening on a daily basis in their data, which trigger things like PII reviews and security notifications that need to happen downstream, there's a lot of actual impact. What we're starting to see is much smaller teams where the data platform is the backbone of the entire company.
And in those cases, where the use case is not just operational analytics or helping the company make better decisions through data, but actually helping the product, or helping end consumers and businesses make better decisions through live automated platforms, the importance of this becomes much greater, even in smaller organizations and smaller teams.