Ep 24: The Hard Problems™️ of Data Observability w/ Kevin Hu (Metaplane)
Data observability is really a mashup of many complex problems - how much of it can truly be automated? How much is just enough?
As a PhD candidate at MIT, Kevin (and friends) published Sherlock, a semantic data type detection engine (a surprisingly bedeviling problem) for data cleaning + data discovery.
Now as co-founder and CEO of Metaplane, a data observability startup, Kevin applies these same automated data discovery methods to help data teams keep their data healthy.
In this conversation with Tristan & Julia, Kevin wins the coveted award for “most crystal-clear explanations of complex technical concepts through physics analogy.”
Listen & Subscribe
Show Notes
Key points from Kevin in this episode:
Not everybody goes on to get a master's with a focus in data visualization. Can you talk about that? How did that come to be?
After I transferred to MIT, the first course that I took was an experimental lab course called Junior Lab. It was notorious for being one of the hardest weeder classes for physics students. Everyone says that they love physics until they take this class, and then they transfer out.
One of the real privileges of going to a school like MIT is being able to have colleagues who really exist on another plane, like physics Olympiad gold medalists. One of them was my lab partner at the time. Going through this course, where every two weeks you replicate a Nobel Prize-winning experiment, I realized that everyone did the experiments in about the same amount of time. Where things got difficult was analyzing the data and presenting it. The people who didn't know how to program (in this case, Python or MATLAB) and then present the results in LaTeX had a really, really difficult time.
In parallel to that, my sister was finishing up her Ph.D. in biology. She's a neuroscientist now, and she had been collecting data on fish behavior for five years. Towards the end, she asked, "Kevin, can you help me analyze this?" And I thought that was the most ridiculous thing, right? You could spend five years collecting this data, you're a high-powered scientist, and you're bottlenecked by R.
Is data observability just another play on data quality, or does it have a distinct new meaning in your mind?
For me, data quality is a problem that people in the real world face, and data observability is a technology. As far as I know, people who work with data don't wake up in the morning saying, "Oh no, my biggest problem is data observability" (if you exist out there, please send me an email, I'd love to chat). It's a technology that can be used to address data quality issues as well as other issues, though it's not strictly necessary to solve that problem. And for me, data observability is a concept that really describes how much visibility you have into your data systems.
One useful proxy for that visibility is: to what degree can you reconstruct your data at any point in time? To give an example, if you have a table with some numeric distribution, and you have the mean and the standard deviation of that distribution, you can reconstruct it to a reasonable degree.
If you have a column with categorical values, you can also partially reconstruct it from properties like the number of nulls or whether it contains PII, but never fully. The goal is just to know, at any point in time, that you can solve whatever problems you have at hand. This is where it connects to the software observability world, where metrics, traces, and logs tell you about the state of your software infrastructure at various points in time.
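To make that proxy concrete, here is a minimal sketch (not Metaplane's implementation) of the kind of summary statistics that let you approximately reconstruct a column's state at a point in time, using Python and pandas; the column name is invented for illustration.

```python
import pandas as pd

def column_snapshot(series: pd.Series) -> dict:
    """Summary statistics that approximately 'reconstruct' a column at a point in time."""
    snapshot = {
        "row_count": len(series),
        "null_fraction": float(series.isna().mean()),
    }
    if pd.api.types.is_numeric_dtype(series):
        # Numeric column: mean + standard deviation sketch the distribution
        snapshot["mean"] = float(series.mean())
        snapshot["std"] = float(series.std())
    else:
        # Categorical column: distinct counts and top values sketch the distribution
        snapshot["distinct_count"] = int(series.nunique())
        snapshot["top_values"] = series.value_counts().head(5).to_dict()
    return snapshot

# Example: one snapshot per run gives you a time series of the column's state
df = pd.DataFrame({"order_amount": [10.0, 12.5, None, 9.9, 11.2]})
print(column_snapshot(df["order_amount"]))
```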
Help us understand why metadata, metrics, lineage, and logs make the system complete: how do they all fit together to actually answer the user's question of what happened?
Part of the motivation to try and derive a new set of pillars is that while there's a lot of ink being spilled about how we in the data world can learn best practices from the software world, there are important differences as well, right? You can't copy them one-to-one without at least thinking about it a little bit critically. For us, we tried to go back to first principles and ask: if we were just trying to describe any system, where do we start?
Going into another physics analogy, we said, "Okay, if you're trying to describe this glass of water next to me, how do we do that?" Well, one, there are intrinsic properties of the water, such as its temperature and entropy, and there are also external properties, like how much water is in the glass. I can double the amount of water, but the temperature can stay the same; I can increase the temperature, but the volume stays the same. The analogy in data is: what are the internal characteristics of the data, like the distribution of a column or whether it contains PII? And what are the external characteristics, like the schema, how fresh the data is, and how many rows it has? You can change the distribution and these external characteristics somewhat independently. Of course they're related, but one can change without affecting the other. So that's metrics and metadata.
Another angle is to think about a system in terms of its interactions, right? Internal interactions and external interactions. Within a data warehouse, for example, you have the relationships between datasets, which we like to call lineage, where one dataset is produced by applying an operation to other pieces of data upstream. So that's a relationship between pieces of data.
You also have external relationships. What machines produce this data? Who is interacting with it, either as a data producer or a data consumer?
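Putting those pillars side by side, here is a small, hypothetical data structure (not Metaplane's schema) that groups what Kevin describes for a single table: intrinsic metrics, extrinsic metadata, internal relationships (lineage), and external interactions (logs).

```python
from dataclasses import dataclass, field

@dataclass
class TableObservation:
    """A hypothetical per-table snapshot grouping the four pillars."""
    table: str
    # Intrinsic characteristics of the data itself (metrics)
    metrics: dict = field(default_factory=dict)   # e.g. {"amount_mean": 42.0, "amount_null_fraction": 0.01}
    # Extrinsic characteristics: schema, freshness, volume (metadata)
    metadata: dict = field(default_factory=dict)  # e.g. {"row_count": 10_000, "last_loaded": "2022-01-01T00:00:00Z"}
    # Internal relationships: which upstream datasets produced this one (lineage)
    lineage: list = field(default_factory=list)   # e.g. ["raw.orders", "raw.customers"]
    # External interactions: who or what produces and consumes the data (logs)
    logs: list = field(default_factory=list)      # e.g. [{"actor": "fivetran", "action": "load"}]

obs = TableObservation(table="analytics.orders", lineage=["raw.orders"])
print(obs)
```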
Is there something fundamentally different about the way people think about data observability vs. software observability?
A few things come to mind. One is that the tools we have available to us have made it a little bit difficult to get that kind of information, which is why I love dbt Cloud and its metadata API.
Because finally, I have a tool that makes the lineage easy for me to parse. I just grab the JSON and all the metadata and logs; it's very easy for me to consume. If it's not easy, there are so many other things for data practitioners to do that it just doesn't become a high priority.
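As one concrete route to that lineage, here is a rough sketch of parsing the manifest.json artifact that dbt writes locally (rather than the hosted metadata API); the file path is the default dbt output location, and the printed model names depend on your project.

```python
import json

# dbt writes a manifest.json artifact that records, for each node,
# the upstream nodes it depends on.
with open("target/manifest.json") as f:
    manifest = json.load(f)

# Map each model to its upstream dependencies (model-level lineage)
lineage = {
    node_id: node.get("depends_on", {}).get("nodes", [])
    for node_id, node in manifest.get("nodes", {}).items()
    if node.get("resource_type") == "model"
}

for model, upstream in lineage.items():
    print(model, "<-", upstream)
```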
I would say the biggest difference, and the reason data observability is a problem now, is that the stakes are higher, right? For the past few decades, the main vein of data throughout organizations was decision support. This was a big use case that Edgar Codd talked about in his seminal papers: using data to help support decisions.
Today we refer to that as business intelligence and dashboarding. But we also see data going into a whole bunch of different use cases nowadays, right? Whether it's being reverse-ETLed back into a go-to-market tool like Salesforce so you can start targeting leads, or training machine learning models for allocating ad spend.
Data work can become quite mechanical with a lot of practice. How do we get more of the qualitative things, like the business context that tells you a number doesn't feel quite right, into your observability system?
That is such a good question. And I just want to plus-one you for a second there by saying that data begins and ends outside of the analytics team, right? As much as we talk about warehouses becoming the source of truth, a warehouse is not the source of any truth; your Snowflake does not actually create data. So I totally agree that the main expertise lies with the people who are producing and the people who are consuming the data, and that there is definitely a limit to how well a system like ours, or any external system, can really understand the semantics, for lack of a better word.
For observability tools, I think one of the hard challenges is being able to use human time and attention effectively, right? Those are the most valuable resources in the world. We will need some annotations from real humans who understand what the heck is going on, but we don't want them to be constantly looking at a tool and all the tables. For us, we try to automate as much as possible and only get human feedback on the most important or uncertain anomalies.
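As a toy illustration of that trade-off, a simple z-score rule (not Metaplane's actual models) suppresses most alerts and only surfaces large deviations for a human to review; the row counts below are made up.

```python
from statistics import mean, stdev

def flag_if_significant(history: list[float], latest: float, z_threshold: float = 3.0):
    """Return a z-score only when the latest value deviates strongly from history,
    so humans are asked to review as few alerts as possible."""
    if len(history) < 2:
        return None  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return float("inf") if latest != mu else None
    z = abs(latest - mu) / sigma
    return z if z >= z_threshold else None

# Example: daily row counts for a table; the latest load looks suspiciously small
row_counts = [10_020, 9_980, 10_050, 10_010, 9_995]
print(flag_if_significant(row_counts, latest=4_200))  # large z-score -> worth a human's attention
```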
How viable is it to take on all of that integration work in order to get a single picture of metadata?
I think the viability of doing that depends on how the utility of the tool scales with the number of integrations you have for a given customer; that sounds a little abstract. But in a data analysis tool, I would say that if you're missing a key integration, the rest is much less valuable, so it is important to cover all of the bases.
Whereas what we've seen is that if you hit the big bases, like integrating with Snowflake, dbt, and Looker, which is by far the median data stack for our customers, then you've already covered a lot of the bases in terms of the data, and the rest are optional.
The audience is different, and the types of integrations they would have to build are different. What I'm willing to bet on is almost a case of analogous evolution. Right now we're building a ClickHouse integration, and the people asking us for it are mostly software engineers who are doing data work. They wouldn't call themselves data people or purple people, but they're essentially doing data work. And when I looked into ClickHouse, I thought, wow, this is so similar to the OLAP databases in our world: Snowflake, Redshift, and BigQuery.
But it emerged in a whole different ecosystem. It's almost like porcupines and cacti: both evolved spines, right? But they're not related to each other. It just turns out that having big pokey things stops you from getting eaten a lot of the time. Still, people wouldn't generally put ClickHouse and Snowflake into the same bucket in terms of technologies to evaluate.
Similarly, I could imagine a Datadog building something comparable. They might not call it data observability, and it might not be marketed or sold to data people, but it would be functionally very similar.
Looking 10 years out, what do you hope will be true of the data industry?
No matter what vendors like me say, I think the number one problem facing most data teams today is hiring: trying to bring on the right people for the job and to help grow the data team. I think things are trending in the right direction. People talk about software being this enormous industry and data becoming an equally enormous industry, but if you boil that down, it comes down to one single person making the decision to take their career into the data world versus some other industry.
And we know, being on the other side, that working in data can be very engaging and very fun; you get to learn a whole bunch of different things and talk to interesting people. My hope is that we can make it as easy as possible for that person to decide, "Okay, maybe I'll take my career in the direction of data."
And there's so much we can do to make that happen, right? For a society to invest in its citizens, for companies to invest in their more junior employees, for people to invest in themselves, making that super easy. My hope is that in 10 years, there's almost no barrier to working in data if you want to.
Links mentioned in the episode:
More from Kevin:
Find Kevin on LinkedIn or Twitter @kevinzenghu.