Discover more from The Analytics Engineering Roundup
Ep 10: Why Data Lineage Matters w/ Julien Le Dem of OpenLineage
Julien has a unique history of building open frameworks that make data platforms interoperable. Why's he now turned his attention to data lineage metadata?
Julien has contributed in various ways to Apache Arrow, Apache Iceberg, Apache Parquet, and Marquez, and is currently leading OpenLineage, an open framework for data lineage collection and analysis.
In this episode, Tristan & Julia dive into how open source projects grow to become standards, and why data lineage in particular is in need of an open standard.
They also cover some of the compelling use cases for this data lineage metadata, and where you might be able to deploy it in your work.
Key points from Julien in this episode:
Why do you think open source is the right way to get standards to market? What are some of the challenges in the open source route?
So I think those types of projects really benefit from being open source, right? They arise from a common need in the data community for shared understanding. People use different tools, and those tools need to be interoperable, and that creates the need: it needs to be open source, it needs to be shared, it needs to be something that can integrate across the whole stack.
So the good thing about open source in that context is that it creates really healthy incentives. You get started, and at the beginning there are one or two things using it, and then you grow a little bit. But the more projects integrate with it and the more people use it, the more valuable it becomes for everyone. Every person who joins, every integration added to the stack, makes it more valuable for everyone. So there's really this flywheel mechanism in being part of the community. Then you reach critical mass, there's an inflection point, and the project takes off because it makes no sense to do something else. Those mechanisms are really important.
Apache Parquet was like that; we kind of figured that out along the way. And then with Apache Arrow, we realized in the Parquet community that there was a need for a similar columnar in-memory representation, for sharing data at runtime between query engines and processors. It has different constraints, so it needs to be fundamentally different from Parquet. So we really leveraged that existing community: "Okay, we need to bootstrap the same mechanisms, and there need to be a lot of people involved." We started by agreeing on the fundamentals, and then we built more adoption around that original core.
And nowadays, for example, Apache Arrow is everywhere, right? You can retrieve your result set from BigQuery in Apache Arrow format, just because it's going to go much faster if you have Python and you're reading from BigQuery. It's going to be much faster to process in memory, and you avoid all the conversion overhead.
OpenLineage is pretty similar in that sense. It's not a query processing optimization, but you have the same need for everybody to understand lineage in the same way and to be able to exchange lineage information, so that data catalogs, data quality tools, data observability tools, query engines, and schedulers can all connect together, and you can build that map of how everything is connected. So there are real benefits to starting in open source: building consensus around the core capability in this ecosystem, reaching out to people in other open source projects or in proprietary software, agreeing on a base, and then building around it. And really building various features around the core standard, but also the community and the adoption.
This need exists, right? So all we need to do is create these focal points that get people's attention and say, "Oh yes, we need to contribute to one thing, and this is the thing. So let's make it happen."
At what point do you think that this open format for metadata is useful or becomes important to companies?
It comes with a certain level of complexity, and typically when there's more than one team involved in data, right? As long as you have one team that's responsible end-to-end, from ingestion, ETL, and transformation to exposing the data, it's really clear. People can communicate, they know how everything works, they can figure out where a definition lives, and it's fine. When the complexity starts growing a little bit and you have more than one team doing data, you start having friction at the boundaries between teams, right? That's where teams might have slightly different practices or conventions, and things are less obvious. They know the data they're consuming, but they don't necessarily know what's upstream of it, where it's coming from. They know the data they're producing, but they're not necessarily aware of all the things that depend on it. And they may not have signed up for supporting a specific, really difficult use case, and now they're on the hook for making that data reliable for that use case downstream.
There's a lot of that. So typically, a growing number of teams is one context where you start needing data lineage, this way of building the map, right? You have an automated way of having a map of the territory and understanding what you depend on, what depends on you, and how all of those things depend on each other, whether there's an issue or not.
Another aspect is people using different stacks, right? You may have analysts who use more SQL, you may have machine learning engineers who are training models, you may have people working on the ETL stack getting data from various locations. So often the complexity also comes from using different stacks, because they use different tooling. Analytics people know to look at SQL to understand their dependencies. dbt, for example, is great at showing your lineage within your project, but beyond that they'd be lost: actually, this data is also used to train some model, and that goes into the product. And when that breaks, the company loses money or something.
So there are lots of levels of complexity, and sometimes it's not just organizational, it's just over time. Over time, people have used different technologies. Maybe they started with Hadoop and they have five MapReduce jobs that are still running as cron jobs that nobody is touching. And maybe at some point they moved to Spark, and then there's a bunch of Spark SQL jobs, same thing, that are running. And then they started using dbt against Snowflake. Those things accumulate, right? They come from different eras of technology, and maybe it's too expensive to just rewrite everything to dbt and say, "Oh, now we have a nice environment where we understand everything." So there are different aspects like that which introduce complexity and make it hard to understand what's going on or to troubleshoot problems.
In fast-moving environments, a lot is changing: if they're building analytics for their product, the product is changing constantly, the way metrics are collected is changing constantly, and there's lots of room for breaking things without really understanding the impact. That's where lineage is really important, whether for the reliability of data, for ensuring compliance with regulation, or for ensuring that the metric we're making business decisions from is actually derived from the correct data source.
What tangibly can you help people solve? And how do you give them the right insight to fix the problem vs. adding to information overload?
I'm wearing two hats, right? With my OpenLineage hat on, I can enable the ecosystem, and there are a bunch of use cases we talked about: governance, compliance, operations... With my Datakin hat on, I'm focusing specifically on data reliability. And one thing, since you talked about the first generation of open source: Datakin is not hosted Marquez or OpenLineage, right? It's its own product, and OpenLineage is our way to make the whole industry much more transparent about what lineage is.
And OpenLineage is really focused on the fact that the best time to collect lineage is at runtime. When things are running, that's when you can collect all the metadata that you want, instead of going to parse, say, whatever logging already exists, which is very lossy and loses lots of information. So to make the problem we're solving concrete: we're looking at making sure the data is delivered, is delivered on time, and is correct. So there's this whole data quality, freshness, running-on-time aspect, and really being able to quickly figure out the root cause of a problem.
When you have a data quality issue, it always, always comes from somewhere upstream. It's very rarely just, "Oh, your query is wrong." No, something else upstream has changed: maybe it's the way we instrument it, maybe it's third-party data you're receiving, or maybe someone changed the logic somewhere, some transformation. So it's really about quickly finding the root cause of a problem. Today, it can take people hours, days, weeks to figure out what the root cause of a problem is, because you see what your data looks like now, but you don't know what it looked like yesterday.
And figuring out those root causes quickly means doing impact analysis, understanding whether you're going to break something. Because often people end up not touching anything, or duplicating pipelines, because when they change something, it breaks something. And not only do they break something, they don't know that they broke something, and it comes back to them at the end of the month, when we can't close billing anymore, things like that. So that's really what Datakin focuses on.
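The root-cause and impact-analysis workflows Julien describes amount to walking a lineage graph upstream or downstream. A minimal sketch in plain Python (the dataset names and edges are invented for illustration, not from any real OpenLineage deployment):

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
upstream = {
    "revenue_dashboard": ["billing_summary"],
    "billing_summary": ["orders_clean", "payments_clean"],
    "orders_clean": ["raw_orders"],
    "payments_clean": ["raw_payments"],
}

def upstream_of(dataset):
    """All transitive upstream dependencies: the candidate root causes."""
    seen, queue = set(), deque([dataset])
    while queue:
        for dep in upstream.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def downstream_of(dataset):
    """All transitive downstream consumers: the blast radius of a change."""
    children = {}
    for child, parents in upstream.items():
        for p in parents:
            children.setdefault(p, []).append(child)
    seen, queue = set(), deque([dataset])
    while queue:
        for c in children.get(queue.popleft(), []):
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return seen

# Root-cause search: if revenue_dashboard looks wrong, inspect these first.
print(sorted(upstream_of("revenue_dashboard")))
# Impact analysis: changing raw_orders affects all of these.
print(sorted(downstream_of("raw_orders")))
```

Real tools do this over a much richer model (runs, schemas, versions), but the core question, "what is upstream of this broken table, and what is downstream of the thing I'm about to change," is exactly this traversal.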
Now, for the other use cases, there are a lot of people using OpenLineage and Marquez for privacy, for example. I want to know that my users' private data is going where it's supposed to go and not somewhere else, right? That we are using users' private data in the way consent was given by the user, to enforce GDPR and CCPA. So there's a lot of activity in that space that OpenLineage enables, because it keeps track of exactly how data has been transferred.
And, of course, there's the more general use of lineage, which is more about general understanding. I just want to understand how things depend on each other, right? It's inspecting and building a map of all the data transformations.
Explain how OpenLineage, Marquez, and your proprietary software, Datakin, map together.
So OpenLineage focuses on the lineage collection, right? The output of OpenLineage is JSON events that represent "this job has run," and it was reading from those tables, writing to those tables... here's the related metadata... this was the schema at the time we read from that table... this was the schema at the time we wrote to that table... so you can keep track of when it changed. So OpenLineage is the collection of metadata, introspecting jobs and making all of this available.
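To give a sense of those events, here is a minimal sketch of an OpenLineage-style run event built with plain Python. The top-level fields (eventType, eventTime, run, job, inputs, outputs, producer) follow the shape of the published spec, but the namespace, job and dataset names, and the schema facet contents are invented for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

# A hypothetical COMPLETE event for a job that read one table and wrote another.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},  # ties START/COMPLETE events together
    "job": {"namespace": "my-warehouse", "name": "build_orders_summary"},
    "inputs": [
        {"namespace": "my-warehouse", "name": "raw_orders"},
    ],
    "outputs": [
        {
            "namespace": "my-warehouse",
            "name": "orders_summary",
            # Facets carry the extra metadata Julien mentions, e.g. the schema
            # at the time of the write, so changes can be tracked run over run.
            "facets": {
                "schema": {
                    "fields": [
                        {"name": "order_id", "type": "STRING"},
                        {"name": "total", "type": "DECIMAL"},
                    ]
                }
            },
        }
    ],
    "producer": "https://example.com/my-pipeline",  # identifies the emitting integration
}

print(json.dumps(event, indent=2))
```

A consumer like Marquez receives a stream of events like this over HTTP and stitches the run-level inputs and outputs into the lineage graph.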
Marquez is an open-source reference implementation that consumes those events and keeps track of everything that changes, right? So you can see how your SQL has changed over time, you can see how the schema of a table has changed over time, and all of that is connected to lineage, really having lineage at the run level. You see how changing the schema of that table was connected to a change in the schema of this other table, for example.
And then Datakin is another consumer of OpenLineage. It's software-as-a-service, hosted in the cloud. It works a bit like Datadog: you sign up for free, you get your instance, you configure your OpenLineage integration, and now you have this dashboard in the cloud that draws a map of everything you have. Then you can start looking at how it changes over time. How long does this take? Is my SQL execution getting slower over time? Is the schema of a dataset changing? How do I correlate a failure here with an upstream change? Or invalid data with an upstream change?
So you can see Marquez as a reference implementation of OpenLineage that lets you look at the lineage and explore the model. And Datakin is more of an implementation for troubleshooting, really focusing on the operational use case. Is my data showing up? Is it showing up on time? Is it correct? How do I prevent breaking things? If something's broken, how do I quickly figure out the root cause of the problem?
Looking 10 years out, what do you hope to be true for the data industry?
I see a future where data is finally reliable, right? Data is correct by construction. And I think that's where dbt is an awesome product: the way you define your dbt project is really about building data correctly by construction. Dependencies are explicit, you understand your data dependencies, right? It's not just running job after job after job; the reason you run them one after the other is that there's a data dependency.
Today, it's not just that things are often broken; it's that you don't know whether the data is correct or not. It's really hard to be sure. And whenever something's broken or something looks suspicious, it takes a long time to figure it out.
So really, in the future, data is reliable, right? It's correct by construction. Just the way you build things, the way you define your data quality or define your jobs and your dependencies, it's all correct by construction. And when you want to make a change, it's automatically and immediately evaluated, and you know whether you can apply it or not. "I want to change this. Oh no, it's going to impact that," right? And you can't break things unwittingly or unknowingly.
And the same thing with dependencies and semantics being explicit: you know exactly what the root cause of a problem is. If something doesn't run, you know that data is not going to be updated. It's really explicit, really easy to maintain, and people don't have to worry about all the plumbing anymore. They can focus on what's important. I have bad memories of being on call for data pipelines, and I want to live in a world where you get paged once a month and that's it, not one where everything's constantly broken.