Ep 55: Data Mesh Architecture at Large Enterprises (w/ Moritz Heimpel and Ben Flusberg)
Ben Flusberg (Cox Automotive) and Moritz Heimpel (Siemens) on collaborating with data at scale
Moritz Heimpel from Siemens and Ben Flusberg from Cox Automotive have very similar jobs. They both act as stewards of the data strategies of large, complex organizations, and in this episode of the Analytics Engineering Podcast, they dive into some of the ways in which their data needs differ from those of simpler organizations.
In this episode, we want to get into what it’s like to collaborate with data at scale. Ben and Moritz share their experiences adopting a data mesh architecture and what that looks like at their organizations.
There’s also a fascinating discussion about internal charge-back models and how they impact the incentives of users of a data platform. We’d love to hear what people think about this.
Also, this episode is Julia’s last as a host of the Analytics Engineering Podcast. We’re sad she’s leaving the show, but also excited to say that she’s now deep in the weeds of building the AI startup LangChain. Julia, it’s been an incredible four seasons. Thank you so much for building this podcast together. This is also the final episode of Season 4, so we’ll take a couple of months off here and be back in your feed in early 2024.
Listen & subscribe from:
RSS feed
Read on below for the key takeaways from this conversation.
What do you think about facilitating collaboration? Moritz, I was in your office several weeks ago, and you showed a slide that mapped out a set of data interdependencies where all the lines connected to each other in this big spider web, and it felt challenging, let's just say. How are you unpicking that knot?
Moritz: That was a slide from our legacy world, which made transitioning very challenging. It was a mix of a data lake and a data warehouse. There was no real, let's say, formulated mesh strategy in place. We already knew back then that the more people we got on the platform, the better for us, but there was no real strategy behind it.
Over 10 years, this ecosystem grew to, in the end, 600 different services and project teams running on that platform. All of those projects and services somehow exchanged data with each other, but there was no catalog or marketplace in place to give that any structural order.
When we decided two or three years ago to innovate our stack and move everything that was running on-premise into the cloud, one thing we looked at was the interdependencies between the different teams. We needed to figure out a certain order in which the teams would migrate, and that order had to make sense because we needed to consider these dependencies. We had, between those 500 projects, 1,200 data exchanges going on, sometimes bidirectional, sometimes in only one direction, with more than 6,000 objects involved.
That was the Gordian knot I showed you, which we had to somehow untangle in order to migrate. It was also one of the leading themes when we thought about how to build the new platform: how do we avoid that from happening again? Because it's completely opaque, and it makes governance and ownership extremely difficult.
This is also how we gradually developed this domain thinking: dividing the data world into different domains, and deciding that we want to exchange data within a domain, but also across domains, through data products with a clear data contract and clear terms of use.
I think we are still in that transition, but we have built the platform in a way that technically enforces this. You cannot share data any other way anymore, which was a long discussion at the beginning, because for 10 years the teams were simply used to being able to push data back and forth between themselves without anyone noticing.
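Neither guest spells out the enforcement mechanism, but on Snowflake (which both organizations mention using later in the conversation), this kind of "no side channels" rule is often approximated with role-based grants: consuming domains only ever see a published schema of secure views, never the raw tables. A minimal sketch with hypothetical names:

```sql
-- Hypothetical sketch: domain A exposes data only through a "published" schema.
-- Raw tables remain private to the producing domain's own role.

-- A secure view acts as the data product's technical interface.
CREATE SECURE VIEW domain_a.published.orders_product AS
SELECT order_id, order_date, total_amount
FROM domain_a.raw.orders;   -- internal columns stay hidden

-- Consumers are only ever granted access to the published schema...
GRANT USAGE ON DATABASE domain_a TO ROLE domain_b_consumer;
GRANT USAGE ON SCHEMA domain_a.published TO ROLE domain_b_consumer;
GRANT SELECT ON VIEW domain_a.published.orders_product TO ROLE domain_b_consumer;

-- ...and never to the raw schema, so ad hoc point-to-point sharing
-- becomes technically impossible rather than merely discouraged.
```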
That was a paradigm shift that we introduced, and it came at a cost in the beginning. After one and a half years of discussing it back and forth from various angles, it is now something that is not only accepted but promoted, because everyone understands the value.
There's a structure to it that enables people to work with data in an easy way. I think we learned from mistakes in the past that led us to adopt our own version of a data mesh.
We are not doing a perfect data mesh by the books, but we have taken a lot of principles and ideas and then translated them to our needs.
We have also given a lot of thought to how to incentivize federated teams to build efficient solutions. It turns out that, as in private life, money talks. If you have to pay for your consumption, and if a bad implementation also costs you through consumption and credit costs, then this creates an implicit incentive to focus on efficiency and to proactively ask other teams whether they have similar problems or whether they can exchange solutions.
Through that mechanism, the collaboration aspect is created automatically. That's something I found really interesting, coming from a world where everything was on-premise and everything was fixed costs, where those discussions never happened. They were simply non-existent.
The only question for any data team was: does my pipeline, my job, run through? Then I'm happy. Or does it not? Then I need to tune a little bit. Now the discussion has completely changed.
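As a concrete illustration of the consumption-based charge-back Moritz describes, a report like the following could attribute Snowflake credit spend to teams, assuming each team runs on its own named warehouse (the per-credit dollar rate is an assumption):

```sql
-- Hypothetical sketch of a simple monthly charge-back report,
-- assuming one warehouse per team.
SELECT
    warehouse_name,
    DATE_TRUNC('month', start_time) AS month,
    SUM(credits_used)               AS credits,
    SUM(credits_used) * 3.00        AS est_cost_usd   -- assumed $/credit rate
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('month', -3, CURRENT_DATE)
GROUP BY 1, 2
ORDER BY 1, 2;
```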
Tristan: The idea of how you, in a distributed way, create the right incentives that lead to the right behavior is so fascinating.
It sounds like you've almost created pseudo-market-based mechanisms too. It's very interesting. There's another economic lesson that I'm worried about, one that's less positive: the tragedy of the commons. I'm curious how both of you think about this.
One scenario is, you have a business unit A that produces a data product that is very relevant. That data product is used by some people in business unit B. So they take a dependency on that. For a while that works, then there's something that happens that breaks that dependency, but business unit A and business unit B have totally different KPIs and goals and financial incentives.
How do you get the people in business unit A to care that they broke something? They can have all the visibility in the world, but they may just not have the incentive to fix it. How do you make sure that they fix it?
Ben: I'll share a couple of thoughts. This is not easy at big, complex companies, and as an enterprise we've been trying to drive much more of an enterprise mindset across all groups, coming from a history of silos.
Part of that is through incentives, like how bonuses work, but part of it is just culture. I think the bigger thing is really having a culture, from the top, of thinking with an enterprise mindset. I'll also add that when we talk about data mesh, having the centralized infrastructure for data helps.
As Moritz was saying, eliminating those spider webs, eliminating the point-to-point connections, and having a standard approach where the data all runs through one platform with similar data management practices and similar accessibility streamlines those conversations when something changes or breaks.
At least you get the visibility that it's broken, and those notifications are generated automatically. Of course, somebody upstream still has to do the work to make a change, improve the data quality, and fix it. Hopefully the businesses will work together, and we'll help facilitate those conversations.
How do organizations handle data security, especially when dealing with sensitive or personal data, and what mechanisms are in place for access control?
Ben: We've established three levels of data shareability. The first is highly restricted: that could be sensitive personal information, or it could be contractually restricted based on our agreements with vendors and customers. The second is a more limited tier: less sensitive personal information, some proprietary information. And the third tier is not limited or restricted at all.
And so the way we've set it up is that when our internal data suppliers provide this data to the centralized platform infrastructure, they have to answer a series of questions about what that data is and where it comes from.
Based on that, we assign the shareability level. Then, when someone requests access to that data, we have different policies in place, based on the shareability level, for how quickly and easily they get it. At the lowest tier, they just get access automatically. At the very highest tier, access actually requires CISO and legal approval.
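As a rough illustration of how such tiers might be wired up on a platform like Snowflake (which both guests mention using), the sketch below records the shareability level as an object tag. The tag name, object names, and tier labels are all hypothetical, not Cox Automotive's actual implementation:

```sql
-- Hypothetical sketch: shareability recorded as a Snowflake object tag.
CREATE TAG governance.tags.shareability
  ALLOWED_VALUES 'open', 'limited', 'restricted';

-- The answers from the supplier questionnaire determine the tag value.
ALTER VIEW sales.published.dealer_inventory
  SET TAG governance.tags.shareability = 'open';

ALTER VIEW hr.published.compensation
  SET TAG governance.tags.shareability = 'restricted';

-- An access-request workflow can then branch on the tag:
-- 'open'       -> grant automatically
-- 'limited'    -> data owner approval
-- 'restricted' -> CISO and legal approval
```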
Tristan: If we're talking about these three levels of access, where does that actually get implemented? Is that in a Snowflake permission somewhere? You both mentioned that you're using Snowflake. Where's the implementation of this policy? It could be there, or in Collibra, or some other data catalog.
Moritz: I mentioned that we have introduced this marketplace where people can publish their data. Technically, that's just sharing a view or an API through a certain mechanism that we have deployed in the background, which you can also trigger through dbt, by the way. Then, while publishing, there is a process in place, through an app, where you have to maintain your data contract.
There are things like security ratings, data privacy, and other mandatory fields that you need to fill out. This metadata is stored in Snowflake, tagged to the individual objects that are shared, and it is also visible in our catalog.
So no matter how you get to the data, you can always see the owners' evaluation of how it may be used.
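Moritz doesn't go into detail on the mechanism, but since publishing can be triggered through dbt and the metadata ends up as tags on Snowflake objects, a sketch might look like the following hypothetical dbt model, where post-hooks attach contract metadata as tags (all tag, model, and value names are assumptions):

```sql
-- Hypothetical dbt model: publish a data product as a view and attach
-- contract metadata as Snowflake tags via post-hooks.
{{ config(
    materialized = 'view',
    post_hook = [
        "ALTER VIEW {{ this }} SET TAG governance.tags.security_rating = 'internal'",
        "ALTER VIEW {{ this }} SET TAG governance.tags.contains_pii = 'false'"
    ]
) }}

SELECT
    machine_id,
    event_date,
    COUNT(*) AS event_count
FROM {{ ref('stg_machine_events') }}
GROUP BY machine_id, event_date
```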
Tristan: Got it. So you both have implemented solutions for turning this policy into an actual physical instantiation.
Moritz: I think we behave very similarly. We have three levels. The challenge is not formalizing that framework; the challenge is actually executing it. It's not always clear which category a given data set falls into, especially when you combine various data sets or aggregate them; then these levels might also change.
I think it makes a difference whether you have an HR report with the salaries and, I don't know, sickness days of all your employees, or just an aggregation, employees by country. That's a completely different thing, although it's just an aggregate of the underlying data.
Conceptualizing how sharing and security should work and what kind of restrictions should exist where takes some thought, but it's quite easy. Executing it and applying it to individual data sets and data products is where we still have challenges, and where you also see people discussing it a lot, because they are uncertain and don't want to make mistakes. That's at least what I'm observing on our end.
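Moritz's salary-report versus headcount example can be made concrete. A minimal sketch, reusing the hypothetical tag from earlier: the row-level HR table stays restricted, while a coarse aggregate of the same data is published at a lower tier (all object names are illustrative):

```sql
-- Hypothetical sketch: the restricted row-level data feeds a coarse
-- aggregate that can safely live at a lower shareability tier.
CREATE SECURE VIEW hr.published.headcount_by_country AS
SELECT country, COUNT(*) AS employee_count
FROM hr.raw.employees      -- restricted source: salaries, absences, etc.
GROUP BY country;

ALTER VIEW hr.published.headcount_by_country
  SET TAG governance.tags.shareability = 'open';
```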
Ben: I'll just add one thing to what Moritz said about the challenge being execution, and I totally agree. One thing we've done is audit the shareability levels every few months, and just recently we changed a significant percentage of them. Some were made more restricted; most were actually made less restricted. That's helped us learn: maybe we need to change the kinds of questions and analysis we do on the front end so that we have less auditing to do in the future. We're always iterating and evolving.
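A periodic audit like the one Ben describes could be driven by a query over Snowflake's tag metadata. A minimal sketch, assuming shareability is recorded with the hypothetical tag above (ACCOUNT_USAGE stores tag names in uppercase):

```sql
-- Hypothetical audit query: list every shared object and its current
-- shareability tag so classifications can be reviewed periodically.
SELECT
    object_database,
    object_schema,
    object_name,
    tag_value AS shareability
FROM snowflake.account_usage.tag_references
WHERE tag_name = 'SHAREABILITY'
ORDER BY tag_value, object_database, object_schema, object_name;
```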
Moritz: We do have a tendency, and I'm not sure how it is on your side, Ben, that, at least in the past, too many people had a "my data" mindset instead of an "our data" mindset. This is something we're trying to crack and change, and that's the cultural thing you mentioned earlier, Ben. But of course we have to adhere to legal and other restrictions, without any compromises.
We try to have a mindset that, when in doubt, you share. Of course, no one would share personal data with names and things like that, which would allow you to infer things you shouldn't be able to infer.
There are also different laws in different countries. We deal with data in China completely differently compared to data from Europe or the U.S. So I think that's a constant challenge, and I don't see how to solve it permanently. It's a case-by-case decision, made while adhering to the framework that is in place.
Just to bring this down to the implementation level for one second: what we say, technically, is that the data product owners, or the data asset owners, are in the end responsible for deciding what security level and what restrictions apply to their data product when sharing it with the organization.
Looking 10 years out, what do you hope will be true for the data industry?
Ben: I'll start by saying 10 years is a long time, and just look at how much has changed in the last 12 months with large language models, so it's hard to imagine 10 years, but I'll give it a crack.
I think one is greater democratization, so it becomes easier to bring data together, make it available to find and discover, and easier for people to use. That'll come through tools.
Another thing I'd like to see is greater data literacy. Sure, you can make it easier to access the data, but then more people across organizations need to understand the importance of data, how to use it, what it means, and how to interpret it for decision-making. I think that's how data will come to be used much more broadly.
Then, finally, I think we're going to be challenged in the coming years with ethical AI: frameworks for minimizing not just bias but disinformation, which is what worries me a little more.
Moritz: I agree with everything you said. Your question reminds me of a discussion I recently had with a colleague who asked me: how do we actually realize when we are data-driven? What would it be like waking up in the company the next day and suddenly everything is data-driven? We had a laugh at first because no one had a really good answer.
What I would hope to see is a meeting of a bunch of people about a certain topic, and they're all informed. They know what they're talking about. No one is guessing. If something is unclear, we can look it up on the fly somewhere in, I don't know, our data repository. I'm not sure whether we will get there in 10 years, but this is how I hope things develop: that we take a huge step toward a state where gut feelings or decisions by process are no longer predominant, and whenever there's a decision to make, we just look at the data. That would be something I hope to see in 10 years.