Ep 54: Data Vault and Enterprise Data Modeling (w/ Brandon Taylor and Michael Olschimke)
Data Vault Modeling and Agile Data Warehousing
In this episode of The Analytics Engineering Podcast, we talk about Data Vault. If that's a new term for you, it's a data modeling design pattern. We're joined by Brandon Taylor, who is a senior data architect at Guild, and Michael Olschimke, who is the CEO of Scalefree, the consulting firm whose co-founder Dan Linstedt is credited as the designer of the Data Vault architecture.
In this conversation with Tristan and Julia, Michael and Brandon explore where the Data Vault approach fits among data warehouse design methodologies. They discuss Data Vault's adoption in Europe, its alignment with data mesh architecture, and the ongoing debate over Data Vault vs. Kimball methods.
Read the key takeaways from this conversation below.
What is the high-level architecture of a Data Vault?
Michael: First of all, Data Vault is a modeling technique, but it's also a concept for building enterprise data systems. Let's say your customer data is super confidential and can't be loaded into the cloud, while the rest of your data streams into the cloud. Or you follow a multi-platform, multi-cloud strategy where you don't want dependencies on any one of the cloud providers: some data sits in Azure, some in Google Cloud, some in Amazon. You want to distribute the enterprise data, and those are requirements we can meet with Data Vault.
The other thing is conceptual: the building block is the key idea here, and that's Data Vault in a nutshell. We have an incoming data model, which is more or less the source model; the structure changes a bit. What we do is break the data set into the fundamental components of the data.
All of your data, your enterprise data but also your personal data on your favorite social network, has three fundamental components: business keys, relationships between business keys, and descriptive data. Those are the building blocks of your data, and we capture them in three base entities: hubs for business keys, links for relationships, and satellites for descriptive data.
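To make those three building blocks concrete, here is a minimal SQL sketch of a hub, a link, and a satellite for a hypothetical customer/order domain. The table and column names are illustrative, not from the episode, though the hash key, load timestamp, and record source columns follow common Data Vault 2.0 conventions:

```sql
-- Hub: one row per unique business key (hypothetical example)
create table hub_customer (
    customer_hk   char(32)    not null,  -- hash of the business key
    customer_id   varchar(50) not null,  -- the business key itself
    load_dts      timestamp   not null,  -- when the key was first seen
    record_source varchar(50) not null,  -- which source system it came from
    primary key (customer_hk)
);

-- Link: one row per unique relationship between business keys
create table link_customer_order (
    customer_order_hk char(32)    not null,  -- hash of both keys combined
    customer_hk       char(32)    not null references hub_customer,
    order_hk          char(32)    not null,  -- references hub_order (not shown)
    load_dts          timestamp   not null,
    record_source     varchar(50) not null,
    primary key (customer_order_hk)
);

-- Satellite: descriptive attributes, historized against the hub's key
create table sat_customer_details (
    customer_hk   char(32)    not null references hub_customer,
    load_dts      timestamp   not null,  -- every change adds a new row
    name          varchar(100),
    email         varchar(100),
    record_source varchar(50) not null,
    primary key (customer_hk, load_dts)
);
```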
The idea of Data Vault is to get data as quickly as possible into the database, and it's a very physical model in that sense. Dump the data into the database, but organize it at the database level, so you can quickly derive any target model you want, like the red Lego ship, and so on. You start with blue bricks, raw data as we call it, and we use transformation rules in a business vault to turn the blue bricks into red bricks: applying business rules, for example, for cleansing data, for turning foreign currencies into a leading currency, and so on. That's the flexibility. We can take the data and turn it into anything we want, or anything the user wants.
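As a rough sketch of that blue-bricks-to-red-bricks idea: a business vault object can be as simple as a view over the raw vault that applies one business rule, here converting order amounts into a leading currency. The table names, the rates table, and the choice of EUR are all hypothetical:

```sql
-- Business vault sketch: a computed satellite that applies one business
-- rule (currency conversion) on top of raw vault data. All names are
-- hypothetical; sat_order_details is a raw satellite on hub_order.
create view biz_sat_order_details as
select
    s.order_hk,
    s.load_dts,
    s.amount,
    s.currency,
    s.amount * r.rate_to_eur as amount_eur,  -- the leading currency
    s.record_source
from sat_order_details s
join ref_exchange_rates r
  on r.currency = s.currency
 and r.valid_on = cast(s.load_dts as date);  -- rate as of the load date
```

The raw satellite stays untouched, so if the business rule changes, you redefine the view and rebuild downstream models without reloading any data.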
What was it like implementing Data Vault at Guild? How did you go about taking the first step of getting towards a best practice here?
Brandon: We do a really interesting thing where, for the first two weeks of your onboarding, you go through a sort of tour of duty, talking to the different stakeholders you're going to be working with regularly, just to get a sense of what the problems are.
I heard three things over and over from everyone I spoke with. First, we had no way to reproduce any of our historical reports. Everything in the data warehouse was refreshed in place, and we couldn't recreate our reports.
Second, we had a microservice-based organization. We were breaking apart these different monoliths into smaller and smaller services, and at that point we were at around 70 or 80 different source systems, and the data teams just couldn't keep up with that rate of change. Suddenly our engineering squads were having to do dual writes to old systems and new systems, and that was extra effort and work on their side.
Third, every business user I talked to said that if you asked two developers for a report, you were probably going to get three different answers.
So, one of the first things I did was pull up our dbt DAG, and it was the most beautiful mess of spaghetti, just lines everywhere. I said, "we need to bring some process to bear on this." Originally, when I came to Guild, I honestly expected to be able to just do a dimensional model and be good, since it's a relatively light industry.
I didn't expect all of the complications that we have, but after hearing those issues, Data Vault was really the right choice: historical versioning just comes out of the box, and you can change source systems on the fly, quickly and easily, as part of the process.
And as long as we have a common process for how we're building, I'm not as worried about developers deriving different information, because we're all looking at the same data and the same models. To start the journey, it was a lot of enablement, a lot of data literacy work. I sat down with the teams for about a week and just walked through what a hub is, what a satellite is, what a link is. In parallel, I started working with the business to better understand what we do. There's a book by Lawrence Corr called "Agile Data Warehouse Design" that I love to use for gathering requirements.
I leveraged his BEAM process to figure out what we do: our key business objects and our key processes. I took that and started building a very high-level, top-down view of what I thought the data warehouse should look like, just looking at hubs and links.
What are our business objects? What are the processes where those different business objects relate? I showed that to the team, and it started clicking for them: okay, we get some of our member data from this system.
They were able to take that top-down view that I built and, based on their bottom-up understanding of the source-to-target mappings, go and scale by building the right satellites to get the derived attributes for each of those objects and processes.
We also found the dbtvault package incredibly useful, just for creating all of the metadata needed for the Data Vault. From there, we started projecting that data out into facts and dimensions, and that's really where we are today: building out that single source of truth and shared understanding of our data.
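To illustrate that projection step, here is a hypothetical view that derives a Kimball-style dimension from the raw vault tables sketched earlier: each customer's business key from the hub, joined to the most recent descriptive row from its satellite. This is a sketch, not Guild's actual code:

```sql
-- Project a dimension out of the (hypothetical) raw vault: the hub
-- supplies the business key, the satellite supplies the latest attributes.
create view dim_customer as
select
    h.customer_hk as customer_key,
    h.customer_id,
    s.name,
    s.email
from hub_customer h
join (
    select
        customer_hk,
        name,
        email,
        row_number() over (
            partition by customer_hk
            order by load_dts desc
        ) as rn
    from sat_customer_details
) s
  on s.customer_hk = h.customer_hk
 and s.rn = 1;  -- keep only the most recent satellite row per customer
```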
Why do data warehouse practitioners often involve themselves in specifying high-level frameworks and architectures?
Brandon: The most important part of a data warehouse is that it reflects your business.
With software engineering, you have domain-driven design, and within your domain you might have a very specific concept of what a customer is or what a prospect is. But in data warehousing, once the business wants to know how many customers they have, or what the sales funnel looks like, you have to be very specific in your build about what a customer is.
What are the business rules that define a customer? In data warehousing, we need to come to that shared understanding for the business and make sure that we're all speaking the same language. We don't want marketing to see that we have 10,000 customers while sales sees we have 9,000. Suddenly there's miscommunication and we don't all understand our numbers. Your leadership starts to see different metrics, and things go off the rails really quickly.
So I think, first and foremost, all of your data warehouse implementations have to match the business, and you need that shared understanding of your business rules and of what your metrics should really look like across the business.
But then secondly, there's this concept of data literacy at scale. If you want your business users to get into your BI dashboards and start playing around and self-serving with your data, you don't want them to have to cobble together a view of customers across 10 different systems. They might join data incorrectly, and there might be different ways of viewing that data. But if we come to a shared understanding, like a dim_customer in a Kimball model, one source of truth for customers across the business, then you can onboard a new analyst and say: start with dim_customer.
That's where we get our data for this particular business object; or start with a specific fact table to get data about a specific process. That ensures we're getting the same metrics across the organization and don't spiral out of control. We can instead say: we're all looking at the same view of the data.
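As a small illustration of that single starting point, assuming the hypothetical dim_customer above plus a matching fct_orders fact table, any two analysts derive the same numbers because they query the same conformed objects:

```sql
-- One shared definition of "customer": everyone counts the same way.
select count(*) as total_customers
from dim_customer;

-- Process metrics come from a fact table keyed to the same dimension.
select
    d.customer_id,
    sum(f.amount_eur) as lifetime_revenue
from fct_orders f            -- hypothetical fact table
join dim_customer d
  on d.customer_key = f.customer_key
group by d.customer_id;
```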
I love Michael’s Lego blocks. As long as we're all assembling the same view of a yellow tower or a green car, we're all looking at the same thing. But if you ask everyone, here's your green Legos, go build a car, you are going to get different cars.
Looking 10 years out, what do you hope will be true for the data industry?
Michael: Ten years from now, I would like us to have data processing where you can be confident in the outcome because you are essentially applying well-understood patterns to the data.
Brandon: I love that you used the word hope there, because I could probably give six or seven different doomsday scenarios for what data could look like in the next 10 years. But for what I really hope to see, I've got two things, and they're interrelated. I would love to see us take our data skill sets and bring them more to bear on leveraging data for social good.
We do so much today with scaling businesses and growing that bottom line. I'd love to see us move away from that and think less about sales funnels and more about how we can use data to funnel people out of poverty.
And I think we do that through my second hope, and that's more data literacy across the world. I think everyone needs to understand better how to use data, how to interpret data, without needing a statistics degree. All the way to being able to self-serve on some of the most basic open source data that's out there today so that we can understand what it means, how to interpret it, how to properly join it together, and derive the same insights together.