Data modeling for collaboration
Do any of us feel like we're doing this well? Also this week: making data actionable, the coordination challenge of large engineering teams and what a warehouse is not.
In this issue, I’ll recap an interesting question from a recent #analytics-craft meetup, make the case for data modeling as a collaborative activity, and talk about what it means to me to do data modeling well.
Also in this issue:
Randy Au on Data Management is Context Management
Sarah Krasnik on Sales and Marketing and Dev Ops, Oh My!
Eric Weber on Making Data Actionable
Lorin Hochstein on Software engineering in-the-large: the coordination challenge
Naveen on Versatile Binning in MySQL
Enjoy the issue!
PS. Coalesce is back and this time we’re going global and in-person. Join us online or in person in New Orleans, London or Sydney this October. Early bird registration and CFP are now open 🎉
Data modeling for collaboration
We’ve been talking a lot recently in this newsletter about the need to hear more practitioner voices, and to hear more from those practitioners. Last week, we hosted the very first virtual meetup for folks practicing the craft of analytics (Slack channel, if you’re curious: #analytics-craft). It was a rich discussion, maybe even a little cathartic, and we’re definitely doing it again.
One question stood out for me among the others in this discussion:
Does anyone feel like they’re doing data modeling well?
It stood out to me both because it suggests few folks do, and because it made me think about what it means to do data modeling well. Just as every initiative needs a good set of problem statements and success measures, we need to define what doing data modeling well means before anyone can feel like they’re doing a good job.
We have several philosophies for how to do data modeling: Kimball and star schemas, Inmon’s integrated warehouse approach, and hybrids of the two. They pre-date most of the workflows we use in data today, but their age is irrelevant. Comparing Kimball and Inmon is kind of like comparing object-oriented and functional programming: neither is necessarily more right than the other, and very often the choice between the two depends on your objective. Both are great methodologies to build our craft on, but neither is a good measure of success.
To figure out the right measures of success for our collective practice, we need to focus on the objectives of our data modeling activity.
Are we trying to make querying faster? Not anymore.
I personally spent a lot of my time on this before cloud warehousing solutions became widespread. I’m sure you did too. Munging big data used to be time-consuming. These days, faster query time is more a function of warehousing cost than of human hours spent, and we kind of like it that way. Instead, we are limited by how fast our EL layers bring in data, by the structure of our data (how often is it produced and how often does it make sense to update it?), and by how much documentation we can write to help folks find what to look for.
Are we trying to make it easier for people outside the data team to use data? Yes, but how well we do data modeling is somewhat orthogonal to this.
We call this self-service analytics, and we frequently measure our success by the success of our customers (e.g. how many humans outside of the data team use data). Having customer-oriented measures of success is a really good thing, but this is only part of the equation. Poor software engineering practices can be very detrimental to the customer experience. But great software engineering alone isn’t going to make a product successful. It also needs good product management, marketing, documentation and so on. If we are measuring an engineering team’s performance by the success of their feature, we’re measuring the wrong thing. Not because it isn’t important to measure customer success with our products (data or otherwise), but because we are measuring something the team doesn’t have full influence over. Poor data modeling will similarly lead to customer pain, but perfecting your modeling craft isn’t automatically going to make customers use your data more.
The reason we do data modeling today is to create shared organizational meaning through data. In other words, we model data to represent important business constructs and we are expressing this organizational knowledge as analytics code. Important business constructs are shared across the business, and therefore organizational knowledge represented as analytics code must also be shared across the business.
I’m not suggesting everyone in the company needs to learn data modeling. What I am suggesting is that there are likely multiple business units working with data (various types of analysts, data scientists, operations and finance humans) who benefit from shared representations of business constructs being in code. Sharing a code base goes beyond sharing a directory structure while continuing to work in separate data marts. Sharing a code base means working together on code representing the same objects, on the same lineage, and on the same metrics, to help avoid duplication and encourage standardization.
Data modeling in its ideal form is a deeply collaborative activity, just like software engineering. This is why we need version control for our analytics code, why it’s important to have good branch, commit and syntax hygiene, and why all data teams should have a CI/CD workflow when deploying their analytics code.
And if you buy that collaboration on analytics code is our objective, there is one clear measure of success:
Can other people take your code and build on it?
Is it easy for someone to navigate your data model? How easy is it to refactor your analytics code? Can someone from a different function easily pick up and add to code you’ve written, or is it easier for them to fork and work alone?
“I don’t have these problems”, you might say. “I have a small team”. That may be true, but your business is growing. And pretty soon there will be lots of other functions building on your code base with you. Are you ready for them? ;)
Luckily for us data folk, building a maintainable codebase that is easy to collaborate on is a well-worn path for our software engineering counterparts. Here are just a few articles that sparked ideas for me:
How to design your code base for a growing team. This article talks about how to organize your code and employ naming and documentation conventions to make it easier for others to build on your work.
How to write testable code and why it matters. In this article, the author goes through some very basic examples in C# to demonstrate what testable code looks like. If you’ve been struggling to think about how to write tests for your pipelines, you might find some inspiration here.
Low coupling, High cohesion. A deep dive on code cohesion and coupling (i.e. like things are grouped together, modular, and can be easily decoupled for refactoring).
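To make the last two ideas a little more concrete for analytics code, here is a minimal Python sketch. It is not taken from any of the articles above, and the record shape and function names (`Order`, `net_revenue`, `revenue_by_customer`) are hypothetical. The only point is that business logic kept in small pure functions, separate from loading and writing data, is easier for someone else to test and build on.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    # Hypothetical record shape, for illustration only.
    customer_id: str
    amount: float
    refunded: float = 0.0


def net_revenue(order: Order) -> float:
    """One shared definition of a business construct: revenue net of refunds."""
    return order.amount - order.refunded


def revenue_by_customer(orders: list[Order]) -> dict[str, float]:
    """Pure aggregation: no database or file access, so it can be tested directly."""
    totals: dict[str, float] = defaultdict(float)
    for order in orders:
        # Reuse the shared definition instead of re-deriving it here.
        totals[order.customer_id] += net_revenue(order)
    return dict(totals)


if __name__ == "__main__":
    sample = [
        Order("a", 100.0),
        Order("a", 50.0, refunded=50.0),
        Order("b", 20.0),
    ]
    print(revenue_by_customer(sample))  # {'a': 100.0, 'b': 20.0}
```

Because `revenue_by_customer` takes plain records in and returns a plain dictionary, a teammate from another function can check it with a few assertions, or change the definition of `net_revenue` in one place, without touching whatever reads from or writes to the warehouse.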
I’m interested to hear from folks reading this — how do you know you’re doing data modeling well? What measures of success do you look at? What do you do to make it easier for others to build on and extend your analytics code?
Elsewhere on the internet…
I enjoyed the perspective in Randy Au’s article this week, Data Management is Context Management. Randy talks about how raw data comes with lots of rich context and how creating structured data (whether by modeling data, creating visuals or something else) necessarily destroys that rich context in very intentional ways. When you’re modeling data, you’re intentionally leaving behind just enough context to solve a variety of problems but removing a lot of what you determine to be unnecessary detail. When you’re creating a visual, you are focusing on a tiny sliver of the data you started with and telling a story with it. What I really like about Randy’s framing, with its emphasis on storytelling alongside the data we refine, is the reminder that data does not speak for itself. The more context we remove, the more we need to tell the story of the data that is left.
Did you know Bill Inmon has a blog? In a recent piece, What a Data Warehouse is NOT, he shares some snarky takes on data lakes and ELT and why they alone don’t make a warehouse. There are things I agree with in here, especially the need for data that is integrated and for data science teams to participate in the exercise of modeling data. And plenty of things I don’t agree with — like users forgetting the “transform” part of ELT ;)
Sarah Krasnik breaks down different operations jobs that often have confusingly similar names in Sales and Marketing and Dev Ops, Oh My!
In Making Data Actionable, Eric Weber writes about maintaining empathy for your data customer and reminding yourself to build for them. The 10 second rule resonated with me: is the insight you developed easy for someone to take action on? Talking to ‘non-adopters’ is also a highly underrated practice. Instead of focusing on what our audience lacks, and talking about ‘data literacy’ as the barrier to adopting data in decision making, let’s talk about what roadblocks we should tear down to help our audience get there:
What about the people who would have used it but were blocked for some reason? Were there any access issues? Were they confused at the takeaway message? Did they not know how to find what you created? If you don’t think about the potential audience and only curate for the existing audience, your product will never be as powerful as it could be.
Lorin Hochstein offers a very helpful framework for the challenges of scaling our work in Software engineering in-the-large: the coordination challenge. Though their writing is focused on engineering teams, I think the dynamic described in this piece applies to most growing organizations. (h/t Vicki Boykis for finding this one!)
A new practitioner blog in the house! 🎉 If you are working with MySQL in your practice, you will find several useful pieces here like Naveen’s latest on Versatile Binning in MySQL.
That’s it for this week! 👋