Making numbers match
The life-changing magic of tying out numbers, or how to get to trustworthy data from first principles.
One of the best data humans I’ve worked with started her data career as an accountant. I’ve learned a lot from watching her turn over proverbial data rocks on her quest to “make the numbers match”, something she said to me often with a kind of sparkle in her eye. In the process, she would help optimize millions of dollars in revenue, identify all kinds of billing failures and errors, and make a significant impact on the operations of the business. (One day, perhaps we can collectively convince her to guest author a post in this newsletter, but in the meantime I hope she forgives me for putting her on the spot!)
This week, Paige B. and I were comparing notes on how we get to ‘trustworthy’ data in situations when ‘good data sources’ aren’t (yet) well known or documented. This is a pretty important skill for a data professional — it’s our job to know the right way to get to the right answer, and then make it easy for everyone to get there. But what do you do when nobody knows the right answer (yet)? You… figure it out.
The conversation made me think about my former colleague, and what I’ve learned from her and other humans I’ve worked with over the years about getting to ground truth.
Ground truth isn’t a single place
It can be discouraging to look across your data model and find several datasets that should match up, but don’t. As you investigate, you realize they all make different assumptions, they were all made at different times, and they all leave you wondering — which is the right number? The answer can very well be — none of the above.
That sounds scary, but it’s actually a great place to start. Here’s what I do when that happens:
I start by joining on common unique keys and counting things (unique ID counts, simple sums of numeric values associated with those IDs). This tells me which dataset is more complete — has the most stuff in it — and I use that as the dataset I join everything else to.
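A minimal sketch of what that first pass can look like, assuming pandas and made-up names (orders_a, orders_b, order_id, amount) standing in for two extracts of the same concept:

```python
import pandas as pd

# Two hypothetical extracts of the same "orders" concept from different
# systems; every name and value here is made up for illustration.
orders_a = pd.DataFrame({"order_id": [1, 2, 3, 4], "amount": [10.0, 20.0, 30.0, 40.0]})
orders_b = pd.DataFrame({"order_id": [2, 3, 4, 5], "amount": [20.0, 30.0, 40.0, 50.0]})

# Row counts, distinct keys, and simple sums: a cheap first read on
# which dataset has the most stuff in it.
for name, df in [("system A", orders_a), ("system B", orders_b)]:
    print(name, "| rows:", len(df),
          "| unique ids:", df["order_id"].nunique(),
          "| total amount:", df["amount"].sum())
```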
Then I try to figure out what’s different and why. Which IDs exist in one dataset but not the others? Why are they missing? What assumptions are being made, and are they valid for the question I’m trying to answer?
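An outer join with pandas’ merge indicator is one way to surface the one-sided IDs, continuing the hypothetical frames above:

```python
# The _merge indicator column labels each row 'left_only', 'right_only',
# or 'both', which makes the missing IDs easy to pull out.
compared = orders_a.merge(
    orders_b, on="order_id", how="outer",
    suffixes=("_a", "_b"), indicator=True,
)
only_in_a = compared.loc[compared["_merge"] == "left_only", "order_id"]
only_in_b = compared.loc[compared["_merge"] == "right_only", "order_id"]
print("only in system A:", only_in_a.tolist())
print("only in system B:", only_in_b.tolist())
```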
The ‘why are they missing’ can be a deep rabbit hole! My first assumption is usually different filtering, or a mistake in a join or elsewhere in the logic, but sometimes everything looks suspiciously fine. In those situations, I find it helpful to go straight to the source.
Get as far upstream as possible
Is the column or value that is different across my data sources user facing in some way? In other words, can I see what the customer is seeing? Admin mode on a customer account is of course the ultimate “upstream”, but isn’t always an option depending on your organization’s security policies and the locations your business operates in.
That said, if your business uses your own software/services internally, there’s one customer you should have access to — yourself! What does the recorded data look like for you, as a user, when you perform basic actions? Does it match the activity you expect? Why/why not?
Other forms of ‘upstream’ that may be useful:
looking at the code that generates the event/action/value you are expecting. Is there documentation in the code base? What is the git history of that file? When was this bit of code last touched and in what context?
the source system generating your data (e.g. Salesforce for sales bookings, Stripe or your billing platform of choice, etc.). Grab a handful of those unique IDs that don’t seem to consistently show up in your data, and look at the last few months of records for the relevant customers to see if anything jumps out at you (a quick way to pull that sample is sketched after this list).
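Purely illustrative, reusing the only_in_a series from the join sketch earlier:

```python
# Sample a few of the IDs that only show up on one side, to look up
# manually in the source system (Salesforce, Stripe, etc.).
handful = only_in_a.sample(n=min(5, len(only_in_a)), random_state=0)
print(handful.tolist())
```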
Timing is everything
And no, I’m not talking about time zones (although you should most certainly definitely absolutely rule out time zones every time).
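Ruling them out usually just means normalizing everything to UTC before comparing. A tiny sketch, assuming one system logs naive local timestamps (the zone and values here are assumptions):

```python
import pandas as pd

# System A logs naive local timestamps, system B logs UTC; the same two
# events only line up once both sides are in the same zone.
local = pd.Series(pd.to_datetime(["2023-01-05 18:30", "2023-01-05 23:30"]))
utc = pd.Series(pd.to_datetime(["2023-01-05 23:30", "2023-01-06 04:30"], utc=True))

local_as_utc = local.dt.tz_localize("America/New_York").dt.tz_convert("UTC")
print((local_as_utc == utc).all())  # True once the zones are reconciled
```

Note that the second event lands on a different calendar day in UTC, which is exactly how day-level aggregates quietly diverge.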
I’m talking about figuring out when records started to diverge — can you pinpoint a specific moment? Is it only the most recent records? Or a specific period?
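Continuing the hypothetical join from earlier, bucketing the one-sided records by a (made-up) created_at timestamp is a quick way to see whether the mismatches cluster in a specific period:

```python
# Attach an illustrative created_at to each row of the compared frame,
# then count one-sided records per week to spot when divergence began.
compared["created_at"] = pd.to_datetime(
    ["2023-01-02", "2023-01-09", "2023-01-16", "2023-01-16", "2023-01-23"]
)
one_sided = compared[compared["_merge"] != "both"]
print(one_sided.groupby(one_sided["created_at"].dt.to_period("W")).size())
```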
This can help you narrow down what to look at when you look further upstream. And it will help you when you finally, inevitably
Talk to other humans
Specifically, other humans outside of your immediate team who are closer to the source data. What do they know about the records that don’t match up? What do they know about the time period when the divergence occurred?
Going through all of these steps usually yields a lot of contextual information and helps me to not only feel confident in what the ‘right’ number is, but also to understand the business better. Going through this exercise and digging deep is how you’ll start to be able to quote numbers off the top of your head — “oh yeah we had X customers in Y period doing Z thing because… “
Getting to ground truth quickly
It’s easy to spend a long time digging through source systems and production code. Ideally, you’re not doing this the day before business close. I find it helpful to go through this exercise during periods of down time when I have the time to follow up on a thread. That way, when the business needs that quick answer — it’s already at my fingertips.
Other things worth reading:
Learning: The Meta-Skill for Accelerating Impact, by Caitlin Moorman, is a masterclass on self-awareness in the workplace and how to take control of your personal growth. An extra 💜 for writing out what “good” looks like for each of the core learning skills.
Katie Bauer’s recent Twitter thread on speaking data plainly, if you still partake in the blue bird site. It would seem Twitter embeds are broken this week, so here’s a screenshot (and I guess see you on Bluesky maybe?)
Bees and Pesticides: What you should know, by Jessica Johnson, is a candid look at pesticide production and why moving to organic is hard. It also comes with an infographic — love!
The ultimate guide to churn reporting, by Olga Berezovsky, is your one stop shop for churn measurement and reporting in these recession times.
Git 101 by Tae’lur Alexis: How To Create a Branch From Master and Make Your First Commit!
Data + Poetry = art. What happens when you ask ChatGPT to make some data visuals and write some poems.
I would say that truth is really just a form of consensus - "you agree and I agree and she agrees so we're all agreed". If two metrics fail to reconcile, it's the same as a lack of consensus. And often we have to achieve consensus across many things (this system, that system, their intuition, etc.) in order to say something is "true". In many ways truth is often quite political.