What's haunting your data warehouse?

🧟 Zombie data. 🕸️ Knowledge spiderwebs. 🧙‍♀️ Data discovery witchcraft.

Sam Bail is all of us this Halloween weekend:

This year I’m going to drape a sheet around my shoulders and go as the Data Fabric. Don’t y’all invite me to your parties at once 😏

Aaaaanyway… I really enjoy when the data articles I read in a week come together to tell a bigger story. This week, a Halloween-themed edition of the Roundup is all about the 🏚️ ghouls that haunt your data warehouses 🏚️ and 🦕 the monsters that live deep inside your data lakes 🦕. Featuring writing from Anu Sharma and Dr. Ernie, and a couple of good ♾️ Meta jokes for good measure… so drop everything, Venkman!

-Anna


🧟 Zombie Data

The first chapter in our Data Horror Stories comes via the intrepid Dr. Ernie, and starts something like this:

Please get me the latest version of <random Excel file I have never seen before, named using idiosyncratic or ambiguous words>. Oh, and I need it tomorrow or else we won’t {make our numbers | pass our audit | satisfy the board}.

👀👀👀 😱

If this were a horror movie, this would be the part where the audience screams “don’t open that spreadsheet!” as our data protagonist inevitably descends into layers of abandoned analyses, disjointed dashboards, and mucky models to try to reproduce the file in question.

I appreciate Dr. Ernie’s real talk on the problem of infinite spreadsheets floating around in the metaverse (I had to, sorry) — because we’ve all been there at least once in our careers. A scary number of business processes are built around ephemeral apparitions like context-less spreadsheets. Dr. Ernie calls this zombie data because it:

  • Lacks any self-awareness

  • Doesn’t remember where it came from

  • Has no relationship to its current context

  • Infects everyone it touches with that same mindlessness. 🤭🔥

Zombie data doesn’t just look like a random Excel file. Zombie data is also: a dashboard with outdated targets; a one-off analysis in a notebook, devoid of context on what decision was being made; a data model with no maintainer; a really old view, created by someone who no longer works at the company, that happens to have a useful name like users.

Ok but why is this a problem in the first place? Just build a dashboard, make sure it has a date and author on it, set it to auto-refresh, maybe put some tests in place. Clean up stuff that hasn’t been used in a while. Done and dusted.

Not quite. To understand zombie data, we have to look at it from the point of view of the person using it. Zombie data exists because the context in which it is produced is vastly divorced from the context in which it is used. Your data team wants to work in Tableau, Tableau wants everyone to work in Tableau, but your data consumers live in their e-mail. People don’t generally want to remember to look at dashboards. People want data to meet them where they are, be that e-mail, Slack, Teams, a Business Review slide deck or some other organizational lingua franca.

I’ll offer up one more property of zombie data alongside Dr. Ernie’s other four:

  • Zombie data is hard to kill 🔪

As the amount of data usage in your organization grows, so will the amount of zombie data. Because it is contextless, and spreads through communication channels that are inherently neither transparent nor globally searchable, it is really hard to get rid of. You keep exorcising it, but it just keeps coming back. 🧟

🕸️ Knowledge Spiderwebs

Chapter 2 of this weekend’s data horror stories: when you accumulate enough zombie data your organization starts to get tangled up in knowledge spiderwebs.

On the surface, more data is a good thing. As Anu Sharma eloquently points out, seeing the same problem from different angles helps us better understand the problem we are solving, and make better decisions:

Our prior understanding of historical data and the mental models we use reward us with unique competitive advantages. The more we triangulate with multiple datasets, the more multi-dimensional our mental model becomes, especially when it comes to business decisions. If business is an arms race, data is ammunition that compounds. (Emphasis mine)

The problem is that the tools we have available to us today don’t do a good job transferring the context associated with the data. If humans are “Bayesian folk”, as Anu calls us, then priors are vitally important to us because:

Data tells us a different story based on what we already know.

THIS. Data is only meaningful relative to other data. A dashboard on its own will not help you make better decisions. And without additional context, choosing between two charts showing vastly different projections for the same period is downright impossible.

If your organization is overrun with zombie data, it becomes really hard to assemble sufficient context to develop these priors, especially for folks who don’t spend all day looking at numbers. Instead, they get trapped in a spiderweb of would-be knowledge full of tenuous links, hoping that:

  • the folder in which the dashboard lives might give them some clues as to who made it…

  • the e-mail chain preceding the spreadsheet that arrived in their inbox might give them some clues about where it came from…

  • or maybe they’ll find a well-curated data report complete with dates and authors, but it was made two years ago and none of the authors work there anymore. Whomp whomp.

🧙‍♀️ Data Discovery Witchcraft

The third, and final chapter of our data horror stories: when organizations are overrun with zombie data, and when building data priors involves getting stuck in an intricate knowledge spiderweb… seeing someone successfully discover meaningful data feels indistinguishable from magic. Pure witchcraft.

Ever onboard someone and tell them to just “go mess around in Looker and see what’s there?” 🙋

Ever feel like the only way to find out what a field in a database means is to ask the engineering team that built the feature? 🙋

Ever find a compelling slide deck describing a thorough business overview and wish someone put in links to all the data sources? 🙋

We analytics engineers like to think of ourselves as data librarians. We enjoy dusting off the cobwebs and cleaning house. We lovingly file, organize, and classify reams of organizational knowledge and are excited to point you at the right shelf.

But what if that’s not how people want to find things anymore? What if a digital native in the era of Google no longer finds the physical placement of things a useful or intuitive mode of organizing information?

The directory structure connotes physical placement — the idea that a file stored on a computer is located somewhere on that computer, in a specific and discrete location. That’s a concept that’s always felt obvious to Garland but seems completely alien to her students. “I tend to think an item lives in a particular folder. It lives in one place, and I have to go to that folder to find it,” Garland says. “They see it like one bucket, and everything’s in the bucket.”

I was thinking about this a lot this week as Dropbox announced its intent to acquire Command E, an application that enables you to search across all of your cloud documents. This is a major coup for Dropbox, because searching all the things is the primary interface of the (not so) new millennium.

What is the equivalent in the world of data tooling? Much has already been said about our fragmented data experience. I won’t rehash that here except to say that right now, our best solutions are either 1) do all your data things on one integrated platform or 2) ask a friend in another department. A Command E for data doesn’t exist (yet!).

What if the thing that we need to truly kill zombie data, to properly dust off the cobwebs from our analyses, is not better organization? What if it is helping our data customers build better priors by surfacing rich context alongside data reports, enabling seamless navigation across different places data is being used, and getting comfortable that there are many pathways and data sources that may lead to an answer?

Truly frightening :)


Elsewhere on the Internet…

A thoughtful writeup from Bryan Offutt on four components of successfully scaling complex work to larger teams. No spooky data stories in this one, but I’m excited to come back to it in more detail soon!

I always enjoy reading about how different data teams are structured, how they organize their work, and how they solve some of the challenges we discuss in this Roundup. Earlier this month, Prukalpa shared a really detailed look into the inner workings of Postman’s data team — a longer but worthwhile read!

And finally, as promised, my favorite ♾️ meta memes:

  • Runner up:

  • My absolute fav:

Until next time! 👋