The real liability of data at scale
... is communication. Also: designing a portfolio based on effort vs value and preparing for your most important business meeting.
👋 Before we get into our usual coverage this weekend, I want to introduce you to a few really cool humans:
Shinya, Anya, Bruno, David, Karen and Emily 💜
They are the first batch of dbt Developers from all over the planet that we are featuring in a new section of our developer portal. If you see them around the community or the internet: say hi or give them a follow if you want to look out for cool content or projects from them.
My favorite part about this group of humans: the many different backgrounds and career trajectories each individual had before they became a part of this community. A long time ago now I wrote about how building successful data teams involves bringing together a bunch of different kinds of purple people. I can’t think of a better illustration today than these folks :)
Let this be your reminder to embrace the bits of your life experience and career that make you who you are today, and if it suits you, express them in data. You never know where that will take you!
Elsewhere on the internet…
How Scale Kills Data Teams
Chad is back on substack and I am here. for. the 🌶️ takes.
When most software engineering teams think of scale they imagine a surprising number of API calls or millions of records written to a database in a concerning period of time. For 99% of data developers, we don’t have that problem.
Scale for Data Teams is an issue of organizational complexity.
I can’t agree more strongly here. In my experience, organizational complexity is a few related but distinct things:
the number of hops on the org chart between the folks making decisions and the folks closest to the data;
the fracturing of tacit knowledge across departments as an organization grows and business units specialize;
and the growing complexity of the business itself which results in an exponentially growing number of inputs into your data stack, many of which require bespoke testing and monitoring… and were set up long before one’s time.
Do I think that the modern data stack is (in Chad’s words) a “liability” when teams grow from small organizations to greater scale and experience these pain points? Nah, and neither does Chad:
You might expect it will take a complete overhaul of the system and a total restructuring of how we think about data to fix it, right? Maybe it means hiring for new roles like a Data Product Manager or orienting the business towards data domains. Maybe it means imposing strict limitations on data consumers and producers or having weekly Data OpEx meetings rein in usage. It may mean getting rid of all the great tools that helped us move fast for some slower, heavier systems. It may even mean abandoning democratization entirely.
I disagree with all of these approaches. Moving fast is good. Data teams must create incentives for producers to care about how data is being used. Data governance must be iterative and applied at an appropriate level when and where it’s needed. The solution is surprisingly, painfully simple. Better data communication.
I think pieces of the data stack are still growing and evolving both alongside one another and the organizations relying on them. Seattle Data Guy’s decade in data engineering is a good, honest look at this evolution (so far!) and acknowledges that this is pretty much the normal evolution of tech in a new area. And I also think that better data communication is, in fact, The Thing that the collection of data tools developed in the last 5 years or so are working to solve from their various vantage points.
A few things I’m excited about that are poised to make 2023 a defining year for improving data communication:
a true headless semantic layer will mean that we finally stop solving the problem of many organizational hops by re-arranging people in an org chart, and instead, focus on defining shared constructs for the business in a knowledge layer. I’ve got an entire slide deck on what I mean by this that needs to be turned into a post, so stay tuned!
I’m very bullish on model governance as a primitive that will change the way business units codify knowledge in a data layer and share it with one another, at scale.
Folks are thinking about data contracts that enable governance at the periphery of the modern data stack as well — there’s a good overview of current thinking for batch and streaming paradigms here from Ananth.
There’s lots of work left to do in each of these areas to make these experiences delightful and integrated across the stack, but then, that too, is normal evolution of tech in a new area.
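To make the data-contract idea a little more concrete, here’s a minimal sketch in Python (all names here are hypothetical and mine, not from Ananth’s post or any particular tool): the producing team declares the schema it promises to emit, and a lightweight check flags records that violate the contract before they flow downstream.

```python
# Hypothetical sketch of a data contract at the edge of the stack:
# the producer declares a schema, and a check function reports
# violations before a record reaches downstream models.

ORDERS_CONTRACT = {
    "order_id": int,
    "customer_id": int,
    "amount_usd": float,
    "status": str,
}

def validate(record: dict, contract: dict) -> list[str]:
    """Return a list of contract violations for one record."""
    errors = []
    for field, expected_type in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors

good = {"order_id": 1, "customer_id": 42, "amount_usd": 9.99, "status": "paid"}
bad = {"order_id": "1", "customer_id": 42, "amount_usd": 9.99}

print(validate(good, ORDERS_CONTRACT))  # []
print(validate(bad, ORDERS_CONTRACT))
```

The interesting part isn’t the type check itself — it’s that the contract is an explicit, versionable artifact the producing and consuming teams can negotiate over, rather than tacit knowledge living in someone’s head.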
Meanwhile, brb, upgrading to dbt 1.5 🚀
Effort vs Value Curves
I love a good framework. I especially love a good framework that emphasizes delivering value.
When we prioritize our time, our team’s time, or our organization’s time (depending on your vantage point), it’s easy to fall into the trap of thinking in terms of lists or stacks: first this, then that, then this other thing. We codify the stack into sprints or some other time units, and monotonically chip away in order of priority.
Why is it that when we invest in other areas of our lives we know to diversify, but when it comes to investing our own time, especially data team time, we continue to think very linearly?
Optimizing for maximum value is another approach, but it isn’t the right prioritization mechanism for 100% of your available time either — our colleagues in engineering will tell us exactly how a short-term focus on value leads to mountains of eventual tech debt and quickly erodes velocity.
Over time, I’ve started thinking more in terms of a portfolio of effort investments:
some % dedicated to keeping the lights on: in your personal life, this might be prioritizing a workout, while on a data team, it might be firefighting data incidents and continually paying down tech debt.
some % dedicated to the things that generate the most value/drive the most impactful outcomes, but diversified based on rate of return (this is where John’s framework above offers solid gold): some immediate returns, balanced with medium-term, more complex work, and some proportion of extremely long-term bets that carry high risk due to the effort involved. In your personal life, that might be the difference between taking a new workout class to target a different muscle group and getting a new degree. On a data team, it might be the difference between low-effort, quick-turnaround, high-impact projects and a massive refactor or migration.
and finally, some % dedicated to relationships (catching up with friends and family in your personal life; 1:1s, meetings, and work that unlocks new relationships and supports existing ones on a data team).
In your personal life, allocating effort within these 3 buckets (and within the second bucket in particular) depends heavily on the stage of life you are in — e.g. the choice to get a degree is a fundamentally different one earlier in your career vs a decade or more in, and carries different levels of risk.
Similarly, on a data team, how you distribute your effort towards driving value is also a function of your business’ life stage and, to a large extent, the macroeconomic climate the business is operating in. Can you afford to wait two years to perfectly refactor or migrate your data before you can start generating value? How critical is it to deliver business value today, even if it means buying technical debt along the way? The answers to these questions aren’t at all absolute.
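As a toy illustration of the portfolio idea (the bucket names and percentages below are hypothetical, not prescriptions), you could sketch a team’s sprint capacity as an allocation across the three buckets:

```python
# Toy sketch: split a team's sprint capacity across the three effort
# buckets described above. Numbers are illustrative, not prescriptive.

SPRINT_HOURS = 200  # hypothetical total team capacity for one sprint

portfolio = {
    "keep_the_lights_on": 0.25,  # incidents, tech-debt paydown
    "value_generation": 0.60,    # a mix of quick wins and long-term bets
    "relationships": 0.15,       # 1:1s, meetings, stakeholder work
}

# A portfolio should account for all of your time, no more, no less.
assert abs(sum(portfolio.values()) - 1.0) < 1e-9

allocation = {bucket: round(SPRINT_HOURS * share)
              for bucket, share in portfolio.items()}
print(allocation)
```

The point of writing it down this way is that the percentages become an explicit, debatable choice — one you’d tune as the business’ life stage and the macro climate shift — rather than something that happens to you sprint by sprint.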
The stressful, important business meeting
And to round us out, a blog post and tweet thread that resonated with the data internet this week: that dreaded regular business review meeting.
Bobby Pinero has this to say:
I had the same experience running weekly business reviews in the early Intercom days. Stressful. Far too much time preparing and collecting data. But it was the most valuable meeting we ran. This thread brought back nightmares. Take Archie's advice.
Me too, Bobby. Me too!
The level of effort involved in storytelling around business performance is not trivial. Any recurring process like a business review takes time to prepare for, and if the right investment isn’t made up front into sane metric definitions and data documentation, it can be the stuff of nightmares.
We’ve iterated internally on this process a fair bit in the last couple of years. We’ve used dedicated software to manage this process, spreadsheets, and eventually, built dashboards and automation on top of well-modeled data. What I’ve learned from the experience is that there is value in taking the journey — and by extension, experiencing some of that pain — especially when it encourages conversations about how to better measure what is important and align it with the right outcomes. There is really no substitute for those conversations in any tooling, because debating how something should be defined, measured, or estimated leads to a deeper understanding of the business itself, which is, as your Mastercard will remind you, priceless.
All of that being said, it certainly wouldn’t hurt to have version control in the business reporting layer and better integration between your data visuals and reporting layer ;)