Constraints-driven data team design
Also in this issue: the definitive guide to measuring data quality; data team pay parity; evaluating the cost of complexity in data systems; and visualizing SQL joins without Venn diagrams.
Before we dive into this week’s issue, some quick housekeeping:
Rebecca Sanjabi pointed out that one of the articles we linked to last week from SafeGraph needs to come with a very important caveat via the EFF. Thank you, Rebecca, for making sure the readers of this newsletter had this context and for holding us to our own standards 🙏
This week brought so much inspiring, practical advice for running data teams and data systems. Here's the great content we're covering in this issue:
Org design based on constraints rather than demand by Katie Bauer
How to measure data quality by Mikkel Dengsøe
The hidden cost of complexity by Chris Walsh
Some hot takes from Erica Louie on pay parity between Analytics Engineers and Analysts
Visualizing SQL joins without Venn diagrams by Andreas Martinson
Enjoy the issue!
by Katie Bauer
A couple of weeks ago I wrote about how we should define “good” analytics engineering (and specifically modelling data well) by how well your transformation layer supports collaboration from different data practitioners. Done well, your transformation layer becomes a repository of organizational knowledge. What we didn’t talk about then was how this impacts your data org design.
This past week, Katie’s ever insightful Twitter thread on org design based on constraints encouraged me to think about just that:
My takeaways from Katie’s thread:
Data teams are often designed (and funded!) based on stakeholder demand from elsewhere in the company. This usually leads to a model where one person is responsible for a specific business focus area — often a very broad one. Hiring more humans to grow your team shrinks the surface area each person needs to think about, and enables them to focus more deeply on a specific business problem.
Designing teams based on stakeholder demand doesn't automatically give us redundancy — while we understand that humans need things like vacations and sick days, these often come at the expense of progress on data needs in the areas that human is responsible for. And when your "expert" in a business area moves on to another role or company, the data team pays a significant cost in hiring and training a replacement.
We solve for redundancy in different ways on data teams: rotating human <> business unit assignments every X period to encourage cross training; or simply hiring additional humans to focus on a specific business area (e.g. a senior and junior data professional, or growing out an entire team).
All of these strategies are focused on redundancy as it relates to the needs of the business unit, rather than the needs of the humans on those teams.
There are also the needs of systems those humans develop over time — e.g. maintaining data quality is an active and ongoing process (more on this in Mikkel’s piece this week!); metrics sometimes do things you don’t expect them to and you have to find time to figure out what’s behind the change — is it a bug, seasonality or a fundamental change in the business?
Katie argues that just like software engineering teams have minimum sizes (6-8) to take into account inevitable constraints like on-call rotations, security issues and other urgent system needs, we need to do the same with data teams to better support the humans on those teams, and the systems they develop.
Another anti-pattern I’ve personally seen that I think is implied in Katie’s writing but not called out explicitly: the temptation to split data teams into “stakeholder focused” units and “systems focused” units. We come up with ratios of one to the other (one systems focused human for every X stakeholder focused humans) and rationalize that headcount spend with our organizations. The problem with this model is that both units get the worst of all worlds: the systems focused units are chronically understaffed for on-call rotations and grow increasingly disconnected from what their users are actually doing, while the stakeholder focused units still operate at full capacity and little knowledge transfer actually occurs.
We also do things like try to set aside X% of someone’s time to focus on anticipating rather than reacting to business needs. But if we set aside 25% of someone’s time, it usually means they’re still spending 100% of their time on reactive work, and we’ve tacked on an invisible 25% overhead for them.
I argue that we’re still not solving for the humans on those teams when we do either of these things.
What should this look like instead? I think we already have most of the answers but IMO we’ve been talking about them somewhat separately up until now:
Design your data team more like an engineering team that works on a product. This implies zeroing in on the key business priorities for a quarter, focusing everyone’s time on those, and then building out a holistic knowledge layer together inside your data stack of choice: a rigorous data model and pipelines, plus a well designed visual layer that is flexible and interactive (think a Customer 360 style complex data application rather than a quick and dirty dashboard).
Think about building cross-functional teams that are able to work together to solve a variety of data problems, rather than hiring individuals specialized in solving a small number of them. A great model for this in modern engineering organizations is the EPD squad which contains some combination of engineering, product and design talent working together consistently over time. These teams own feature areas, and are staffed to balance both net new development and react to maintenance needs with their features. I call these types of teams in the data sphere purple teams — teams built with humans who bring different things to the table like experience solving different kinds of business problems, experience building different types of data systems, and experience turning that raw knowledge into well scoped products delivered to customers of the team.
To enable your purple teams to work well, aim to design your knowledge layer for collaboration from the beginning. All of your purple teams (you’ll end up with more than one once you show the success of this model in delivering value) should collaborate to build a shared knowledge layer, organized in a predictable way, with standardized documentation and coding standards. If you do this well, you’ll find that teams starting on a new problem can build on and extend work done by others, rather than start from scratch. Every new project will be quicker and more resilient if it reuses components that already exist.
This is the essence of success in modern software engineering, and today there’s no reason our data teams can’t work like this as well.
Elsewhere on the internet…
How to measure data quality
by Mikkel Dengsøe
I enjoyed this piece from Mikkel so much! There’s a full roundup worth of content to unpack here, and I encourage everyone to read the whole thing. And then read it again, print it out, stick it on your monitor and tell everyone you know.
Some highlights I enjoyed in particular:
Weekly active users of a dashboard: Data people should align themselves with the value they create. One of the best ways to do this is by keeping an eye on who uses a data product, which in many cases is a dashboard. Not only does this give you visibility into whether people use your work, but you can also share the success with team members such as analytics engineers or product squads to show them that the upstream work they put into data is paying off.
Tracking this metric is essential for safely making changes to your data model and understanding the business impact of potentially breaking changes. It also lets you turn dashboards off once they’re no longer serving business needs, reducing the amount of infrastructure your team has to maintain.
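To make this concrete, here's a minimal sketch of computing weekly active users per dashboard from a view log. The log shape and dashboard names are hypothetical — in practice you'd pull this from your BI tool's audit or query history.

```python
from collections import defaultdict
from datetime import date

# Hypothetical dashboard view log: (dashboard, user, date of view).
view_log = [
    ("revenue_kpis", "amy", date(2022, 6, 6)),
    ("revenue_kpis", "ben", date(2022, 6, 7)),
    ("revenue_kpis", "amy", date(2022, 6, 8)),
    ("churn_deep_dive", "amy", date(2022, 6, 8)),
]

def weekly_active_users(log):
    """Count distinct viewers per dashboard per ISO week."""
    viewers = defaultdict(set)
    for dashboard, user, day in log:
        iso_year, iso_week, _ = day.isocalendar()
        viewers[(dashboard, iso_year, iso_week)].add(user)
    return {key: len(users) for key, users in viewers.items()}

print(weekly_active_users(view_log))
# revenue_kpis has 2 distinct viewers in ISO week 23 of 2022,
# churn_deep_dive has 1 — a candidate for eventual deprecation.
```

Counting distinct users (rather than raw views) is what makes the number comparable across dashboards of very different refresh habits.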
What’s really exciting is when you start building workflows around these metrics.
Want to improve data test coverage? Make a rule that every time someone is made aware of a data issue that was not caught by a test, they should add a new test.
YES! 🙌 🙌 🙌
Criticality: Not all data should be treated the same. An error on a data model that’s only used by you and a few close colleagues may have a very different impact than an error in your top level KPI dashboard or in a data service that powers a production level ML system. Most teams have a way of knowing about this, for example by tagging data models as “tier 1”, “critical” or “gold standard”.
So much of this is often locked up in the tacit knowledge your team shares and it can and should be made very explicit in your data catalog layer (if you have one) or as close to your knowledge layer as possible.
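One lightweight way to make that tacit knowledge explicit — sketched here in Python with hypothetical model names and Slack channels — is a criticality registry that your alerting reads from:

```python
# Hypothetical criticality registry: tiers live in code, not in
# anyone's head. In a dbt project this could be model tags instead.
MODEL_TIERS = {
    "fct_revenue": "tier_1",        # powers the top-level KPI dashboard
    "ml_churn_features": "tier_1",  # feeds a production ML system
    "scratch_adhoc_analysis": "tier_3",
}

def alert_channel(model: str) -> str:
    """Route data quality alerts by criticality tier."""
    tier = MODEL_TIERS.get(model, "tier_3")  # unknown models default to low tier
    return "#data-incidents" if tier == "tier_1" else "#data-quality-log"

print(alert_channel("fct_revenue"))  # #data-incidents
```

Once the tiers are in code, anything can consume them: alert routing, SLA dashboards, or review requirements for changes to tier 1 models.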
The hidden cost of complexity
by Chris Walsh
Chris does a great job making us think more about system complexity in data work. While I think many of us understand complexity costs at a very high level, I really appreciated the deep dive in this article into how Chris thinks about setting a “complexity limit” for a project based on the needs of the problem being solved and resources available.
If the project is going to scale, then I will be more stringent about the complexity limitations. Sacrificing reliability will be considered a very steep cost.
This is so important not just in the context of data science that Chris writes about, but in data system design in general. In the case of insights focused workflows, I think of scale as the number of humans who depend on the data. The more important an aspect of your data model is to the business, the more humans depend on a metric or table, the more important it is to make sure your code is NOT overly complex, is maintainable, modular, well documented and easy to modify/extend without breaking existing dependencies.
I think there’s an entire book hiding in Erik’s post breaking down decision intelligence into a framework. It’s a dense article that coins lots of new words for familiar activities, and I think that’s one of the best features of this piece: Erik takes parts of the decision making process in an organization, breaks them down into logical pieces, and then talks about where and how to apply knowledge to help drive those decisions. New words help us take a step back and re-think what we think we already know here.
One of my favorite points that we often spend too little time on as data professionals:
Clearly, the process of appropriately framing a decision is key to not only defining appropriate decision scope, but also to ensuring that the desired decision process has been established. The frame, therefore, should at a minimum address questions like:
· What are my desired outcomes?
· How much am I willing to invest in obtaining these outcomes?
· What are my risks? Are they tolerable?
Indeed — you should know the answers to these questions before you look at any data, because they can be swayed by what you find in the data, even if you don’t intend them to be!
This is especially important to establish when working with data stakeholders. Focusing on the available levers, resource constraints, and risk profile for a given decision is a form of social contract: it helps everyone get on the same page about whether to even do the work to begin with. It answers the question “If we get you this data, will it actually be used in making a decision?” before any work begins.
Why Data Analysts are still important
by Erica Louie
I’m looking forward to having Erica as a guest author on this Roundup at some point in the near future to dig more into this, and so I intend to say very little here in anticipation of that :) However, I encourage folks to read the thread and participate in the conversation that is still evolving this weekend. There’s some very good and some 🌶️ takes in there, and it’s a good time!
Visualizing SQL joins without Venn diagrams
by Andreas Martinson
This article is actually from April, but I’ve just discovered it thanks to some recent retweets. I was surprised by how divisive this article was on Twitter, between the Venn diagram camp and the folks who really liked Andreas’ visual primer instead :)
My brain LOVED this visual because that’s exactly what my brain does already whenever I write a join statement.
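The same row-matching intuition (not Andreas' actual visuals — just a sketch of the idea) can be written out in a few lines of Python: a join lines rows up by key rather than overlapping sets, which is why left and inner joins differ only in what happens to unmatched rows.

```python
# Two tiny "tables" as (key, value) rows; names are illustrative.
left_rows = [("a", 1), ("b", 2), ("c", 3)]
right_rows = [("b", "x"), ("c", "y"), ("d", "z")]

def inner_join(left, right):
    # Keep only left rows whose key has a match on the right.
    rmap = dict(right)
    return [(k, v, rmap[k]) for k, v in left if k in rmap]

def left_join(left, right):
    # Keep every left row; unmatched keys get None (SQL's NULL).
    rmap = dict(right)
    return [(k, v, rmap.get(k)) for k, v in left]

print(inner_join(left_rows, right_rows))
# [('b', 2, 'x'), ('c', 3, 'y')]
print(left_join(left_rows, right_rows))
# [('a', 1, None), ('b', 2, 'x'), ('c', 3, 'y')]
```

Reading the outputs row by row is exactly the mental model the article advocates: the Venn diagram hides the fact that joins operate on rows, not on sets of keys.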
What do you think?
And finally, one of the best data visualizations I’ve seen this year from the New York Times. It is sobering, representative of the scale of the problem it communicates, and absolutely incredible in its level of detail — the animation transitions alone are worth going back and forth in this narrative again and again. 💔
That’s it for this week folks! As always, hit me up in the comments or reply to this e-mail with your takes.