Data lineage layers. Collaborative analytics. Over-engineering data products.
Also in this edition: the definitive guide to cohort retention, being a better data team manager, and a first-hand account of how to get started with a career in data.
There are SO many great articles this week that I’m going to skip the traditional think piece and jump straight into them!
In this week’s edition:
The Many Layers of Data Lineage (Borja Vazquez)
The Qualified Scientist (Emily Thompson)
How I learned to stop worrying and love being a manager (Brittany Bennett)
How to measure cohort retention (Olga Berezovsky)
Not everything data related should be a product (Eric Weber)
But first, a very important Coalesce announcement:
-Anna
The Many Layers of Data Lineage
By Borja Vazquez
The Google Maps analogy is such a deeply satisfying way of describing what’s missing today in analytics engineering user experience, and also what’s possible.
What if we could visualise traffic jams between models the same way as Google Maps does between cities? It feels natural that it should be possible to overlay all the metadata collected on top of a DAG.
Here’s why I think this idea is so cool:
It shifts the conversation towards imagining what better analytics engineering UX actually looks like. As a community, we’ve spent a good deal of time thinking about the economics of a growing set of modern data stack tools, but relatively less time on first-principles thinking about what end users need from their workflows. This article is a great example of the latter!
Data Lineage Maps as first class citizens in our day to day work as a single source of truth everyone can consume, analyse and understand without the burden of having to learn new tools or asking colleagues
I couldn’t agree more!
Better analytics engineering UX also means a better experience for everyone else who works with or uses data. We often talk about “self-service data” and the work that goes into actually making this happen — extensive documentation, education, etc. I think that the easier and more intuitive it becomes to visually explore and make meaning directly from data lineage, the closer we’ll get to a world where self-service analytics is a reality. Borja absolutely nails the principles that will make this vision a reality.
Principle #1: Extensibility
The best thing about Borja’s analogy is that it implies an API: one that allows you to continually expand and make meaning on top of data lineage in novel and interesting ways. Imagine being able to directly overlay things like:
whether data is production-grade (expected to be trustworthy) or more ‘use at your own risk’, without having to infer this through multiple clicks and pages of documentation
which datasets are used downstream in a BI tool, which are used in reverse ETL workflows and even what applications data is sent out to
the last time the code powering a particular dataset was updated
Borja’s example in the article of overlaying the slowest path is also an extremely powerful one for anyone who needs to debug and troubleshoot a data pipeline for speed or cost. dbt is starting to get at this idea with the model timing tab, but Borja’s example makes me think about where else in the UI this type of information would be most useful to the end user.
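To make the “slowest path” idea a bit more concrete, here’s a minimal sketch of what computing it over a lineage graph could look like. All model names, parent lists, and run times below are invented for illustration — this isn’t dbt’s actual implementation, just the classic longest-path-in-a-DAG idea applied to model runtimes:

```python
# Hypothetical model DAG: each model lists its parents, plus a per-model
# run time in minutes. All names and numbers are made up for illustration.
run_minutes = {
    "raw_orders": 2,
    "stg_orders": 5,
    "stg_users": 3,
    "orders_daily": 12,
    "finance_mart": 7,
}
parents = {
    "raw_orders": [],
    "stg_orders": ["raw_orders"],
    "stg_users": [],
    "orders_daily": ["stg_orders", "stg_users"],
    "finance_mart": ["orders_daily"],
}

def slowest_path(model, memo=None):
    """Return (total_minutes, path): the slowest chain of models ending at `model`."""
    memo = {} if memo is None else memo
    if model not in memo:
        best_cost, best_path = 0, []
        for p in parents.get(model, []):
            cost, path = slowest_path(p, memo)
            if cost > best_cost:
                best_cost, best_path = cost, path
        memo[model] = (best_cost + run_minutes[model], best_path + [model])
    return memo[model]
```

Calling `slowest_path("finance_mart")` on this toy graph walks back through `orders_daily` and `stg_orders` — exactly the “traffic jam” route you’d want highlighted on a lineage map.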
What would you want to see overlaid on top of your data lineage if you could?
Principle #2: Varying granularity
I think this is the second killer principle of Borja’s post. The ability to not only overlay information but also reshape the lineage visualization to surface different levels of insight is a huge unlock.
Just like sometimes I want to zoom down to street view or zoom back out to country-level insights with Google Maps, I want to be able to do the same with my data lineage. I want to be able to zoom all the way in to the code powering a particular transformation, and as far out as the business units powered by those datasets.
Borja’s example of the business layer is very compelling — increasingly I find myself thinking about the data my team maintains as a portfolio. Zooming out to examine the business layer (and overlaying information on top of that, like % of coverage, overall pipeline health, average delivery time, # of outstanding tickets or bug reports, etc.) gives me the ability to effectively manage and prioritize the work to support the growing portfolio.
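One way to picture the “zoom out” operation: collapse model-level lineage edges into domain-level edges. The sketch below is a hypothetical illustration (the model-to-domain mapping and edges are invented), but it shows how little machinery the business layer actually needs:

```python
# Hypothetical mapping of models to business domains, plus model-level edges.
domain = {
    "stg_orders": "finance",
    "orders_daily": "finance",
    "revenue_by_channel": "finance",
    "stg_sessions": "marketing",
    "attribution": "marketing",
}
edges = [
    ("stg_orders", "orders_daily"),
    ("stg_sessions", "attribution"),
    ("attribution", "revenue_by_channel"),
    ("orders_daily", "revenue_by_channel"),
]

def business_layer(edges, domain):
    """Collapse model-level edges into domain-level edges; drop within-domain self-loops."""
    rolled_up = set()
    for src, dst in edges:
        a, b = domain[src], domain[dst]
        if a != b:  # only cross-domain dependencies survive the zoom-out
            rolled_up.add((a, b))
    return rolled_up
```

On this toy graph the four model-level edges collapse to a single domain-level edge, marketing → finance — the kind of high-level dependency a portfolio view would surface.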
Borja — I can’t wait for your vision of the future to become a reality. This is exactly what I, and I think many others, have been missing.
The Qualified Scientist
By Emily Thompson
Emily’s writing is always excellent, and something I look forward to every week. This week in particular, Emily writes very eloquently about something I strongly believe in:
Collaborating on analysis is not just for data practitioners.
What if we started acknowledging all contributions and created an intentional culture for the entirety of the internal “analysis collaboration?” And not just limited to the people we think of today when we hear the term “data practitioners” but really everyone who creates or touches the data needed to make an analysis successful. With that lens, I would also include other groups not traditionally thought of as part of a data team, such as software engineers responsible for event instrumentation and IT teams responsible for 3rd party tools and integration as equal members of this internal data collaboration.
Emily also nails the challenge most data teams experience in making the above world a reality. Engineering ownership of upstream data sources isn’t just a buy-in problem, it’s also a visibility problem:
we’re still kind of missing the people aspect of the problem. Why would a software engineer ever be excited about getting instrumentation right if they never see the downstream business impact of the telemetry they added to product features?
How would a data team’s relationship with upstream teams change if those upstream teams could easily see the impact of their work on anyone using it downstream?
It’s not at all a novel concept in software engineering. Any sufficiently low-level open source project with a large set of dependents (e.g. a common JS library, or the Linux kernel that is foundational to most servers today) has to consider the downstream impact of changes made to its code. There are well-established practices for doing so safely, because the ecosystem that depends on a particular project is the reason for the project’s continued existence. Most use some concept of release versioning, but some maintainers go as far as automating pull requests that actually make changes to downstream code to be compatible with a new release.
Things like GitHub code search make this possible across many open source repositories. The big question is how can we make this possible for data workflows?
We can start by using the same tools. If your engineering team works in GitHub or GitLab, say, and you do too, it’s easier to cross-reference code.
We can make it easier to visualize downstream impact. This is where Borja’s idea of lineage layers (above) comes in — what if you could build a lineage layer that shows the downstream impact of a particular Snowplow event? At least, you’ll have visibility into what breaks during runtime, and ideally you can share the visualization with your engineering counterparts and collaborate on changes together.
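At its core, “downstream impact of an event” is just a graph traversal over lineage. Here’s a minimal sketch, with an invented lineage graph (the event and model names are hypothetical, not a real Snowplow schema):

```python
from collections import deque

# Hypothetical lineage: each node maps to its direct downstream consumers.
downstream = {
    "snowplow_page_view": ["stg_page_views"],
    "stg_page_views": ["sessions", "funnel_report"],
    "sessions": ["weekly_kpi_dashboard"],
}

def impacted(node):
    """Breadth-first search: everything that could break if `node` changes."""
    seen, queue = set(), deque([node])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Handing an engineer the output of `impacted("snowplow_page_view")` — every staging model, report, and dashboard touched by their event — is exactly the visibility Emily is asking for.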
what issues would disappear if we structured ourselves in such a way so that everyone involved, no matter how remotely, felt like an “author” of every data analysis?
I’m a big believer in socio-technical systems, that is, the idea that both the way we structure ourselves as humans and the technology we use shape outcomes. I also think it’s easier to build new technology than to change behavior, so the spin on this question I would ask is:
what issues would disappear if we built data tools in a way that helped everyone involved, no matter how remotely, see their own impact on every data analysis?
Off the beaten path: Megan Lieu
I’m such a big fan of the “Off the beaten path” series from Madison Mae (and not only because the very first edition featured our very own Amy C! 😉). I’m a big fan because these are important stories to tell — no two humans I’ve worked with in my career in data had exactly the same background or story that brought them to their data career.
I’ve worked with former accountants, theoretical physicists, students of social movements and agriculture, army vets, consultants, teachers… all of whom were equally excellent at doing their jobs in data. I’m personally a communications major and recovering academic.
What this experience taught me is that when it comes to a successful data career, having a specific background is far less important than the transferable skills you’ve built along the way. Your variety of prior experiences gives you a perspective that someone else on the team doesn’t have, which allows you to look at the business through a different lens. I’ve written before about how the business context someone brings to a team is more important than their depth of technical expertise, and I think this effect is multiplied across the different backgrounds on a team.
Megan’s story is very cool and worth reading in full. It’s easy to read it, though, and think that you have to do something very special, like go viral on social media, in order to land your first data job. I don’t think that’s the message at all.
I think the message is that it can happen for you too if you learn to start using data to solve problems that you see as important. Because your unique background and experience applied to data problems is what is most interesting, valuable and special. Don’t be afraid to show it off! ✨
How I learned to stop worrying and love being a manager
By Brittany Bennett
Speaking of data humans with interesting, valuable and special backgrounds… 😉Brittany Bennett has an insightful article for us this month reflecting on her experience becoming a manager.
I appreciate the honesty of this piece. Just like in life we worry about reproducing the mistakes of our parents, in work, we worry about reproducing the mistakes of our former managers.
Brittany shares a few techniques she learned to build accountability on her team while preserving personal autonomy. Striking this balance is very, very hard, and her writing reminded me of some of the techniques I’ve appreciated from former managers, like bi-weekly accountability reports embedded in a project tracker. This might seem obvious when someone says it out loud (project updates, big deal! psh!) but it’s actually incredibly efficient and effective if set up well. There’s a kind of art to designing an easy reporting mechanism that doesn’t introduce too much overhead for your team. Brittany has figured out a way to do this, and it worked well for her team.
Thank you for sharing your experience, and for another great article, Brittany!
How to measure cohort retention
By Olga Berezovsky
Yes, yes, yes, and yes! 🙌
Cohort retention is such an unbelievably common problem in analytics but also an unbelievably squishy one.
Olga does a great job describing why this is squishy — your retention measures vary depending on your product and business model. Olga has some great rules of thumb regardless of whether you’re working with B2C or B2B, subscription-based SaaS, one-off transactions, or ad-supported services.
She also points out that there are several ways of measuring AND visualizing cohort retention that are equally valid — so how do you pick one?
I agree with Olga that cohort retention charts like this one are the most informative:
Before Olga’s article I would have advocated most strongly for X-day retention measurements, but she’s convinced me that’s probably because of the kinds of businesses I’ve worked with in the past. 🤓
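If you’ve never built one, the table behind a cohort retention chart is simpler than it looks: bucket each user by the month they first showed up, then count how many come back N months later. Here’s a minimal sketch over an invented event log (user IDs and dates are made up; a real version would read from your warehouse):

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity log: (user_id, activity_date) pairs.
events = [
    ("u1", date(2022, 5, 2)), ("u1", date(2022, 6, 10)),
    ("u2", date(2022, 5, 15)),
    ("u3", date(2022, 6, 1)), ("u3", date(2022, 7, 3)), ("u3", date(2022, 8, 20)),
]

def month_index(d):
    """Map a date to a linear month number so month math is simple subtraction."""
    return d.year * 12 + d.month

def cohort_retention(events):
    """Return {cohort_month: {months_since_first_activity: retained_user_count}}."""
    first_seen = {}
    for user, d in events:
        if user not in first_seen or d < first_seen[user]:
            first_seen[user] = d
    table = defaultdict(lambda: defaultdict(set))
    for user, d in events:
        cohort = month_index(first_seen[user])
        offset = month_index(d) - cohort
        table[cohort][offset].add(user)  # sets de-dupe repeat activity in a month
    return {cohort: {off: len(users) for off, users in offsets.items()}
            for cohort, offsets in table.items()}
```

On this toy data, the May 2022 cohort has 2 users at month 0 and 1 retained at month 1 — divide each row by its month-0 count and you have the percentages that retention charts plot.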
Read it if you’ve always wanted to do more cohort analyses.
Read it if you already think you know what you’re doing too — there’s probably something in here that you haven’t thought about before.
Really excellent post, Olga!
Not everything data related should be a product
By Eric Weber
Last but certainly not least, some thoughts from Eric on over-engineering data products:
The key question here is not whether something can be a data product in an organization. The answer is that with enough engineering, data science and product support, it probably can. The question is should it become a data product? This is where the role of a data product manager and leaders in the company really matters - they should decide not only what to invest in, but explicitly define what is not worth investing in now but may be worth investing in later.
I feel that I must at this point plug one of my favorite XKCDs: The General Problem:
It’s funny because it’s true.
It’s true because engineers are incentivized to over-engineer. When your personal career path progression is defined by solving increasingly complex problems, you’re incentivized to over-engineer. When the goal of your team is to build systems, everything is a nail.
And the more engineering best practices we adopt on data teams, the more we adopt the incentive to over-engineer our data products too.
Eric provides some good questions to ask ourselves whenever we think about building a new product/system/framework, and I’ve personally found them helpful in managing this challenge in the past:
There are myriad questions that can help define this boundary like “who in the organization needs this capability?”, “how many people in the organization need it?”, “what is the return we get for going from manual to a product that supports it?”, “what is the cost?”.
To Eric’s point, the most important shift to make is not which questions to ask, or how to arrive at the decision framework, but to even ask the question at all.
That’s it for this week!
If you made it this far — WHEW! I told you there was a lot of good stuff this week ;)