It is people. It is people. It is people.
What we learned from Coalesce 2021: why most data problems are about people, how to meet people where they are, and learning from open source to empower decentralized data advocacy.
This week, we did a thing. Coalesce 2021 finally came together, live and online with over 5,000 humans tuning in.
You can be forgiven for wondering if Coalesce is, in fact, a data tech conference or something else entirely. Depending on when you tuned in to Coalesce this week, you may very well have started your experience with a purple shaman, an elf 🧝♀️, a talking cookie, or a visual of Chris Jenkins chanting in Maori:
He Tāngata. He Tāngata. He Tāngata.
I won’t attempt to summarize the 41 hours of content in this Roundup. Instead, I’m more interested in zooming out and taking it all in. What did we accomplish here this week, as a community? For whom? And in what ways does it matter?
This is just the beginning
Chris Jenkins’s session on Tāngata, an open source library powering an editable data catalog, is emblematic of the larger themes in the conference this week: open data tools and using them to build better human interfaces.
About 12 minutes and 20 seconds into the replay, Chris tells the story of a Maori proverb that, roughly translated, reads:
What is the most important thing in the world? It is people, it is people, it is people.
He Tāngata. He Tāngata. He Tāngata.
This is a theme woven into all of the content presented and the conversations that were started this week (and are still going!).
It is people.
Many of the talks this week reflected on what can sometimes feel like absolute chaos, both in the ways our tools are evolving and in the ways our roles as data practitioners are evolving.
On Monday, Tristan Handy and Martin Casado talked about “How big is this wave?” in reference to the massive explosion of tooling in the modern data stack. My biggest takeaway from their conversation was that we’re not yet done with changes in our tools, our jobs, and our industry. As the market continues to evolve, Martin says, it will create vacuums, and new tools will continue to emerge to fill those gaps. And in fact, the rate of change might still be accelerating. As Alan Cruickshank put it in a later session, this is just the beginning.
“Dealing with data is dealing with the complexities of the universe”, Martin says. “The limit to which you can apply data to is the natural world, and there is no limit to the applications that you can build on top of data”.
I didn’t fully internalize the magnitude of what Martin is saying here until I watched a replay of the conversation. Applications built on vast amounts of data will be as ubiquitous tomorrow as software is around us today. Today we can no longer imagine a world without programmable interfaces, and we are just beginning to connect them together. Imagine the possibilities when our data application stack becomes as robust as the software stack is today, and the sheer volume of data we have collected starts to be combined in new and interesting ways.
The more our data stack follows the evolution of software (that is, away from massive all inclusive IDEs towards highly modularized open source libraries and frameworks), the more this will enable specialized data applications to flourish. The more data is collected and combined, the richer those applications will be in solving human problems. There are murmurings of this today: Google Maps not only navigates traffic for you, it now also offers eco-friendly driving routes that use less gas. But only Google has access to data that powers this information today. Imagine if citizen scientists did, and they also had the tools to easily build on top of this data. What interesting applications will they build to help tackle climate change?
Are we a few years or a few decades away from this world? We have no way of knowing. But we can predict that the chaos we’re feeling will most likely stay with us for years to come. How do we navigate this growing complexity? How do we adapt as we redraw the map of what’s possible with data by orders of magnitude every year?
By meeting people where they are.
It is people.
Meeting people where they are
On Tuesday, we argued over whether “Data Scientist” is still a helpful title, or whether it is holding us, as an industry, back from where we want to go next. Reading Emily Thompson’s rebuttal to Emilie Schario’s talk made me wonder: what if it’s both?
Emilie Schario makes a strong case that the “Data Scientist” job title is often used as a kind of escape valve when organizations don’t really know what they need from their data teams. Instead, we as practitioners should be encouraging our current and future bosses, HRBPs, and employees to think harder about what we need data humans to do, and create appropriate career paths for them to do this. If you are in a position privileged enough to shape these conversations, you must absolutely do so.
At the same time, not everyone is there yet. As Emily Thompson points out, many organizations genuinely do not yet know what they need from their data teams. Their bias is to hire generalists who can do a little bit of everything and help them figure it out. Depending on who is doing the hiring, they might hire a Data Scientist first, or a Data Engineer (although Stef Olafsdottir makes a convincing case for why you shouldn’t start there).
And while all this is continuing to evolve, Data Scientists and any data practitioner with “Engineer” in their title still earn more money than Analysts, despite very often doing similar work. Our arguments over names are really arguments over the value of the work humans do, and the recognition of that work through meaningful compensation and career paths.
It is people.
To properly address this problem, we need to meet people where they are today. We, data leaders, need to help navigate the chaos that lies ahead by focusing on the people, what they need from us in that moment, and what our organizations need in that moment.
Everyone is building this ship together as we’re sailing it. Some of us are further along than others, but we’re all in the same boat.
How do we do this?
Open source and interfaces
In “How big is this wave?”, Tristan and Martin talk about enabling decentralized and distributed data tooling (read: more different kinds of data tools) that all work independently through common interfaces. We’ve learned from examples in software engineering that the only way to do this is through open access to source code and collaborating on open standards.
We’re trending in the right direction here, tooling-wise. As Katie Hindson showed us this week, you can already build an entire data stack on top of open source tools today. dbt is at the heart of this stack, or as Martin put it, “the glue that stitches together this massive expanse of the data industry”. It is imperative that the right elements of dbt remain open to the data Community to help all of us get to this future.
Tristan has written this week about how we plan to keep new features of dbt as open as possible while trying to build a sustainable business. TL;DR: The much anticipated metrics definitions and the ability to compile them will land in the open source dbt Core project because this layer must become an open standard in our industry. And the remaining components that you’d need to run your own service using dbt Server will be source available, keeping this technology open for the Community to evolve together.
We’re well on our way to breaking down barriers in our tooling and developing common data interfaces and standards.
But what we learned this week at Coalesce is that we're less good at breaking down barriers and developing interfaces between the people in our organizations. Something else Emily Thompson wrote that stood out to me:
The real problem I see is that we as an industry haven’t figured out how to optimally integrate science-practitioners with product developers, at least not consistently.
I agree. Also, how to integrate data practitioners and marketing wizards, data practitioners and sales champions, data practitioners and customer support heroes, data practitioners and [insert company function here].
There are really only a few pathways to solving this problem of human integration:
1. Bring more and new kinds of data humans into the various functions you need. For an example of hiring new roles, see “New Data Role on the Block: Revenue Analytics” from Celina Wong. For some advice on training for new skills, see “Git for the Rest of Us” from Claire Carroll. For advice on how to properly set up a hiring pipeline to build a diverse team, see “Beyond the Black Box” by Abia Obas.
2. Build better tooling that acts as an interface for humans to collaborate with each other, or create more options for folks to use data where they work. Make it easier for folks to focus on the thing they need to do together, and abstract the rest. For some examples see:
“So you think you can DAG? Supporting data scientists with dbt packages” from Emma Peterson.
“The future of Analytics is Polyglot” from Caitlin Colgrove.
3. Change what you incentivize in folks’ roles to encourage the right interfaces to emerge. Hire for data literacy at the decision making layers, and incentivize everyone in the organization to be accountable to data goals. For an example, see “Scaling Knowledge > Scaling Bodies: Why dbt Labs is making the bet on a data literate organization”.
Which one is the right path? They all are. Managers and Directors of functional areas have the power to influence #1. ICs on data teams have the power to influence #2. Leaders at every layer of your organization have the power to influence #3.
Decentralized and distributed data advocacy (that is, doing #1, #2, and #3) is how we will make slow and steady progress amidst the chaos that this new wave of tooling will introduce. But we don’t have to have all of our ducks in a row 🦆🦆🦆. Change starts in any one of these areas, regardless of where you sit in your organization. And there are something like 77 sessions with ideas for you on how to enact change where you are.
Where will you start?
Elsewhere on the internet…
Benn Stancil’s take on the Analytics community, post Coalesce 2021:
The analytics community’s success is proof that, even if software eats the world, software culture doesn’t have to. It’s proof that the good things about the tech industry—the ambition, the impact, the financial opportunity—aren’t the causes of the bad. It’s proof we can throw out most of the bathwater, and keep the baby.
Given how much the data industry tries to learn from software engineering, maybe it’s time to go the other direction. Maybe it’s time for community leaders to push outward, into the tech industry and into its newer adjacencies, and to start teaching rather than learning. There are a lot of dark spots in tech, and what a light, my lord, is needed to conquer so mighty a darkness. But the best in this community have that brilliance. It’s time others get to see it.
💜 If this doesn’t capture the vibe of the week, I don’t know what does. Thank you for this write up, Benn!
And although a lot of our corner of the data sphere was immersed in Coalesce this week, other things happened as well 😉:
Databricks launched their own VC fund
Congrats to folks over at Databricks! Excited to see what brave new spaces will be created by the companies they fund 👏👏👏
The Concept Layer
by Ian Fahey
This is an interesting take on building a shared internal knowledge graph of a business. Despite the similarity in name, this is not actually a take on the metrics layer as a data product. It's one data team’s strategy for the delicate balancing act between standardizing data definitions and keeping clean, canonical data easy to remix and explore. Their solution: defining a canonical set of tables that represent the most important business entities and pairing this with event tables and standardized dimensions to create structured, easy to explore data that's consistent across the business. ✨ Very neat stuff!
Also this week, Ashley Sherwood started a Substack. The latest post, on the hard work of having an opinion, is a must-read.
We’ve talked a lot in the past weeks about bringing data professionals into the room where decisions are made. But what do we do once we get there? How do we make our case based on data? How much opinion should we have? Ashley breaks it all down for us. Some of my favorite bits:
Use data where it’s useful, but look for business, process, human, and ethical constraints as well.
Data is not a panacea, and just because the data says something doesn’t mean it’s always the right thing to do. Hear, hear!
I also appreciated the distinction between making an evidence-based claim and an evidence dump. The latter is when your data professional rattles off a bunch of facts and leaves you on your own to divine a recommendation (something all of us have been guilty of at some point).
Good opinions are mutable—they’re open to change in the face of new, different information.
This is important. You don’t have to know everything to make an evidence-based claim. You need to know just enough, and to state your assumptions and context, which may very well change over time. And you don’t have to qualify your claim with every assumption you make. Just be prepared to pull up more data to support your claim.
On how much is enough, this is probably my favorite part of her piece:
If you tend to err on the side of hasty judgment, take a moment to intentionally look for evidence that contradicts your opinion. […] If you tend to err on the side of endless hours of research before making an opinion—stop! Make an opinion right now. Even if you don’t have any information. Stretch your brain, do your best, try to logic it out. With every new piece of information, pause and reformulate an opinion. This is good for your critical reasoning muscle, and means you’ll never be caught without at least some opinion.
On opinions vs preferences:
You may find that the quality of your opinions increases when you no longer feel like you need to justify your personal preferences because it is entirely reasonable to hold an opinion that contradicts your preferences.
For example: “I prefer reading articles, but with the evidence that more people learn better from video presentations, I think we should invest department budget in recording more video training for our content.”
Being able to separate your own priors from data on what others need is so important, and what makes for a truly great data professional. You’re not being inconsistent, you’re being self-aware, empathetic and acknowledging your biases.