Devops and the Modern Data Experience
Devops has been an unbelievably productive development in the world of software engineering. What do we still have to learn from it in data?
We’re doing our first-ever AMA (ask me anything) episode on the podcast! Here’s what I need from you: either respond to this email or to this Twitter thread with your questions. On anything—data, life, quantitative easing, whatever! I’m a very open book and am happy to go wherever you think would be interesting.
Also, the most recent episode is out! Meme lord Seth Rosen joined Julia and me. Seth IRL is much less snarky but still very insightful—the conversation was an absolute pleasure to record. Get it here.
What can we learn from devops?
I’m using this issue as a way to write the talk I’m giving at Future Data 2021. Peter and team, sorry that my slides are late! Seems like most things I do are late these days 😬.
Before reading this post, make sure to read Benn’s post, The Modern Data Experience. In it, he talks about the problems that the modern data stack faces today. Specifically, its failure to cater effectively to users who need data to do their jobs but do not consider themselves “data people.” One of his prescriptions for the future is more integration.
To me, the modern data experience is a melting pot. Data stacks have a history of building walls to throw things over: data engineers throw pipelines at analysts; BI developers throw reports at their stakeholders; analysts throw results at anyone who will listen. Modularity can’t tempt us to build more walls. Just as dbt broke down the first wall, a modern data experience needs to break down the others by encouraging collaboration and conversation between business, data, and engineering teams. In other words, in the shorthand of the moment, the modern data experience is purple.
The problem with a “best of breed” approach to building a stack is that the end-to-end user experience will suffer unless those best-of-breed products are very well integrated. And products in the modern data stack are not well integrated. They exchange information indirectly, via SQL tables and views. One tool creates the source tables; another transforms them, creating new tables; a third reads those tables and helps users analyze them. There is no shared fabric to exchange metadata beyond the names of relations and columns. As a result, using each layer feels very disjointed from using the others. Looker users have very little context for what might be happening upstream in dbt and further upstream in Fivetran, for example.
Perhaps more importantly, the data tool more widely used than any other (Excel) is completely disconnected from the modern data stack. Have you ever queried Snowflake from inside of Excel? People have done it, but rarely, and it is not a good experience.
I also want to highlight one other section of Benn’s post:
(…) data cultures don’t materialize out of employee handbooks or internal seminars. The structures of our technologies and organizations till the land from which they grow.
As Facebook and its algorithmic-attention-magnet compatriots have reinforced to us over these past years, the medium is the message. We are shaped by our communication media—our thoughts, our work product, and our very identities. And in data, our tooling is fundamentally collaborative. Git is a communication medium…it helps us communicate code. dbt is a medium to express transformations on datasets.
If our communication media are disconnected, we will be disconnected.
Imagine your company today as a human society where only half the population can read, one tenth can write, where half a dozen languages are spoken, and where most of the books in the library contain things that once were true but have since become outdated (but you don’t know which ones). Not a highly productive information ecosystem.
If the project that we’ve been on for the past decade has been technical—being able to store and process large amounts of data—the project of the next decade is to create a system of knowledge creation and curation accessible to all knowledge workers. Everyone must be able to read and write, and knowledge must be organized, searchable, and reliable.
This statement of the problem doesn’t seem to lead obviously to “so let’s talk about devops!” Most folks outside of software engineering and infra tend to hear devops and think about technology…about Kubernetes, Docker, Terraform. And while the tools are interesting, what’s fascinating to me is the principles behind them.
Even though data is more aligned with devops than it was half a decade ago, I think we still have a lot to learn. My three lessons:
Create tooling that brings different kinds of people together
Refuse to click buttons
Kill your darlings
Create tooling that brings different kinds of people together
The core of devops is bringing people together, as you can intuit from the portmanteau name. “Devops” joins “development” and “operations”: it combines software development and IT operations to create a set of tooling / standards / workflows that both software engineers and infrastructure engineers can collaborate on.
In the language of product management, devops solves problems by designing for two personas instead of one. Software engineers previously had to file tickets and wait for infrastructure engineers to build their staging environments, deploy code, and much more. Now, both software and infrastructure engineers collaborate using shared tooling to create and manage infrastructure. No bottlenecks or religious wars, just shared headspace and collaboration.
There are many tools in the data ecosystem that are built for multiple personas. Looker is built for three—the data modeler, the report builder, and the report consumer. These don’t have to be, but often are, three distinct humans. dbt is built for two personas: the data engineer and the data analyst. It allows both of these humans to work together, combining their expertise to create stable pipelines with minimal friction.
Let’s use that thinking to imagine how we could cross what is probably the biggest divide in data today—the divide between data analysts and business stakeholders:
Without a powerful, flexible tool for data consumers to self-serve, the promise of the modern data stack will forever be for a select few.
In that post, I talk about the potency of the spreadsheet UX and how the modern data stack hasn’t seen such a powerful interface yet. Seen through the lens of devops, I think we can sharpen this question:
What would a truly multi-persona spreadsheet look like?
Imagine a spreadsheet that is built for both the business stakeholder and the analytics engineer. It’s Excel in the cloud, plus:
a native interface to querying a data warehouse, outputting results to a sheet.
the ability to natively participate in the dbt DAG by using `ref()`, and the ability for its data to be referenced by downstream computation in the data warehouse(!).
support for classic Excel-style formulas or R and Python functions.
a file format that natively separates data and computation and presents code in a way that can be easily code-reviewed by another human.
…etc. You get the idea.
My point is: this is what a multi-persona spreadsheet could look like, where the two personas are the Excel user and the analytics engineer. You can see bits of this starting to come together in what Bobby Pinero and team are building at Equals.
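To make the `ref()` idea a little more concrete, here is a minimal sketch of how a sheet could slot into a dbt-style DAG. Everything in it is an assumption for illustration: the node names (`stg_orders`, `revenue_sheet`, `board_report`), the idea of storing a sheet's computation as a plain string, and the regex-based parsing. It is a toy, not how dbt or any real spreadsheet product resolves references.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical definitions: each node stores only its computation (as text),
# kept separate from whatever data it eventually produces.
nodes = {
    "stg_orders": "select * from raw.orders",
    "revenue_sheet": "select region, sum(amount) from {{ ref('stg_orders') }} group by 1",
    "board_report": "select * from {{ ref('revenue_sheet') }} where region = 'EMEA'",
}

# Toy pattern for dbt-style {{ ref('name') }} calls (an assumption, not dbt's parser).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_refs(sql: str) -> set[str]:
    """Return the names of upstream nodes referenced via ref()."""
    return set(REF_PATTERN.findall(sql))


def build_order(definitions: dict[str, str]) -> list[str]:
    """Order nodes so that each one runs after everything it references."""
    graph = {name: find_refs(sql) for name, sql in definitions.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(nodes)
print(order)
```

Because the computation is just text with explicit references, a reviewer can read a diff of it like any other code, and the sheet (`revenue_sheet` here) can sit in the middle of the graph with nodes both upstream and downstream of it.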
The problem with spreadsheets has always been their complete and utter lack of operational maturity. What if adding another persona to the mix could help? What other gaps could we close by building tooling that appeals to two personas instead of one?
Refuse to click buttons
If you’re constructing a system meant to produce highly replicable results, humans cannot be a part of that system. Humans are…unpredictable. Computers are much more reliable. Computers also have a rather significant cost advantage :)
In devops, all systems are described in code. This is known as infrastructure-as-code, and it is one of the most central tenets of the field. This characteristic of devops enables so many of its wonderful traits: governance (infrastructure is version-controlled), replicability in dev-test-prod, robustness…
This hard requirement that all systems be described in code means that every single tool in the modern software engineer’s tool belt must be completely controllable via its API.
In practice, devops practitioners don’t just open up a Python file and start writing scripts; they tend to use Terraform to express their infrastructure. And if you’re familiar with dbt, you’ll find Terraform pretty familiar: Terraform was the inspiration for dbt.
The interesting thing about Terraform that feels relevant to data today is that Terraform wouldn’t work if the various APIs it orchestrates didn’t support the right operations.
Terraform doesn’t actually create and destroy EC2 instances or configure VPCs or, really, do anything itself. Rather, Terraform knows how to call the APIs that AWS or GCP or Azure (or Fivetran!) provide to perform the actions it needs to perform. Terraform has a bunch of providers that understand how to call the APIs of the various services it orchestrates, and at this point most Terraform providers are community- or vendor-supported. In the early days the HashiCorp folks (who make Terraform) built the initial providers themselves without any permission from the platforms that they were orchestrating.
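If the division of labor between a declarative tool and the APIs it calls feels abstract, a rough sketch may help. The code below is a simplified illustration of the general plan-then-apply pattern, not Terraform's actual implementation: `FakeCloudAPI`, `plan`, and `apply` are hypothetical names, and the "API" is an in-memory dictionary standing in for real vendor endpoints.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloudAPI:
    """Hypothetical stand-in for a vendor API; a real provider would make HTTP calls."""
    resources: dict[str, dict] = field(default_factory=dict)

    def create(self, name: str, config: dict) -> None:
        self.resources[name] = dict(config)

    def update(self, name: str, config: dict) -> None:
        self.resources[name] = dict(config)

    def delete(self, name: str) -> None:
        del self.resources[name]


def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare the desired state (the code) with the actual state and list the actions needed."""
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: dict[str, dict], api: FakeCloudAPI) -> list[tuple[str, str]]:
    """Carry out the plan purely by calling the API; the tool itself owns no resources."""
    actions = plan(desired, api.resources)
    for action, name in actions:
        if action == "create":
            api.create(name, desired[name])
        elif action == "update":
            api.update(name, desired[name])
        else:
            api.delete(name)
    return actions


# Made-up starting state and desired state, for illustration only.
api = FakeCloudAPI({"web": {"size": "small"}, "old_db": {"size": "large"}})
desired = {"web": {"size": "medium"}, "cache": {"size": "small"}}

first_run = apply(desired, api)
second_run = apply(desired, api)
print(first_run)
print(second_run)
```

The point of the sketch is the dependency: if the API lacked a `delete` or `update` operation, the tool could describe the desired state but could never reach it. Running `apply` a second time produces no actions, which is the property that makes the code, rather than a human clicking buttons, the source of truth.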
This is one of the wonderful things about software engineering. If you make something programmable (via great APIs), you enable users to build layers of abstraction on top of it, finding new and exciting ways to use the product. These users don’t need your permission to do so. The folks at HashiCorp never asked permission from AWS, and it’s not clear that, if they had been forced to, they would’ve gotten it. Permissionless innovation is wonderful for an ecosystem, and it’s fundamentally unlocked by APIs and standards.
In data, the APIs of our leading products are not as mature. They exist, but are often secondary to the graphical user interfaces.
dbt Cloud exposes a reasonably feature-complete API, but using it doesn’t present a fantastic developer experience today. This is an area of focus for us. There is a community-maintained Terraform provider, but it’s not getting much use today.
Fivetran has a nascent Terraform provider (linked above), and you can see that the team is iterating on it actively (kudos!).
Looker has famously excellent API coverage to control the behavior of an instance but doesn’t make it possible to script the creation of an instance itself as far as I know. There are a few community-maintained Terraform providers but nothing that looks mature.
Snowflake, BigQuery, and Databricks all have mature APIs and Terraform providers.
So: we’re not there yet across all layers of the stack, but there is movement in the right direction. Providing an excellent API should be a focus for each tool vendor in the space, and choosing tools with excellent APIs (or from vendors who are committed to this vision) should be important to forward-thinking buyers.
I’m genuinely excited to see what innovation will be unlocked once the entire stack is scriptable. The automation of environment creation will be only the first of many use cases.
Kill your darlings
I write a lot these days…probably more today than I ever have. You know when I wrote very little? When I had to write with a pencil and paper (for me, prior to my first Squarespace account in 2003). The worst part about pencil and paper is how annoyingly hard it is to erase something. Sure, erase a word, no problem. You may have some eraser crumbs to brush away, but not an issue. What about when you realize the last two paragraphs sucked and you need to trash them? Ugh. In pencil-and-paper world you have to transcribe the top of the page onto a new sheet of paper and then crumple up the prior version and trash it.
The funny thing about the ability to delete things is that it makes you less attached to each individual new idea you write. Ideas never come fully-formed; it is in the writing—and the editing—where they come to life. Making it easy to delete makes it easy to start writing; it also makes it easier to do a good job of editing. If you’re not attached to your ideas, you don’t worry about killing them off (literally or figuratively). This is neatly summarized in literary tradition as “kill your darlings.” Stephen King put it well:
kill your darlings, kill your darlings, even when it breaks your egocentric little scribbler’s heart, kill your darlings.
Software engineering works the same way. Over the last twenty years, Agile has taken over from Waterfall as the default methodology for developing software. In the loosest possible terms, the difference between the two approaches is that Agile recognizes that it is not possible to know everything at the start of a project. Instead, it is antifragile to new information: new information makes the final outcome stronger instead of weaker. Said another way: Agile encourages editing. And generally, we should strive to practice Agile when we manage data projects.
In devops, there is a metaphor to summarize this type of thinking: pets vs. cattle. It is rather a harsh metaphor and one that you might not love, but it is…effective:
In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.
Devops attempts to move all systems from pets to cattle, from systems that are hand-fed to systems that can be reasoned about in large numbers. Fundamentally, this enables software engineers to hit the delete key. Don’t like the way something is working out? Trash that branch and check out main. Or revert the last two commits. Want to re-think your networking? Delete that section of the config, rewrite it, run `terraform apply`. Nothing is set in stone, so get on with creating and edit as needed.
In data, this is not our reality today. Iteration on widely used analytical assets is limited for fear of breaking production, because there is no good workflow in a dashboard (or an ingestion pipeline) for branching, CI/CD, and PR review. There have been attempts to deliver this type of experience, but I haven’t seen one that accomplishes it cleanly. At some level they all seem to break down to “duplicate this thing,” which doesn’t really help, as it leads to a proliferation of disconnected assets rather than a single history and canonical version.
I don’t know exactly how to solve this problem in a way that’s consistent with the inherently-visual nature of many analytical products. I do know that if we don’t allow ourselves to more freely experiment we are fundamentally limiting our ability to form ideas. Writing is thinking, after all.
Ok that’s plenty for a newsletter. Sorry for inundating you with a wall of text :P There are a few sections of this thinking that I’ll save for my talk—sign up and join me live if you’d like to go deeper! Also, here’s a great Twitter thread on this topic:
Elsewhere on the internet…
If you are a Head of BI & Data, in most cases you should really see yourself as a Head of Questions & Answers. You want to have a good grasp on which questions your organization is asking, you want to find ways for your organization to answer questions at scale, to enable people to interpret the answers and create a culture that encourages them to ask more and better questions.
You are not Head of Dashboarding & Reporting.
🪑 Benn Stancil on why so many of our conversations about our own profession are superficial:
[The secrecy of our work product], however, makes our job as analysts a lot more difficult, especially for those just entering the field. To extend Randy Au’s woodworking analogy, the veil around analytical work forces us to talk about the saws we prefer, the types of wood we like, and the paths we can take through our apprentice program without talking about the actual chairs we build.
Totally agree. Gosh I could fill a book with the actual business insights I’ve worked on over the past decade, but that would generate rather a lot of lawsuits. So we’re all limited to talking about the more meta. My guess is that Substacks published by structural engineers talk about amazing buildings and bridges, not about how to choose a really great CAD tool.
👩‍💼 Very interesting post on the idea that dashboards don’t get created to make business decisions, they get created to address emotional needs.
Why do they ask for a dashboard, then? The answer, which this product leader delivered with a grin, is that the exec often has an emotional need to feel on top of the business. “What this means is that it's not enough to say to the exec ‘oh, you don't want a dashboard, you want a thing that compels action’; you have to think about how to address that emotional desire at the same time.”
I’m totally sympathetic to this idea and definitely find myself checking dashboards at times simply to satisfy an emotional need. But it’s also a very cynical perspective that gives very little credit to the executive persona. Not sure how to feel about this overall.
Last note before hitting send…I hope you’re as excited as I am about what’s going on in our industry right now. I had a conversation yesterday with the CDO of a 60k-employee company and he just really, really got it and wanted to help convert his peers in other companies to The Viewpoint as well.
There is really special stuff going on right now; practitioners have a voice, bigcos are joining the party, there is real movement on decades-old problems. I feel like the entire industry is a powder keg just waiting for the first big in-person industry conference to happen where we can all get together in one place and come back three days later having figured it all out :P
- Tristan :)