Devops and the Modern Data Experience
Devops has been an unbelievably productive development in the world of software engineering. What do we still have to learn from it in data?
We're doing our first-ever AMA (ask me anything) episode on the podcast! Here's what I need from you: either respond to this email or to this Twitter thread with your questions. On anything: data, life, quantitative easing, whatever! I'm a very open book and am happy to go wherever you think would be interesting.
Also, the most recent episode is out! Meme lord Seth Rosen joined Julia and me. Seth IRL is much less snarky but still very insightful; the conversation was an absolute pleasure to record. Get it here.
- Tristan
What can we learn from devops?
I'm using this issue as a way to write the talk I'm giving at Future Data 2021. Peter and team, sorry that my slides are late! Seems like most things I do are late these days.
Before reading this post, make sure to read Benn's post, The Modern Data Experience. In it, he talks about the problems that the modern data stack faces today. Specifically, its failure to cater effectively to users who need data to do their jobs but do not consider themselves "data people." One of his prescriptions for the future is more integration.
To me, the modern data experience is a melting pot. Data stacks have a history of building walls to throw things over: data engineers throw pipelines at analysts; BI developers throw reports at their stakeholders; analysts throw results at anyone who will listen. Modularity can't tempt us to build more walls. Just as dbt broke down the first wall, a modern data experience needs to break down the others by encouraging collaboration and conversation between business, data, and engineering teams. In other words, in the shorthand of the moment, the modern data experience is purple.
The problem with a âbest of breedâ approach to building a stack is that the end-to-end user experience will suffer unless those best-of-breed products are very well integrated. And products in the modern data stack are not well integrated. They exchange information indirectly, via SQL tables and views. One tool creates the source tables, another tool transforms them, creating new tables, a third reads those tables and helps users analyze them. There is no shared fabric to exchange metadata beyond the names of relations and columns. As a result, using each layer feels very disjointed from using the others. Looker users have very little context for what might be happening upstream in dbt and further upstream in Fivetran, for example.
Perhaps more importantly, the data tool more widely used than any other (Excel) is completely disconnected from the modern data stack. Have you ever queried Snowflake from inside of Excel? People have done it, but rarely, and it is not a good experience.
I also want to highlight one other section of Benn's post:
(…) data cultures don't materialize out of employee handbooks or internal seminars. The structures of our technologies and organizations till the land from which they grow.
As Facebook and its algorithmic-attention-magnet compatriots have reinforced to us over these past years, the medium is the message. We are shaped by our communication media: our thoughts, our work product, and our very identities. And in data, our tooling is fundamentally collaborative. Git is a communication medium; it helps us communicate code. dbt is a medium to express transformations on datasets.
If our communication media are disconnected, we will be disconnected.
Imagine your company today as a human society where only half the population can read, one tenth can write, where half a dozen languages are spoken, and where most of the books in the library contain things that once were true but have since been outdated (but you donât know which ones). Not a highly productive information ecosystem.
If the project that we've been on for the past decade has been technical (being able to store and process large amounts of data), the project of the next decade is to create a system of knowledge creation and curation accessible to all knowledge workers. Everyone must be able to read and write, and knowledge must be organized, searchable, and reliable.
This statement of the problem doesn't seem to lead obviously to "so let's talk about devops!" Most folks outside of software engineering and infra tend to hear devops and think about technology: Kubernetes, Docker, Terraform. And while the tools are interesting, what's fascinating to me is the principles behind them.
Even though data is more aligned with devops than it was half a decade ago, I think we still have a lot to learn. My three lessons:
Create tooling that brings different kinds of people together
Refuse to click buttons
Kill your darlings
Create tooling that brings different kinds of people together
The core of devops is bringing people together, as you can intuit from the portmanteau name. "Developer operations" combines software development and IT operations to create a set of tooling / standards / workflows that both software engineers and infrastructure engineers can collaborate on.
In the language of product management, devops solves problems by designing for two personas instead of one. Software engineers previously had to file tickets and wait for infrastructure engineers to build their staging environments, deploy code, and much more. Now, both software and infrastructure engineers collaborate together using shared tooling to create and manage infrastructure. No bottlenecks or religious wars, just shared headspace and collaboration.
There are many tools in the data ecosystem that are built for multiple personas. Looker is built for three: the data modeler, the report builder, and the report consumer. These don't have to be, but often are, three distinct humans. dbt is built for two personas: the data engineer and the data analyst. It allows both of these humans to work together, combining their expertise to create stable pipelines with minimal friction.
Let's use that thinking to imagine how we could cross what is probably the biggest divide in data today: the divide between data analysts and business stakeholders:
Without a powerful, flexible tool for data consumers to self-serve, the promise of the modern data stack will forever be for a select few.
In that post, I talk about the potency of the spreadsheet UX and how the modern data stack hasn't seen such a powerful interface yet. Seen through the lens of devops, I think we can sharpen this question:
What would a truly multi-persona spreadsheet look like?
Imagine a spreadsheet that is built for both the business stakeholder and the analytics engineer. It's Excel in the cloud, plus:
a native interface to querying a data warehouse, outputting results to a sheet.
the ability to natively participate in the dbt DAG by using `ref()`, and the ability for its data to be referenced by downstream computation in the data warehouse(!).
support for classic Excel-style formulas or R and Python functions.
a file format that natively separates data and computation and presents code in a way that can be easily code-reviewed by another human.
…etc. You get the idea.
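To make the idea concrete, here is a toy sketch of that round trip: warehouse query in, formula applied, results written back as a relation that downstream computation could build on. Everything here is invented for illustration (the table names, the `with_tax` formula), and sqlite3 stands in for a real warehouse like Snowflake.

```python
import sqlite3

# Stand-in "warehouse": in a real stack this would be Snowflake or BigQuery;
# an in-memory sqlite3 database just illustrates the round trip.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

# 1. A native warehouse query, with results landing in the "sheet"
#    (modeled here as a plain list of rows).
sheet = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()

# 2. A formula applied per row: the Excel-formula / Python-function analog.
def with_tax(amount, rate=0.1):
    return round(amount * (1 + rate), 2)

enriched = [(row_id, amount, with_tax(amount)) for row_id, amount in sheet]

# 3. The sheet's output written back as a relation, so downstream SQL
#    (or a dbt model ref()-ing it) could build on top of it.
conn.execute(
    "CREATE TABLE sheet_orders_with_tax (id INTEGER, amount REAL, amount_with_tax REAL)"
)
conn.executemany("INSERT INTO sheet_orders_with_tax VALUES (?, ?, ?)", enriched)

downstream = conn.execute(
    "SELECT SUM(amount_with_tax) FROM sheet_orders_with_tax"
).fetchone()[0]
print(downstream)  # 385.0
```

The interesting part is step 3: the moment a sheet's output is a queryable relation with a known place in the DAG, it stops being a dead end and becomes part of the shared fabric.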
My point is: this is what a multi-persona spreadsheet could look like, where the two personas are the Excel user and the analytics engineer. You can see bits of this starting to come together in what Bobby Pinero and team are building at Equals.
The problem with spreadsheets has always been their complete and utter lack of operational maturity. What if adding another persona to the mix could help? What other gaps could we close by building tooling that appeals to two personas instead of one?
Refuse to click buttons
If you're constructing a system meant to produce highly replicable results, humans cannot be a part of that system. Humans are…unpredictable. Computers are much more reliable. Computers also have a rather significant cost advantage :)
In devops, all systems are described in code. This is known as infrastructure-as-code, and it is one of the most central tenets of the field. This characteristic of devops enables so many of its wonderful traits: governance (infrastructure is version-controlled), replicability in dev-test-prod, robustness…
This hard requirement that all systems be described in code means that every single tool in the modern software engineer's tool belt must be completely controllable via its API.
In practice, devops practitioners don't just open up a Python file and start writing scripts; they tend to use Terraform to express their infrastructure. And if you're familiar with dbt, you'll find Terraform pretty familiar: Terraform was the inspiration for dbt.
The interesting thing about Terraform that feels relevant to data today is that Terraform wouldn't work if the various APIs it orchestrates didn't support the right operations.
Terraform doesn't actually create and destroy EC2 instances or configure VPCs or, really, do anything itself. Rather, Terraform knows how to call APIs that AWS or GCP or Azure (or Fivetran!) provide to perform the actions it needs to perform. Terraform has a bunch of providers that understand how to call the APIs of the various services it orchestrates, and at this point most Terraform providers are community- or vendor-supported. In the early days the Hashicorp folks (who make Terraform) built the initial providers themselves, without any permission from the platforms that they were orchestrating.
This is one of the wonderful things about software engineering. If you make something programmable (via great APIs), you enable users to build layers of abstraction on top of it, finding new and exciting ways to use the product. These users can do so without needing your permission. The folks at Hashicorp never asked permission from AWS, and it's not clear that, if they had been forced to, they would've gotten it. Permissionless innovation is wonderful for an ecosystem, and it's fundamentally unlocked by APIs and standards.
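To see why the provider pattern enables this, here is a toy sketch (all names invented; this is not Terraform's actual code) of the core idea: the orchestration engine only knows a small create/destroy/list interface, diffs desired state against actual state, and anyone can implement that interface for a new service without asking the service's permission.

```python
class FakeComputeAPI:
    """Stand-in for a cloud vendor's API (think EC2): it can create,
    destroy, and list instances, and nothing more."""

    def __init__(self):
        self.instances = set()

    def create(self, name):
        self.instances.add(name)

    def destroy(self, name):
        self.instances.discard(name)

    def list(self):
        return set(self.instances)


def reconcile(desired, api):
    """Diff desired state against actual state and call the API to
    converge, which is roughly what `terraform apply` does per resource."""
    actual = api.list()
    for name in desired - actual:   # missing -> create it
        api.create(name)
    for name in actual - desired:   # no longer declared -> destroy it
        api.destroy(name)
    return api.list()


api = FakeComputeAPI()
api.create("www001")
api.create("bob-the-mail-server")  # hand-created, no longer wanted

state = reconcile({"www001", "www002"}, api)
print(sorted(state))  # ['www001', 'www002']
```

Note that `reconcile` never needed to know anything about the vendor beyond the interface. That separation is exactly what let Hashicorp wrap AWS without AWS's involvement, and what lets the community keep adding providers today.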
In data, the APIs of our leading products are not as mature. They exist, but are often secondary to the graphical user interfaces.
dbt Cloud exposes a reasonably feature-complete API, but using it doesn't present a fantastic developer experience today. This is an area of focus for us. There is a community-maintained Terraform provider, but it's not getting much use today.
Fivetran has a nascent Terraform provider (linked above), and you can see that the team is iterating on it actively (kudos!).
Looker has famously excellent API coverage for controlling the behavior of an instance, but doesn't make it possible to script the creation of an instance itself, as far as I know. There are a few community-maintained Terraform providers, but nothing that looks mature.
Snowflake, BigQuery, and Databricks all have mature APIs and Terraform providers.
So: we're not there yet across all layers of the stack, but there is movement in the right direction. Providing an excellent API should be a focus for each tool vendor in the space, and choosing tools with excellent APIs (or vendors committed to this vision) should be important to forward-thinking buyers.
Iâm genuinely excited to see what innovation will be unlocked once the entire stack is scriptable. The automation of environment creation will be only the first of many use cases.
Kill your darlings
I write a lot these days…probably more today than I ever have. You know when I wrote very little? When I had to write with pencil and paper (for me, prior to my first Squarespace account in 2003). The worst part about pencil and paper is how annoyingly hard it is to erase something. Sure, erase a word, no problem. You may have some eraser crumbs to brush away, but not an issue. What about when you realize the last two paragraphs sucked and you need to trash them? Ugh. In pencil-and-paper world, you have to transcribe the top of the page onto a new sheet of paper and then crumple up the prior version and trash it.
The funny thing about the ability to delete things is that it makes you less attached to each individual new idea you write. Ideas never come fully formed; it is in the writing, and the editing, where they come to life. Making it easy to delete makes it easy to start writing; it also makes it easier to do a good job of editing. If you're not attached to your ideas, you don't worry about killing them off (literally or figuratively). This is neatly summarized in literary tradition as "kill your darlings." Stephen King put it well:
kill your darlings, kill your darlings, even when it breaks your egocentric little scribbler's heart, kill your darlings.
Software engineering works the same way. Over the last twenty years, Agile has taken over from Waterfall as the default methodology for developing software. In the loosest possible terms, the difference between the two approaches is that Agile recognizes that it is not possible to know everything at the start of a project. Instead, it is antifragile to new information: new information makes the final outcome stronger instead of weaker. Said another way: Agile encourages editing. And generally, we should strive to practice Agile when we manage data projects.
In devops, there is a metaphor that summarizes this type of thinking: pets vs. cattle. It is rather a harsh metaphor, and one that you might not love, but it is…effective:
In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it's all hands on deck. The CEO can't get his email and it's the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it's taken out back, shot, and replaced on the line.
Devops attempts to move all systems from pets to cattle, from systems that are hand-fed to systems that can be reasoned about in large numbers. Fundamentally, this enables software engineers to hit the delete key. Don't like the way something is working out? Trash that branch and check out main. Or revert the last two commits. Want to rethink your networking? Delete that section of the config, rewrite it, run `terraform apply`. Nothing is set in stone, so get on with creating and edit as needed.
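That "trash the branch" workflow is worth making concrete, because it is exactly what dashboards lack today. Below is a toy walkthrough (Python driving git via subprocess; it assumes `git` is on your PATH, and the file and branch names are invented): experiment on a branch, decide it was wrong, and delete it with the canonical version untouched.

```python
import os
import subprocess
import tempfile

repo = tempfile.mkdtemp()

def git(*args):
    """Run a git command inside the sandbox repo and return its stdout."""
    out = subprocess.run(["git", "-C", repo, *args], check=True,
                         capture_output=True, text=True)
    return out.stdout.strip()

git("init", "-q")
git("config", "user.email", "you@example.com")
git("config", "user.name", "You")
main = git("symbolic-ref", "--short", "HEAD")  # whatever the default branch is

# The canonical version of an asset lives on the main branch.
path = os.path.join(repo, "model.sql")
with open(path, "w") as f:
    f.write("select 1\n")
git("add", "model.sql")
git("commit", "-qm", "initial model")

# Experiment freely on a branch...
git("checkout", "-qb", "risky-experiment")
with open(path, "w") as f:
    f.write("select broken\n")
git("commit", "-qam", "try something")

# ...decide it didn't work out. Trash the branch; main is untouched,
# and the history stays singular and canonical.
git("checkout", "-q", main)
git("branch", "-qD", "risky-experiment")

with open(path) as f:
    print(f.read().strip())  # select 1
```

Nothing in this flow required duplicating the asset, which is precisely the property that "duplicate this dashboard" workflows fail to provide.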
In data, this is not our reality today. Iteration on widely used analytical assets is limited by fear of breaking production, because there is no good workflow in a dashboard (or an ingestion pipeline) for branching, CI/CD, or PR review. There have been attempts to deliver this type of experience, but I haven't seen one that accomplishes it cleanly. At some level they all seem to break down to "duplicate this thing," which doesn't really help, as it leads to a proliferation of disconnected assets, not a single history and canonical version.
I don't know exactly how to solve this problem in a way that's consistent with the inherently visual nature of many analytical products. I do know that if we don't allow ourselves to experiment more freely, we are fundamentally limiting our ability to form ideas. Writing is thinking, after all.
Ok, that's plenty for a newsletter. Sorry for inundating you with a wall of text :P There are a few sections of this thinking that I'll save for my talk; sign up and join me live if you'd like to go deeper! Also, here's a great Twitter thread on this topic:
Elsewhere on the internet…
If you are a Head of BI & Data, in most cases you should really see yourself as a Head of Questions & Answers. You want to have a good grasp on which questions your organization is asking, you want to find ways for your organization to answer questions at scale, to enable people to interpret the answers and create a culture that encourages them to ask more and better questions.
You are not Head of Dashboarding & Reporting.
Benn Stancil on why so many of our conversations about our own profession are superficial:
[The secrecy of our work product], however, makes our job as analysts a lot more difficult, especially for those just entering the field. To extend Randy Au's woodworking analogy, the veil around analytical work forces us to talk about the saws we prefer, the types of wood we like, and the paths we can take through our apprentice program without talking about the actual chairs we build.
Totally agree. Gosh, I could fill a book with the actual business insights I've worked on over the past decade, but that would generate rather a lot of lawsuits. So we're all limited to talking about the more meta. My guess is that Substacks published by structural engineers talk about amazing buildings and bridges, not about how to choose a really great CAD tool.
Abhi @ Flexport on data mesh using dbt and Snowflake. Yes!! Thanks Abhi, this was awesome. Also check out Jillian's love letter to it, which contains some bookmarks to highlights.
Very interesting post on the idea that dashboards don't get created to make business decisions; they get created to address emotional needs.
Why do they ask for a dashboard, then? The answer, which this product leader delivered with a grin, is that the exec often has an emotional need to feel on top of the business. "What this means is that it's not enough to say to the exec 'oh, you don't want a dashboard, you want a thing that compels action'; you have to think about how to address that emotional desire at the same time."
I'm totally sympathetic to this idea and definitely find myself checking dashboards at times simply to satisfy an emotional need. But it's also a very cynical perspective that gives very little credit to the executive persona. Not sure how to feel about this overall.
Last note before hitting send…I hope you're as excited as I am about what's going on in our industry right now. I had a conversation yesterday with the CDO of a 60k-employee company, and he just really, really got it and wanted to help convert his peers at other companies to The Viewpoint as well.
There is really special stuff going on right now; practitioners have a voice, bigcos are joining the party, there is real movement on decades-old problems. I feel like the entire industry is a powder keg just waiting for the first big in-person industry conference to happen where we can all get together in one place and come back three days later having figured it all out :P
Soon, soon…
- Tristan :)