The Data Platform Data Platform

Why it's hard to talk about new things and why sometimes we sound a little silly when we try.

Jan 30, 2022

Defrag, the world’s first, only, and best “Data Platform Data Platform”, today announced that it received $42 million in Series A financing in a round led by _tbd Ventures.

At first I didn’t get it. (I Googled “Defrag Series A”!) At first you might not get it. But at some point in your read of the article you will realize it: Stephen Bailey is fucking with you. It took me a while to realize it partially because the whole idea is just so inconceivable—the modern data stack is now big enough that it has people writing sophisticated, lengthy satirical posts?!

“Look, we don’t want to build column-level lineage. We don’t want to build incident management. We don’t want to build a sleek explore experience. We want other people to build that, to build great tools, to make their share of all this data cash,” Tate says. “We want to see metadata put to good use, while also preventing the industry from coming to a halt when dbt Labs upgrades their product to version 2.0, or introduces a new type of node, or modifies their artifacts schema. Defrag can simplify and harden this ecosystem.”

If you’re not deep into this world, this is probably not that funny and you can just move on. But I just about died.

The thing is, though—it’s not just funny. The post does what all good satire does and helps us see ourselves more clearly. But let me return to that in a second.

Before I get there: this is the other post over the past week that just left me stunned. In it, David Jayatillake goes through every major category in the modern data stack and proposes exactly how tools in that category should integrate with dbt, which he terms “the interoperability layer.” He even goes so far as to spell out very specifically what he believes dbt should not do in each category. For example:

dbt should not move data

🙌

I don’t want to comment on any of the specific suggestions that David makes (there are one or two that I disagree with!), but I will say that I think he is extremely aligned with how we’d like dbt to fit into the overall ecosystem at a philosophical level. We (dbt Labs) don’t want to build data ingestion, or Airflow-style orchestration, and certainly not a BI tool.

Tristan Handy @jthandy

@frasergeorgew also when @drewbanin and I started the company our one pact with each other was "we will never build a BI tool." so...there's that.

We want to connect all of these things together. We’re interested in flows, connections, maps. It’s not that it doesn’t sound like fun to build other stuff—it does! But dbt has become this weird in-between layer that everyone expects to interoperate with everything. From David:

I now expect data vendors to integrate with dbt as a matter of course. With dbt, at this time, we have a chance to have a shared data interoperability layer that enables a vast amount of data vendors to enter the market; the foundations of a data OS.

Seeing folks think like this makes me excited! It also means that building products in adjacent layers is a dubious prospect for us (dbt Labs), as this would threaten dbt’s status as a neutral interoperability layer. I wrote about this in the same thread as above:

Tristan Handy @jthandy

@frasergeorgew I think *a lot* about ecosystem construction. How do we make partners out of as many companies as possible? What is the smallest number of things that we can do that create the most change for practitioners?

The hard part about this approach is that it’s not clear what exactly the thing is. Gartner doesn’t have a magic quadrant for it. Buyers don’t have a budget line for it. The data industry hasn’t seen something quite like it before.

While it’s increasingly clear (to practitioners) what jobs this thing has to do, what’s less clear, even to those of us who work in the space every day, is what it should be called or what it will feel like to live on the other side of this event horizon. David calls it the interoperability layer. Benn calls it the Data OS. I’ve always thought of it in my head as the “organizational knowledge graph” but…that’s probably not quite right anymore.

It’s important to know what to call something, because names really do have power. All of these categories—from catalogs to observability to lakehouses (etc)—are ideas that live in our collective heads and become real because we think them. As Chomsky would say, if we can’t name it we can’t think it.

dbt started off as a programming framework for data transformation. Along the way it’s become something more than that. This something is all centered around being the hub of organizational knowledge. Functionally it lives in the neighborhood of: batch-based data transformation, lineage and metadata, live query mutation (metrics, entities, more?). It is often not the experience you log into, but it increasingly powers that experience.

How would you talk about this?

Until we all collectively figure it out, sometimes this is what we (I?) sound like:

Tate is in it for the long haul. “No one knows what the data stack will look like in ten years, but I can guarantee you this: metadata will be the glue.”

🤦

Elsewhere on the internet…

🧱 ❄️ The Information covers the war between Snowflake and Databricks in an in-depth feature. It’s paywalled, but worth the link nonetheless. Here is my favorite bit:

“In the vast majority of accounts that we are in, we coexist with Snowflake—the overlap in accounts is massive,” Ghodsi said in an interview. “What we’ve seen is that more and more people now feel like they can actually use the data that they have in the data lake with us for data warehousing workloads. And those might have been workloads that otherwise would have gone to Snowflake.”

This is interesting: given the different backgrounds of the two products, it’s common for both to be inside a given customer at the same time. This means the winner will largely be determined by practitioners deciding where to run each individual workload, rather than by sales folks convincing executive stakeholders. With usage-based pricing for both products, it’s less about “can you land the account” and more about “do they choose your product for most of their jobs.” This feels good…competition on the merits of the products, decided upon by users.

Of course, the most likely outcome is that both products and companies are very successful:

“I think Snowflake will be very successful, and I think Databricks will be very successful,” [Ghodsi] said. “You will also see other ones pop up in the top, I’m sure, over the next three to four years. It’s just such a big market and it makes sense that lots of people would focus on going after it.”

👀 Speaking of which…Firebolt just raised a Series C :P

🥞 Incident.io wrote a great piece on the modern data stack build they recently completed. If you’ve been through the process you’ll be familiar with much of what they’ve done, but there are some good nuggets that were new to me, especially towards the end. The article talks about dbt-metabase, which I had not previously known about, and gives a detailed look at the workshops they used to roll out the new tooling to the entire company. Super useful all around.

⚖️ How do you effectively govern data? Mark Grover @ Stemma and Benn Stancil @ Mode co-wrote this piece recently in which they get into some suggestions. Here’s my favorite:

Let there be mess
With the amount of data that organizations have today, any effort to review and document it has to be focused. Quora and StackOverflow show us that questions, rather than documentation, help uncover the most important issues.
In data, we should follow the same principle: Use the questions people are asking to find data hotspots and focus our energy on those. That means some corners of your data will be messy, and some concepts will go undocumented. That’s ok, so long as there’s a method for identifying when those areas “heat up.”

I love this. IMO the hardest part about internal data governance will be constructing a process in which the right people’s attention is directed to the right places at the right time, and in which everyone has the right incentives throughout. Mark and Benn are absolutely in this same headspace.

Vicki (implicit) @vboykis

I have this working theory that there are some fundamental tools that you'll need in any technical job, across time and job descriptions and stacks, and that those three tools are version control (more specifically git), SQL, and bash.

Strong agree! Full post expounding on the idea is here.

Ergest Xheblati 🦊 @ergestx

I’ve been writing SQL for ~15 years. I’ve seen hundreds of thousands of lines of code. Over time I developed a set of patterns and best practices I always come back to when writing queries. This is my attempt to decode them 👇👇👇

Good stuff. If you’re an experienced dbt user you’ve likely internalized a lot of these lessons already, but this thread is a high-impact introduction to SQL best practices.

The Analytics Engineering Roundup
