Becoming Pangea
Is the modern data stack about to congeal together into a single monolithic land mass?
Two things before I get into the issue!
First: Coalesce 2023 is coming up!! I am excited—last year’s event was the first time the global dbt Community was able to meet up in person and it was fantastic. The opportunity to build relationships and learn was like no event I’ve been to before. I’ll be in San Diego from 10/16 to 10/19—hope to see you there :D :D
Second: The most recent Analytics Engineering Podcast was a particular favorite of mine to record. It’s with Andy Pavlo, one of the leading brains in the field of databases, and we go all over the database landscape in the space of an hour.
Enjoy the issue!
- Tristan
Becoming Pangea
Consolidation is happening. Two years ago we hypothesized consolidation; looking back we can now validate that hypothesis. From a recent Benn post:
Big vendors, sensing customers’ interests in consolidating around fewer tools, and sensing their own interest in making more money, have started to become more acquisitive. Databricks, the Lakehouse Platform, bought MosaicML, a generative AI company. dbt Labs, the makers of dbt, bought Transform, the makers of a semantic layer. ThoughtSpot, a BI tool, bought Mode, a different sort of BI tool. Teradata, a data warehouse provider, bought Stemma, a data catalog. Alteryx, an older data prep tool, bought Trifacta, a newer data prep tool.
The article goes on to outline some specific scenarios / predictions for the direction the data & analytics ecosystem will take over the coming years, and I am mostly supportive of Benn’s take. I will say, though, that there are a few questions this post leaves unaddressed that will be important in shaping the ecosystem over the coming decade. I thought it might be fun to dive into some of those here.
What are the long-term strategic moats?
Early on in any new industry, opportunities for profit abound. You know that whole “better to sell picks and shovels” saying? Well, that’s only true until the global supply chain starts bringing picks and shovels in from a low-cost-of-labor region on container ships. It turns out that, over a long-enough time horizon, picks and shovels are a commodity. There is no way to build a strategic advantage in picks and shovels today, so the price drops to a cost-plus model, all profits get squeezed out, and all competition happens through price.[1]
I guess ‘better to own a source of sustainable competitive differentiation’ just didn’t have the same ring to it.
Moats—the qualities of a business that enable it to defend its ability to generate sustainable profits—are the heart of business strategy. It’s actually not that hard to build a business that makes money in the short term. Here’s the most common recipe in our industry: identify a trend, find something that people need to take advantage of this trend, build it, get it to them, charge money. Cool!
But if you have any success, others see that, and they say “hm, I bet I could do that X% better or Y% cheaper.” That’s when Act 2 of a startup takes place: how do you consistently win in a world where others can see exactly what you’re doing?
The best book on this topic, IMO, is Hamilton Helmer’s 7 Powers. Winning over a decade plus requires, in the terms of this book, power.
Let’s illustrate the point by looking at data ingestion for a second. This is an area where I made some incorrect assumptions many years ago, and it’s only been recently that I’ve updated my priors.
I was involved with the launch of Stitch. This project started in 2015, and the launch happened in 2016. I decided to leave my role on the exec team at the company because I believed that data ingestion was destined to be a commodity—a category with no strategic leverage that would eventually have all of its profits competed away by new entrants. I continued to believe this for years and watched as many companies started in the space. It seemed like I was right.
But somehow, Fivetran continued to win. It’s not that all of these new entrants have totally failed, but if you add up all of the market value of these companies, Fivetran has by far the lion’s share. How is that possible?
I’ll give George a lot of credit on this: he’s consistently said the same thing to me for 7 years when I’ve asked him this question. Building high quality connectors is hard, and there are a LOT of them to build and maintain. Customers highly value the quality dimension—everything needs to just work. So what you need is a huge customer base to amortize the creation and maintenance costs of these connectors over. If you have a huge customer base that is well-monetized, you can spend more on the connectors and they will therefore be of higher quality.
This is a classic example of the scale economies power. It’s similar to Amazon’s distribution network or Netflix’s ability to spend more on content creation. A new entrant can see this effect playing out but has no way to directly attack it.
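To make the amortization logic concrete, here’s a back-of-the-envelope sketch in Python. Every number in it is hypothetical; the point is just that, at a fixed price point, a bigger customer base funds a bigger per-connector budget, which is where the quality advantage comes from.

```python
# Back-of-the-envelope: how customer count changes the per-connector
# budget a vendor can afford at a fixed price point. All numbers are
# hypothetical and only illustrate the scale-economies argument.

CONNECTORS = 300                      # connectors to build and maintain
PRICE_PER_CUSTOMER_PER_YEAR = 20_000  # annual revenue per customer
ENGINEERING_SHARE = 0.4               # fraction of revenue spent on connector work

for customers in (200, 2_000, 20_000):
    revenue = customers * PRICE_PER_CUSTOMER_PER_YEAR
    budget_per_connector = revenue * ENGINEERING_SHARE / CONNECTORS
    print(f"{customers:>6} customers -> ${budget_per_connector:>10,.0f} per connector per year")
```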
This is why the most interesting competitors to Fivetran have been open source. It may be hard to replicate Fivetran’s advantage on the scale economies side now that Fivetran exists, but what if you could use another power—network economies—to compete with it? There are several folks making this bet right now.
My point, though, is that data ingestion may not be destined to be a commodity after all! Fivetran seems to be demonstrating a pretty solid moat, and scale economies tend to grow over time.
Answering the question “who has a strategic moat and who is just building software?” is, IMO, the single most useful lens for looking at how the industry takes shape moving forward.
To turn our eyes toward another category that feels unassailable behind its moats today, let me relay a conversation I recently had with a data leader. This person runs data engineering at a large, data-forward, digital native company. They have a strong penchant for running open source software. I asked “how does this preference for OSS square with your investments in both Snowflake and Databricks?” (they use both heavily).
Their answer (paraphrased): “We’re working to build a layer which will allow us to easily shift workloads across data platforms so that we can easily make workload-level decisions about where we want to run things.” Translate that line into the 7 Powers framework: “We’re intentionally reducing switching costs between these platforms so that we can treat them more like commodity compute.”
To be clear, the distance between having this idea and executing on this idea is large. And this isn’t a new thought for the platforms themselves: you can see a lot of their investments over the past few years as functionality that makes it harder to lift-and-shift workloads to another platform (exactly what they should do!). But no one is immune from needing a moat, and this book is not fully written for the data platform layer either.
When I look around at categories in the modern data stack, I see very few moats and a lot of folks building software. The categories with strong moats are the ones that are going to survive.
What standards does the stack get shaped around?
Every technology ecosystem gets built around some type of standards. Whether you’re talking about Web 1.0 with TCP/IP and HTTP, Web 2.0 with AJAX and REST, or even the standardization of shipping container sizes as a key input to the modern global economy, it all starts with standards.
The modern data stack was built around SQL. Circa 2020, SQL was probably the only standard that mattered very much. But since then, there are a bunch of other standards questions that have real implications for the future.
File Format: Delta / Iceberg / Hudi. Will these formats continue to be widely adopted, and will the lakehouse architecture continue to be demanded by customers? The ability to natively share data across data platforms via common file formats significantly reduces lock-in and creates opportunities for innovation.
Data interchange: Apache Arrow. Will we continue to standardize on Arrow? If so, the classic ODBC/JDBC limitations of SQL databases fall away. We get to re-think what happens inside vs. outside of the core engine, creating space for novel approaches to data processing. (This idea and the file format one above are sketched together in a bit of code after this list.)
Semantic layer. Will a standard for semantic information emerge? If so, users will finally be able to decouple their business logic from their reporting, freeing them to adopt a wider array of purpose-built analytical tools. This is obviously something I care a lot about.
Transformation layer / metadata format. We’ve intentionally pursued an aggressively-OSS approach to dbt since its outset in order to pursue the goal of becoming a standard in data transformation, and we’ve been quite successful at this. An interesting, and unexpected, result is that dbt’s metadata has become a standard for how tools of all types understand lineage. As more workloads continue to be built inside the dbt framework, standardized metadata makes it easier to build novel metadata-powered experiences (there’s a small lineage sketch after this list). Other metadata formats exist, but none have been as widely adopted.
Integrations. Do we ever get a ‘data connector standard’? The MDS has been trying to do that for years, starting with the Singer protocol in 2016. It would be cool, but seven years in, I haven’t seen it come together. It may be that this problem is not well-defined enough yet to be addressed by a standards-driven approach (what such a standard looks like in practice is sketched below, using Singer). I do still believe there is something here eventually though, and it would re-shape quite a lot about the current ecosystem.
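To make the file format and interchange points a little more concrete, here is a minimal sketch using Parquet (standing in for the table formats above, which layer metadata and transactions on top of files like these) and Apache Arrow. One engine writes the open-format file, a different engine queries it directly, and the result comes back as an Arrow table rather than through an ODBC/JDBC-style row interface. The specific libraries (pyarrow, duckdb) are my choice for illustration, not something from the original post.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# "Engine A": build a table in Arrow's in-memory format and persist it
# as Parquet, an open columnar file format any engine can read.
orders = pa.table({"order_id": [1, 2, 3], "amount": [10.0, 25.5, 7.25]})
pq.write_table(orders, "orders.parquet")

# "Engine B": a completely different query engine reads the same file
# directly. No export/import step, no proprietary storage format.
result = duckdb.sql("SELECT count(*) AS n, sum(amount) AS total FROM 'orders.parquet'")

# The result comes back as Arrow, so yet another tool can consume it
# without a row-by-row ODBC/JDBC-style conversion.
print(result.arrow())
```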
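On the metadata point: every dbt compile or run writes a manifest.json artifact, and that is what most lineage tooling is built on. Here is a rough sketch of the idea, assuming a compiled project sits at target/manifest.json; the exact fields vary a bit across dbt versions, so treat the key names as approximate.

```python
import json

# Sketch: derive model-level lineage edges from dbt's manifest.json.
# Field names follow recent manifest versions and may differ slightly
# across dbt releases; this is illustrative, not exhaustive.
with open("target/manifest.json") as f:
    manifest = json.load(f)

edges = []
for unique_id, node in manifest.get("nodes", {}).items():
    if node.get("resource_type") != "model":
        continue
    for parent_id in node.get("depends_on", {}).get("nodes", []):
        edges.append((parent_id, unique_id))

for parent, child in edges:
    print(f"{parent} -> {child}")
```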
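And on the connector-standard point, Singer is the closest thing we have today: a “tap” is just a program that writes newline-delimited JSON messages (SCHEMA, RECORD, STATE) to stdout, and any target that speaks the protocol can load them. A stripped-down tap looks roughly like this:

```python
import json
import sys

def emit(message: dict) -> None:
    # Singer messages are newline-delimited JSON written to stdout.
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream's shape so any Singer target can create the table.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
    "key_properties": ["id"],
})

# Emit the rows themselves. A real tap would page through a source API here.
for row in [{"id": 1, "email": "ada@example.com"}, {"id": 2, "email": "grace@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": row})

# Emit state so the next run can pick up where this one left off.
emit({"type": "STATE", "value": {"users": {"max_id": 2}}})
```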
Overall, standardization is a slow-moving yet very powerful force in the industry. Once large groups coalesce around a standard, it is incredibly hard to unseat. And it then defines the shape of the future, as successive technologies get built on top of it.
How does the relationship between the hyperscalers, the data platforms, and individual data products evolve?
Here’s a question: why does Confluent exist? Or Hashicorp? Elastic? Mongo? Dare I say, even Snowflake and Databricks?
Each of these companies is directly competitive with one or more services provided by each of the hyperscalers. And if you add up all of their market value, you’re at a pretty modest percentage of even just AWS, forget GCP and Azure.
Why don’t the hyperscalers just crush them? Is it their large, engaged communities? Fantastic developer experience? Great tech? All of these things are certainly true, and sure, they’re a part of the answer. But I think they’re insufficient. The hyperscalers are each housed within companies worth over $1T. These companies print cash and they’re looking for ways to spend it. Sufficient investment over long horizons can overcome almost any barrier.
When I talk to leaders at these companies, the answer I get is: “the one thing we have that the hyperscalers cannot replicate is that they cannot be multi-cloud.” When I first heard this, I was surprised by just how banal it was. Like…could this possibly be meaningful?
In the many years since then, that answer has been reified for me over and over again. And it turns out that enterprise technology buyers—where most of the money is in data—care about this a lot. I’ve come to believe that they’re right to do so.
It’s hard to overstate how big an effort it is to migrate technology platforms inside of a large enterprise. Expensive, time-consuming, with tremendous unquantifiable opportunity costs. These types of migrations make or break careers; they determine how competitive a business can be for a decade or more. A 500-person company can sprint through a data warehouse migration in 1-2-3 months; a 10,000+ person company needs to hire a GSI (global systems integrator), spend $1m+ in consulting fees, and wipe everything else off its roadmap for a year or more.
So leaders at large enterprises are willing to pay for strategic flexibility. And in today’s world, strategic flexibility looks like a couple of things:
Open source and open standards. If I needed to, could I run the thing myself? I likely don’t want to, but it’s strategically important to have that option on the table.
Being cross-cloud. What if I pick the wrong horse in the race? What if my provider increases prices? What if…? The more cloud-agnostic my products are, the more optionality I have.
Each of the companies I mentioned above can be run on any of the three hyperscalers—not a similar version of the product, but the same product. Snowflake doesn’t have three different SQL dialects, one per cloud provider; it has one. It is tremendously hard work to provide a single, unified experience on top of three cloud environments! But it provides enterprises with exactly what they want: flexibility.
But it’s not just about migrations. It’s also about failover (risk mitigation), workload shifting (cost optimization), and compliance. Large enterprises really do care about being multi-cloud today, and really are making buying decisions on this basis all the time.
Let’s return to the original question: how does this dynamic impact the future shape of the data industry? Here are my guesses:
The hyperscalers will all have end-to-end solutions. Many of them do already, although these services are of variable quality and usability. They will continue to get better. Just take this as a baseline. If you’re building a new data product in an existing category today, you’ll be competing with some type of managed cloud service that already exists. This significantly raises the bar for new startup creation.
There can be only one. Because of cross-cloud requirements, the presence of the hyperscalers won’t prevent outside competition, but it will put pressure on it. There is likely room for exactly one best-of-breed provider of a given type of product outside of them—this is roughly how things have played out in the larger software infra space.
Small categories won’t be able to sustain themselves. It requires a huge amount of work / time / investment to create a piece of cross-cloud data infrastructure. We are partway down this journey ourselves (dbt Cloud runs on AWS and Azure today) and it has been a much bigger lift than I had anticipated. Enterprise customers require this, but if a given category isn’t large enough to support this level of investment, it simply won’t have enterprise-ready players in it. That category is destined to get folded into other, larger, categories.
So: fewer, bigger categories will support pure-play, at-scale competitors, and there will likely be one clear winner in each category. I’ll leave you to place your own bets on what categories are large enough to sustain themselves.
==
Put all of these factors together and where do you end up?
There is still a lot more consolidation / winnowing to happen. There are too many companies without moats, and too many categories that cannot support at-scale companies.
New standards will continue to reshape the underlying forces in the industry and create opportunities for innovation.
[1] Ok, maybe I’m rounding over some minor branding or distribution details in the hardware / garden tools market, but I think the point largely stands.