Notes from a week with two summits.
My issue is coming in a bit late this week—travel and holiday schedules have kept me away from my keyboard! I hope you’ve gotten some time to disconnect as well (or will soon!).
In case you missed it, the two biggest data conferences in our ecosystem happened last week. Databricks and Snowflake—competitors for the cloud data platform crown—both somehow scheduled their events at the same time. Many folks have speculated that this was intentional, one battlefront in the larger war. I think this is unlikely. It turns out that venue contracts for very large conferences are a real bear and get hammered out multiple years in advance…a lesson I have had to learn in my day job.
So my guess is that this is just how the random chance of venue availability shook out. But that hasn’t stopped a tremendous amount of the post-summit press from focusing on the inherently competitive dynamic of having your biggest event of the year at the same time your customer is doing the same thing one short direct flight away.
dbt Labs had a huge presence at each event. From what I saw, both events had significantly different energy than last year. Last year, there was a lot of macro-related trepidation in the air. Attendees I spoke to had to battle to get tickets paid for by employers and were nervous about upcoming budget tightening. This year the crowds were larger, and yes, there was plenty of focus on workload efficiency, but the overall feeling was more forward-looking. Data leaders have taken some medicine, but now feel oriented to execute in the current moment.
I don’t really want to talk about feature announcements. Yes, there were some good things that got put out on both sides, from natural platform extensions to investments in brand new LLM-powered features to acquisitions to partnerships. Instead of talking features, let me highlight three themes.
Both companies clearly see AI as existentially important and both companies are leveraging existing capabilities to build a bridge for customers to get there faster. This dynamic is definitely competitive: there is a new class of workloads that will be ramping quickly in the coming months and years and both platforms want to be the pre-eminent destination for these workloads.
Both companies are also going hard at being a place where third parties can build and distribute “data apps,” although it’s not yet fully clear to anyone exactly what a data app is. This is fine—the platforms are investing in making it possible for builders to build, distribute, and sell, and we’ll all have to see what the Angry Birds of this new platform is. Like #1 above, this dynamic is definitely competitive, as platforms will be competing for developer attention.
But here’s what I honestly believe is the most interesting dynamic to emerge from the past week’s feature announcements: both companies are ever-more-invested in the lakehouse. This development is very different from #1 and #2 above—it isn’t simply another piece of real estate that the two platforms can compete over, it is a bit of a wild card.
There are more than enough people spilling ink about both companies’ AI announcements; I think you can do without another analysis there. Plus, I’m honestly not the right person to do it. My very superficial read is simply: we are too early in this game to have good thoughts about each platform’s AI strategy and execution.
Instead I want to talk more about #2 and #3.
Data Clouds or just…Clouds?
Both companies made a lot of noise about the ability for developers to ship applications within their marketplaces, running natively inside their platforms. The main selling points from a vendor’s perspective (speaking as one!) are the ability to bypass much of the buyer infra security review process and to get deployed on top of the three main “physical clouds” in one shot.
Enterprises generally want their data vendors all running on the same cloud provider. If you’re an Azure shop, you want your SNOW/DBX running in Azure, and you probably want your other data stack vendors running in Azure too. There are often good reasons for this (ingress/egress fees!) but there are also just corporate/process reasons that are less about the laws of physics.
Now, if you can directly run your product inside of Snowflake or Databricks, it’ll automatically be available in each public cloud that those providers are running on top of, with zero additional work on your part. Pretty compelling as a vendor! And much easier to buy as a customer.
The challenge is that when you attempt to ship your product “inside of Snowflake” you’ll lose access to any cloud abstractions that are offered by AWS or Azure or GCP. Did you use Cloud Spanner? Or GKE? Or ElastiCache? These are all great services, but they are offerings of the cloud providers that you cannot use if you want to ship cross-platform code hosted inside of one of these marketplaces.
This is not bad, it’s just a tradeoff. You can’t be cross-platform without … being cross-platform.
The above isn’t that interesting, it’s just a natural outcome of the strategy. What is more interesting is that this pressure will almost definitely pull the data clouds to launch offerings that fill these gaps! In talking with some vendors who have chosen to distribute their products via these routes (we aren’t there yet), that’s exactly what is happening. It makes total sense—if you want vendors to deploy inside of your platform, you need to give them the same kind of tools that they have found to be valuable when developing inside of AWS, Azure, and GCP.
Does this mean that the lines between the data clouds and the “physical clouds” (the ones that manage real data centers) get more blurry over time? Is there a realignment to come, where the physical clouds focus more on the basics and pull back somewhat from higher-level abstractions? I have no idea, but it made me really perk up my ears. I think this is something to watch.
Platform Interop via Iceberg
Snowflake released unified Iceberg tables:
Single mode to interact with external data. Unmanaged mode where coordination of changes happens by a different system, or managed where Snowflake looks after it. Managed Iceberg performs as well as native Snowflake Tables.1
Databricks released Universal Format (UniForm):
UniForm automatically generates the metadata for all three of the formats [Delta / Iceberg / Hudi] and automatically understands what format the user is trying to read or write to.2
The most important line in the Snowflake announcement is “Managed Iceberg performs as well as native Snowflake Tables.” Previously, my understanding was that this had not been true, and so if you wanted both performance and portability on Snowflake you had to store your data twice (once natively and once in external Iceberg tables). Obviously this is suboptimal from a variety of perspectives. Now you don’t have to make the choice.
As a practitioner, I don’t honestly care about the “table format wars.” It doesn’t matter to me which table format “wins” and I’m not sure it should matter much to you either.
Similarly, I don’t really think it’s useful to follow the developments of both Databricks and Snowflake as a horse race, reporting on where the companies are relative to one another (and I’ve tried to avoid doing that here). Sure—if you’re an equity analyst, you have to make buy/sell decisions on these stocks on a relative basis. But that’s not why I’m here, and I don’t think it’s why you’re here either. We want to use these products, and we want them to continue to get better.
It’s a given that every year there will be big conferences where both Databricks and Snowflake announce a bunch of new capabilities. Both products will for sure continue to improve. What is not a given is that both products will gravitate towards greater interoperability, greater practitioner choice. That’s why I’m excited about this particular news: in a year where the narrative has been all about competition, this is a bridge between platforms, allowing them to be better-used in conjunction with one another. Cheers to that.