Hi! It’s been a minute since I’ve been able to sit down and write; while podcasting is fun (and my most recent episode with Yohei Nakajima was particularly fascinating!), there’s nothing quite like the opportunity to sit down and collect my thoughts via long-form writing. The fact that a big group of humans read this newsletter is just a side benefit; really, this is my excuse to block a big chunk of time on my calendar to do nothing but learn and think. It’s been the only hope I’ve had of staying current over the past decade as the ecosystem has seen multiple massive shifts.
I do want to take a second to say, as I dive into a newsletter issue that is about two specific vendors, what my personal stance is around talking about vendors in this space.
In this newsletter I will share news about vendors in the data space, I will extrapolate and anticipate and analyze, I will express excitement when it’s legitimately held. I will never say negative things about vendors in public. Including competitors.
I used to! When this newsletter was only read by a small group and the powers-that-be in the industry couldn’t care less what I said here, I consistently dunked on vendors that I felt weren’t innovating or were captive to an outdated mindset. Never unkindly, but very…directly.
But our world is very small, my voice now gets real attention throughout the industry, and dbt Labs partners with basically all of these vendors. You can guarantee that every time I say anything remotely negative about any vendor in the industry, I hear about it.
Fortunately, there are plenty of voices who can fill this role. If you want a “who won the summit wars?” post, this isn’t it. But I really do think there’s enough very positive stuff going on that the horserace doesn’t have to be the story.
Sorry for the long preamble, let’s get into it.
Overall
Both events were very well-attended and well-produced; IMO, this was by far the best experience I’ve had at either event. I don’t know official attendee numbers for either event, but based on my knowledge of other events and crowd sizes, I think each had to have over 10k folks in attendance.
As crowd size increases, attendee composition changes. Two years ago these were events where I would catch up with industry friends. Snowflake Summit two years ago was a bunch of MDS nerds thinking about their stacks and team structures. Databricks Summit two years ago was just starting to move beyond a group of Spark nerds. Both communities have evolved significantly. If you’re a long-timer, you may have experienced some nostalgia this year. But if you wanted to hear about new innovations and meet potential customers, these events have never delivered more.
Iceberg, Delta, and the Metastore
As much as everyone arrived at both events assuming that AI was going to be The Thing, the big f#$!ing deal this year was actually open table formats. The news items on this front:
Databricks bought Tabular, the company founded by Iceberg’s creators and the makers of a managed metastore built on top of Iceberg.
Snowflake announced a new open source metastore called Polaris based on the Iceberg REST spec. There appears to be wide cross-ecosystem support for this project, and they committed to releasing / open sourcing the code within 90 days.
Databricks open sourced Unity Catalog, a metastore that supports all of the leading table formats, including Delta, Iceberg, and Hudi. Widely used today within the Databricks ecosystem.
Both events were replete with rumors about this stuff. It’s very clear that it is not a coincidence that it all happened within a two-week timespan, but I’m not going to speculate here on exactly what happened behind-the-scenes or the motivation of the various players. Honestly who cares. What matters to me is that shared, cross-platform open file/table formats and open metastores have the ability to dramatically shift the dynamics inside of the data ecosystem.
(FYI: if the differences between file formats, table formats, and metastores are a bit of a mystery to you, this article from Starburst does a good job of explaining the differences.)
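To make that layering a bit more concrete, here’s a minimal sketch of what it looks like for a client to read an Iceberg table through a REST-based catalog (the kind of interface Polaris implements). I’m using PyIceberg here, and the endpoint, credentials, and table name are all made up; the point is just that the metastore hands out table metadata, the table format describes the underlying files, and any engine that speaks the protocol can read the same data.

```python
# A minimal sketch of reading an Iceberg table through a REST catalog.
# The endpoint URL, credentials, and table name below are hypothetical.
from pyiceberg.catalog import load_catalog

# The metastore: an Iceberg REST catalog (the interface Polaris implements).
catalog = load_catalog(
    "analytics",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com/api/catalog",  # hypothetical endpoint
        "credential": "client-id:client-secret",           # hypothetical credentials
    },
)

# The table format: Iceberg metadata describes which data files make up the table.
table = catalog.load_table("finance.orders")

# The file format: any engine (or this little client) can now scan the same files.
df = table.scan(row_filter="order_date >= '2024-01-01'").to_pandas()
print(df.head())
```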
Let me just make some broad statements that I believe to be true.
For most companies, the biggest data-related infrastructure spend is in their compute layer. It’s not in ingest or transform or BI or storage, it’s in compute. And it doesn’t matter what you’re using—a commercial product like Snowflake or Databricks or OSS Presto on raw EC2 nodes—this is nearly always true. Sometimes it can be 10x bigger than the next biggest cost.
Companies are therefore highly incentivized to exert downwards pressure on this spend. This can be a board-level priority for Fortune 500 companies.
But, this is hard to do because of data gravity. If you load all of your data into one system in one proprietary file format, you lose all of your negotiating leverage against that vendor. It is a TON of work to replatform, and potentially career-limiting for the relevant executive if done poorly.
The majority of all workloads that run in modern platforms are defined in languages / frameworks that are not specific to the platforms themselves: SQL, Python, Spark. Many of these workloads could be ported between platforms with modest code changes. This is notably different than the prior era, where logic was locked up inside things like stored procedures.
If you eliminate data lock-in and allow workloads to “travel” between platforms based on cost / performance characteristics, you create a more efficient market for workloads. This allows competition to naturally push prices down over time. Notably, both Sridhar and Ali said—on stage in their keynotes!—some version of “may the best engine win.” So this is competition that they’re both ready for and (seemingly!) welcome. I think this is truly the only customer-centric stance to have and am very happy to see both sides embracing it.
Now, this low-friction workload portability doesn’t happen automatically just because you have an open file format, table format, and metastore. From what I can tell, in order to make this a reality, you need:
1. An ability to transpile workloads between execution engines’ dialects / environments with accuracy guarantees (see the sketch below).
2. An ability to route workloads automatically between multiple execution engines.
3. An ability to decide which engine is best suited to execute a given workload.
4. A minimum shared level of support in the platforms themselves for the various table formats and metastores, with appropriate performance characteristics.
The big gatekeeper here in the past has been point #4, and the reason these past two weeks were so interesting is that they represented a major new commitment from both Snowflake and Databricks to support these open standards more completely, specifically around Iceberg. I expect that this will open up a flood of innovation over the coming months and will be watching this space closely.
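On point #1, the mechanical part of transpilation is already surprisingly approachable for SQL workloads; the hard part is the accuracy guarantees on messy real-world code. Here’s a rough sketch using sqlglot (nothing either vendor announced, just an illustration of the idea):

```python
# Illustrative only: translating a query between engine dialects with sqlglot.
import sqlglot

snowflake_sql = """
    SELECT customer_id,
           DATEADD(day, 7, order_date) AS followup_date,
           IFF(amount > 100, 'big', 'small') AS bucket
    FROM analytics.orders
"""

# Transpile the Snowflake dialect into Databricks (Spark SQL) syntax.
databricks_sql = sqlglot.transpile(
    snowflake_sql, read="snowflake", write="databricks", pretty=True
)[0]

print(databricks_sql)
# Functions like DATEADD and IFF get rewritten into the target dialect's
# equivalents, while table references pass through unchanged.
```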
GenAI
GenAI took up a lot of airtime at both events. At both, there was a lot of attendee talk about how much of this stuff is “real” (i.e. driving real consumption and customer value) vs. vendors investing ahead of where actual customers are today. My read is that there is certainly a bit of both going on, but use cases are starting to emerge. My expectation is that the coming year is the year where this balance flips and we start reading about a lot more real production use cases, because the platform features really do work at this point.
Snowflake announced a number of updates to their Cortex AI platform including:
Cortex Analyst: Allows business users to interact with data in Snowflake using natural language
Cortex Search: An enterprise search offering that uses Neeva retrieval and ranking technology with Arctic LLMs
Cortex Fine-Tuning: A suite of tools for customizing LLMs
AI & ML Studio: A no-code studio for non-technical users to build with AI and ML
Cortex Guard: Flags and filters out harmful content in your data to help ensure your LLM-powered experiences are safe and usable.
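I haven’t gone deep on all of these yet, but the general shape of Cortex is that LLM capabilities show up as functions you can call right next to your data. As a rough sketch of what that feels like from Snowpark (connection details and table names are placeholders, and I’m using the existing SNOWFLAKE.CORTEX.COMPLETE function as a stand-in for the broader suite):

```python
# Sketch: calling a Cortex LLM function from Snowpark.
# Connection parameters and table names are placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Summarize support tickets without the data ever leaving Snowflake.
summaries = session.sql(
    """
    SELECT ticket_id,
           SNOWFLAKE.CORTEX.COMPLETE(
               'mistral-large',
               CONCAT('Summarize this support ticket in one sentence: ', ticket_text)
           ) AS summary
    FROM support.tickets
    LIMIT 10
    """
).collect()

for row in summaries:
    print(row["TICKET_ID"], row["SUMMARY"])
```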
The word of the day for Databricks was “compound AI system”, and most of their updates to Mosaic AI (the rebranded MosaicML platform) had to do with how an agent-based approach to AI could improve model quality and reliability. I’m not sure I’ve ever seen a production software system, AI or not, that wasn’t “compound,” so I’m not totally sure why this is a useful distinction…? But if we’re just talking about agents, I’m all in. New features included:
Mosaic AI Agent Framework: Allows developers to build their own RAG-based applications on top of Mosaic AI Vector Search, which went GA last month
Mosaic AI Agent Evaluation: A tool for testing how well AI does in production, which includes some components from the Lilac acquisition earlier this year
Mosaic AI Model Training: Fine-tuning for open source foundation models
Mosaic AI Gateway: A unified access point for LLMs within an application, allowing customers to switch between models without writing new code
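The gateway pattern is the one I find easiest to reason about: your application talks to a named endpoint, and whatever sits behind it (DBRX, an external provider, a fine-tuned variant) can be swapped without touching application code. Here’s a rough sketch using the MLflow deployments client that Databricks builds on; the endpoint names are hypothetical and the response parsing assumes an OpenAI-style chat endpoint.

```python
# Sketch: querying model endpoints through a gateway-style deployments client.
# Endpoint names below are hypothetical.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

def answer(question: str, endpoint: str = "support-chat-llm") -> str:
    # The application only knows the endpoint name; the model behind it
    # (DBRX, an external provider, a fine-tuned variant) can change freely.
    response = client.predict(
        endpoint=endpoint,
        inputs={"messages": [{"role": "user", "content": question}]},
    )
    # Assumes an OpenAI-style chat response shape.
    return response["choices"][0]["message"]["content"]

print(answer("What does our refund policy say about partial returns?"))

# Swapping models is a config change, not a code change:
# print(answer("Same question", endpoint="support-chat-llm-dbrx"))
```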
I found it interesting that both companies made a point to highlight experiences targeted at less technical users, like Snowflake’s AI & ML Studio (where they brought a random audience member onstage to build a chatbot in real time…kinda fun!), and Databricks’ AI/BI experience.
There are some beliefs that both companies seem to share about AI, and they may be true, but it’s worth at least spelling them out:
“Enterprise data” is a critical component of the AI story. Attending these events, you’d think enterprise data practitioners sit at the center of the AI revolution; both companies want you to believe that. But given how big AI is, I think enterprise data is only a modest part of it. Will AI impact graphic designers or data analysts more?
Data gravity and security / governance will be the biggest differentiators in enterprise AI, and model quality will matter somewhat less. While Databricks and Snowflake both have their own models (DBRX and Arctic, respectively), they are not fundamentally transforming themselves into LLM training companies. This feels like a solid bet to me, at least for the next couple of years.
People want to ask questions of their data in natural language. Again, this may very well be true, but it is such a widely-assumed belief that it’s worth at least putting it out there that we could all be wrong about natural language as a good way to ask questions of data. Are we not seeing this behavior in volume today simply because the performance isn’t yet good enough? Or do people not actually want this?
NVIDIA
Both companies talked about deepening their partnership with NVIDIA. Jensen showed up as a part of the keynote at both events, wearing his now-emblematic leather jacket. I feel like a major part of Jensen’s job description these days is showing up at others’ conferences and being an AI booster.
Snowflake highlighted the integration of the NVIDIA NeMo Retriever and Inference Server into Cortex AI. I hadn’t heard of NeMo before this announcement. Looks like it’s a very new NVIDIA product that provides developer tooling, but that it’s pre-release right now. I do not have a personal opinion on whether this is needle-moving for practitioners, and the sense I got from others at the event was that a lot of folks were asking what NeMo was. Looks like we all have some learning to do.
Databricks highlighted the integration of NVIDIA’s CUDA computing platform into the Databricks stack and the availability of the DBRX open source LLM as a NIM microservice. The single line in this announcement that I was most curious about was this: “Databricks plans to develop native support for NVIDIA-accelerated computing in Databricks’ next-generation vectorized query engine, Photon, to deliver improved speed and efficiency for customers’ data warehousing and analytics workloads.” This is very interesting. I wrote many years ago about whether or not there was potential to accelerate SQL workloads with GPUs, and to-date, that has not been a meaningful thread of innovation in the industry (though if you’re curious, Voltron Data’s Theseus is worth a look). I’m not sure what the details are behind what inside of Photon is being accelerated with GPUs, but I’m very curious.
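If you want a feel for what GPU acceleration of analytical workloads looks like today, here’s a tiny illustration using RAPIDS cuDF. To be clear, this has nothing to do with Photon’s internals (which Databricks hasn’t detailed); it just shows the general idea of pushing columnar filter / group / aggregate work onto the GPU, and the file path is made up.

```python
# Illustration of GPU-accelerated analytics with RAPIDS cuDF (requires an NVIDIA GPU).
# Unrelated to Photon's internals; just the general idea of columnar work on a GPU.
import cudf

# Hypothetical Parquet file of order events.
orders = cudf.read_parquet("orders.parquet")

# A typical warehouse-style query: filter, group, aggregate, all executed on the GPU.
daily_revenue = (
    orders[orders["status"] == "completed"]
    .groupby("order_date")
    .agg({"amount": "sum", "order_id": "count"})
    .sort_index()
)

print(daily_revenue.head())
```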
Other announcements
Snowflake
Snowflake Native App Framework integration with Snowpark Container Services. Over 160 applications were launched on the Snowflake Marketplace, including dbt for Snowflake.
More tools for developers including Pandas API support for data scientists using Python (quick sketch below), Notebooks, an improved CLI, and an observability suite called Snowflake Trail.
Horizon, Snowflake’s data governance offering, added a private preview of an internal marketplace for data products, plus some additional privacy and security features.
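The pandas API support is the one I’m most likely to actually play with. The interesting part is that pandas syntax gets pushed down into Snowflake rather than pulling data out to your laptop. A very rough sketch, assuming the Snowpark pandas API shape at preview; the connection details and table name are placeholders:

```python
# Sketch of the Snowpark pandas API: pandas syntax, executed inside Snowflake.
# Connection parameters and the table name are placeholders.
from snowflake.snowpark import Session
import modin.pandas as pd
import snowflake.snowpark.modin.plugin  # registers the Snowflake backend for Modin

session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Looks like pandas, but the work is pushed down to Snowflake rather than
# materializing the full table locally.
orders = pd.read_snowflake("ANALYTICS.PUBLIC.ORDERS")
summary = orders.groupby("REGION")["AMOUNT"].sum().sort_values(ascending=False)
print(summary.head(10))
```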
If you want to go deeper on any of these announcements, Snowflake’s Cameron Wasilewsky did a fantastic writeup here. You can also dive into the Snowflake documentation on new features here.
Databricks
Databricks going 100% serverless on July 1 got a loud ovation from the crowd: no more worrying about clusters or what version of Spark you’re running.
General availability of Predictive Optimization, a capability that optimizes table data layouts for faster queries and improved performance.
Previewed LakeFlow, a three-part solution for data ingestion (LakeFlow Connect), transformation (LakeFlow Pipelines…from what I can tell this will be DLT under the hood), and orchestration (LakeFlow Jobs). The solution is rolling out in phases, starting with LakeFlow Connect, which will be in preview soon.
Previewed Databricks AI/BI, a BI experience that includes a natural language interface, called Genie, to interrogate your data. Under the hood, it uses agents to learn the semantics of your business and update its understanding of metrics on the fly, instead of relying on a static semantic layer.
- TH
Join data practitioners and leaders in Las Vegas this October at Coalesce, the Analytics Engineering Conference built by data people, for data people. Register now for early-bird tickets to save 50%. The sale ends June 17th, so don’t miss out.