I’ve been writing this newsletter since September of 2015. This will be the 10th year I’ve had the opportunity to reflect on a year gone by and make predictions about the year ahead.
In the early years (2015-2017), the data ecosystem was dominated by data science: data viz, developments in the Python and R ecosystems, posts demoing statistical techniques wrapped up in open source packages, strategies for winning Kaggle competitions, and a community of highly technical people (even if mostly they were just building ETL pipelines in notebooks 😆).
From 2018-2019, attention shifted to the fall of Hadoop, the return of SQL, the advent of analytics engineering and the dbt community, and the rise of data ops. Many of the developments in the space were led by data and data infra teams inside large digital natives, most especially Airbnb, Uber, and Netflix. Each of those three companies contributed meaningfully, in both open source code and best practices, to the data ecosystem that was built next.
From 2020-2022, attention focused on the modern data stack and the rapid growth of cloud data platforms (including Snowflake’s IPO, which gave us public data about the size of the financial opportunity). Conversation in the community was about new companies started, new fundraising events, and who was going to win which categories. Categories that had once been sleepy became the subject of much attention, and other categories were created from nowhere. Data team sizes grew quickly as companies were flush with cash from the COVID and ZIRP boom, and teams spent a ton of time updating best practices and incorporating new tooling into their stacks. Data architecture slides went from having 5-8 logos to 30+ logos.
In 2023, everything changed. Inflation drove rates up, and very quickly all parts of the economy became concerned about the chances of a recession. While the recession never materialized, an immediate pull-back in investment forced a reckoning across the software space. Cloud earnings growth dropped significantly, and downstream of that, almost all software companies started missing quarters. Cue layoffs, often impacting data teams, across big tech, growth companies, and the enterprise. As a result, attention shifted overnight from improving best practices and platforms to delivering near-term business value.
At the same time, ChatGPT launched in late 2022, and all of a sudden it was no longer clear to either software buyers or software investors which categories of software would be helped, and which hurt, by the coming AI wave. So everyone stayed on the sidelines during 2023.
The “big 5” data platforms (SNOW, DBRX, Azure, AWS, GCP) continued to chug forward, supported by an underlying megatrend: the S-curve of the move from on-prem to cloud for enterprise data. The larger climate certainly impacted the speed of this move, but the trend itself was resilient to the macro.
But in 2024, things changed again. Here are the developments I consider the biggest and most salient in the data industry over the past year.
Macro stabilized. The predicted recession didn’t happen. Layoffs dissipated, and companies began thinking more strategically and long-term about data (among other things). It was not a return to 2021, but it was a return to stability. Enterprise CIOs and CDOs were again thinking about how to drive their organizations into the future. Venture dollars were still anemic, so data companies targeting early adopters and SMBs struggled, but those targeting the enterprise saw renewed customer demand and a path to growth.
Iceberg won. It’s hard for me to tell whether this development was driven more by customers’ desire to avoid lock-in or by Ali Ghodsi’s maniacal focus on driving the Lakehouse vision, but this was the year that open table formats broke out. The coming-out party came in the back-to-back weeks of the Snowflake and Databricks summits, when both CEOs made strong public commitments to Iceberg. Immediately, the topic of Iceberg and open table formats became salient to CDOs; I have never seen such a seemingly-esoteric topic go from 0 to 60 in executive interest so fast. Over the next six months, the hyperscalers followed suit, releasing features that made it meaningfully easier to work with open table formats. If Iceberg felt like it was leading in June, by December it was clearly the winner.
AI shifts from a headwind to a (modest) tailwind. In 2023, AI was a headwind to data. In 2024, that changed. It became clear that AI and unstructured data didn’t somehow replace the need for structured data and the associated data pipelines; rather, AI became another downstream use case for existing data technology. EL companies like Fivetran and Airbyte reported an acceleration of growth as part of customers’ AI initiatives. Data platforms grew their native AI capabilities to bring AI directly into the hands of current data practitioners (think: Snowflake Cortex). Many data products shipped experiences bringing AI into the workflow of data practitioners (think: dbt Copilot). At this point it is clear that a) AI will only make data more critical, and b) data practitioners’ work will be changed and accelerated, but not disrupted, by AI (at least… not in the foreseeable future).
Consolidation is happening. M&A ramped up in the space this year, and my indicators are that it has even accelerated from H1 to H2. Data companies that raised in ‘20-’22 are running low on cash, and many do not have the traction needed to raise another round. I have personally gotten half a dozen inbounds from companies looking for M&A outcomes just over the past month or two, and I see even more of this through my angel investing. It is happening. But consolidation is not just about M&A. Many players in the space are beginning to expand into each other’s lanes organically as well. Data clouds building native EL. Observability, lineage, and catalog all smashing into one another. It feels like plate tectonics: slow, but inevitable. And we are headed towards Pangea. The question is: which companies have earned the right to become true platforms? How many will there be? And will they all be roughly carbon copies of one another, or will there be meaningfully different visions on display?
The big players got religion on semantics. SNOW, DBRX, Tableau, and more all introduced semantic layers of varying levels of maturity. If there are three primary use cases for semantic layers (internal analytics, embedded analytics, and AI), these initiatives were mostly or completely focused on AI. This was an important step towards increasing industry awareness and adoption of a long-term important technology.
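To illustrate why AI is the forcing function here, here’s a deliberately hypothetical sketch of what a semantic layer provides: a metric defined once, with machine-readable context, so a dashboard, an embedded app, and an LLM generating queries all compute the same number. The Metric class and compile_query function below are illustrative stand-ins, not any vendor’s API.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    sql_expression: str  # the governed aggregation, defined exactly once
    description: str     # natural-language context an LLM can be grounded on


NET_REVENUE = Metric(
    name="net_revenue",
    sql_expression="SUM(order_amount) - SUM(refund_amount)",
    description="Net revenue: gross order value minus refunds, in USD.",
)


def compile_query(metric: Metric, table: str, group_by: str) -> str:
    # Every consumer -- a BI tool, an embedded dashboard, or an AI agent --
    # compiles against the same governed definition, so the numbers agree.
    return (
        f"SELECT {group_by}, {metric.sql_expression} AS {metric.name} "
        f"FROM {table} GROUP BY {group_by}"
    )


print(compile_query(NET_REVENUE, "analytics.orders", "order_month"))
```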
Here’s how I believe that all of this translates into 2025:
Open table formats get implemented, fast. Few companies run Iceberg in production today. We have pretty good data on this; widespread adoption is taking some time. But with dbt Labs, EL vendors like Fivetran, the data clouds, and the hyperscalers all building features to make implementation easier, we’re going to see that adoption curve bend upwards quickly.
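For a sense of how low the barrier has gotten, here’s a minimal sketch of writing an Iceberg table from Python with PyIceberg. It assumes you already have an Iceberg REST catalog running; the endpoint and warehouse location below are placeholders.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Connect to a REST catalog (URI and warehouse are placeholder values).
catalog = load_catalog(
    "demo",
    **{
        "type": "rest",
        "uri": "http://localhost:8181",
        "warehouse": "s3://my-bucket/warehouse",
    },
)

events = pa.table({
    "event_id": pa.array([1, 2, 3], type=pa.int64()),
    "event_type": pa.array(["click", "view", "click"]),
})

catalog.create_namespace("analytics")  # errors if the namespace already exists

# Recent PyIceberg versions accept a PyArrow schema directly.
table = catalog.create_table("analytics.events", schema=events.schema)
table.append(events)  # writes Parquet data files plus Iceberg metadata

# Any Iceberg-aware engine (Spark, Trino, Snowflake, DuckDB, ...) can now
# read the same table through the catalog.
print(table.scan().to_arrow().num_rows)
```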
The rise of utility compute. Fivetran recently released the ability for customers to write Iceberg tables without paying for any underlying compute (on platforms like Snowflake and Redshift, this had previously incurred a non-trivial cost). This quote from the post is fascinating: “When you build a specialized engine for a specific workload, you can be more efficient, because you can rely on special characteristics of your workload. For example, when Fivetran built our data lake writer service, we were able to make it so efficient that we can simply absorb the ingest cost as part of our existing pricing model. Ingest is free for Fivetran data lake users.” This will happen more often with the move towards Iceberg: purpose-built engines will run very specific workloads in a highly-optimized way. I am calling this “utility compute.” It is not a substitute for the general-purpose engines that process chunky production workloads, but it does let vendors build very significant optimizations for their own narrow workloads. I would expect to see more products do what Fivetran did over the coming year.
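To make the idea concrete, here’s a toy sketch of the pattern: append-only batches landed straight to Parquet using the vendor’s own lightweight compute, with no warehouse engine in the loop. This illustrates the shape of a purpose-built writer, not Fivetran’s actual implementation; the schema and file path are made up.

```python
import pyarrow as pa
import pyarrow.parquet as pq


def ingest_batch(rows: list[dict], path: str) -> None:
    """Append-only ingest is a narrow workload: no joins, no shuffles,
    no query planning -- so a tiny specialized engine suffices."""
    batch = pa.Table.from_pylist(rows)
    pq.write_table(batch, path, compression="zstd")


ingest_batch(
    [{"id": 1, "amount": 9.99}, {"id": 2, "amount": 4.50}],
    "orders-000001.parquet",  # in practice this lands in object storage
)
```

Because the workload is this constrained, the marginal cost per row is tiny, which is what makes “absorb the ingest cost into existing pricing” economically plausible.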
A diversification of compute environments, but a unified layer on top. If open table formats encourage customer choice and enable multiple purpose-built compute engines to proliferate, there still needs to be a single pane of glass into the entire data estate. Previously, it was common for CDOs to say something like “We’re a [Snowflake/Databricks/etc.] shop.” That is much rarer today: already, over half of all enterprises use multiple data platforms. So: where does that single pane of glass shift to? Is it the metadata catalog (i.e. Unity)? That doesn’t feel right to me, although from what I can tell that’s Databricks’ bet. Is it the user-facing catalog (i.e. Alation/Collibra/Atlan)? That doesn’t feel right to me either. Watch this space; I believe this is the biggest piece of real estate being fought over in 2025.
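Here’s a deliberately hypothetical sketch of where the unifying layer sits in that world: many engines running the compute, one catalog resolving every table the same way for all of them. The SharedCatalog and Engine classes are illustrative stand-ins, not any vendor’s API.

```python
class SharedCatalog:
    """One source of truth for table locations, schemas, and permissions."""

    def __init__(self) -> None:
        self._tables: dict[str, str] = {}

    def register(self, name: str, location: str) -> None:
        self._tables[name] = location

    def resolve(self, name: str) -> str:
        return self._tables[name]  # every engine gets the same answer


class Engine:
    """Stand-in for Spark, Trino, Snowflake, DuckDB, etc."""

    def __init__(self, name: str, catalog: SharedCatalog) -> None:
        self.name, self.catalog = name, catalog

    def read(self, table: str) -> str:
        return f"{self.name} reading {self.catalog.resolve(table)}"


catalog = SharedCatalog()
catalog.register("analytics.orders", "s3://lake/analytics/orders")

# Two different engines, one consistent view of the data estate.
for engine in (Engine("spark", catalog), Engine("duckdb", catalog)):
    print(engine.read("analytics.orders"))
```

Whichever layer ends up playing the SharedCatalog role here is the single pane of glass, which is exactly the real estate in question.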
Acceleration of consolidation, the rise of end-to-end platforms, and the focus on end-to-end user workflows. Every data platform is going more end-to-end. Azure has Fabric, an integrated bundle. Databricks is going end-to-end through both acquisitions and organically built products. GCP has cared about this for years; it is what motivated the Looker acquisition. AWS and SNOW are moving in this direction as well, which is the most interesting indicator for me, because both companies have historically been notoriously against doing this. AWS’s approach has been to sell individual puzzle pieces and let developers fit them together, and Snowflake has been (to their credit!) very ecosystem-focused, letting partners solve all of the adjacent problems that weren’t fundamentally about delivering a compute offering. So: everyone is going wide, going integrated. My read: this is a recognition that the axis of competition is moving from ‘owning the workload’ (making it really hard for customers to move workloads from one platform to another) to ‘owning the user’. If open table formats are giving customers more choice about where those workloads run, vendors need to re-establish strategic power in other ways. So: go wide, own the user experience for the end-to-end data workflow. I don’t know whether this industry shift is positive or negative for practitioners; I do consider it basically inevitable.
I don’t know about you, but I’m honestly feeling optimistic about the future going into 2025. The thing I care most about in data is the ability to make consistent progress as an ecosystem, and not be stuck in a world of cyclicality, rebuilding the same set of technologies over and over again like we have for 30+ years.
The best ways to make that happen are open source and open standards, replacing cyclicality with consensus and steady forward motion. This is how software engineering has made progress for decades.
The fact that we have coalesced around a standard way of storing data, and a catalog to manage transactional consistency on top of that, is just incredibly good news for our collective future. Everything else I wrote above is downstream of that.
I hope you’re ending this year with some optimism as well. See you in 2025 :)
- Tristan