And that's a wrap!
Decompressing from and processing Coalesce 2023.
Where to begin? That was a week.
For those of you who weren’t following along online or in person, dbt Labs’ Coalesce conference took place over the past week, Monday through Thursday, in San Diego, Sydney, London, and online. It was our fourth Coalesce, our second in-person Coalesce, and by far the biggest and most successful event we’ve ever done.
I want to share some thoughts coming out of the week—there’s a lot to talk about.
Let’s start with product announcements. I’m really proud of what our team has shipped over the past year, and I’ve already booked a big chunk of my November to roll up my sleeves as a part-time member of our data team out of pure excitement.
First: dbt Mesh. dbt Mesh is our solution to complexity at scale. I spoke about this at length in my keynote, and I’ve written about it before. The short version is that, in order to move quickly as our analytics engineering investments grow, we need to adopt the same core principles that software engineers have: service-oriented architecture and two-pizza teams. Small groups, empowered to ship, exposing trustworthy interfaces that other teams can build on; the entire system centrally governed and universally visible.
This version of the world only works when teams have great tools to enable this workflow, and that’s what we’ve been focused on for the past year. A year ago, you couldn’t do any of this. Over the past year, we’ve built:
model access control
dbt Explorer (includes visualizing Mesh DAGs)
dbt Cloud CLI and IDE (can now execute Mesh DAGs)
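To make the workflow concrete, here’s a minimal sketch of what a Mesh-style interface looks like in dbt project config. The project and model names (`finance_platform`, `fct_orders`) are hypothetical, but the `access`, `group`, and contract properties are the real primitives:

```yaml
# Upstream project ("finance_platform", hypothetical) — models/schema.yml
groups:
  - name: finance
    owner:
      name: Finance Data Team

models:
  - name: fct_orders
    group: finance
    access: public          # other dbt projects are allowed to ref() this model
    config:
      contract:
        enforced: true      # column names and types become a stable interface
    columns:
      - name: order_id
        data_type: int
      - name: order_total
        data_type: float
```

A downstream team then declares the upstream project as a dependency and builds on the public model with a cross-project reference like `{{ ref('finance_platform', 'fct_orders') }}` — without needing to see, or care about, anything marked private behind that interface.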
It’s taken quite a lot to bring this vision fully to life, and it showcases the benefits of the integrated capabilities that we increasingly have—from programming primitives, to development environment, to state management, to orchestration and execution, to cataloging and discovery. A new paradigm like this has to be possible across all of these touch points or it just kinda…doesn’t work.
Second: dbt Explorer. dbt Explorer gives you visibility into your dbt investments. It combines data discovery, cataloging, lineage, and observability into a single product experience. It scales to arbitrarily large dbt projects (dbt Docs crashes Chrome at around 1,200 to 1,400 models because of its rather rudimentary architecture), and surfaces information on the state of your dbt project that Docs has never had any visibility into (e.g. longitudinal model runtimes, success/failure state, etc.).
Explorer has a LONG roadmap in front of it. Today, it’s already a fantastic product (and the usage data from launch day confirms that folks find it useful!), but what I’m most excited about is that this is a product surface area that will go through many fast iteration cycles. There are a million things to build (how about platform cost data? model optimization suggestions?), and now, with Explorer out in the world, we can ship each of these much smaller lifts quickly. Have things that you want to see in here? Message me in Slack.
Third: Cloud CLI. Cloud CLI is something I’ve wanted for six years. I love writing code but hate local environment configuration. We talk to users all the time whose teams are stuck on dbt versions from 2+ years ago because they can’t coordinate upgrades across hundreds of users. I also find building dev environments for increasingly large projects to be a source of frustration (and platform spend). Say goodbye to both problems.
The dbt Cloud CLI will be a massive improvement to dbt’s developer experience. Far easier install experience, no more upgrades, and automatic support for deferral, saving tons of cost and time. Because it’s all backed by state-aware dbt Cloud, we’ll also be able to ship a bunch of DX-focused features that have never before been possible.
Fourth: the dbt Semantic Layer powered by MetricFlow. We’ve been working towards this moment for two full years at this point. The Semantic Layer started out as a cool idea, then an experiment, and is now a production-ready piece of infrastructure. The hard technical problem—writing always-correct, performant, readable SQL from arbitrarily complex semantic definitions—is something that the team at Transform had already solved prior to the acquisition. Now, we’ve launched their technology, deeply integrated into the dbt developer workflow and tooling, along with brand new Tableau and Google Sheets integrations.
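For a flavor of what those semantic definitions look like, here’s a minimal, hypothetical MetricFlow spec (the `orders` and `order_total` names are illustrative) — the kind of YAML from which MetricFlow generates the SQL:

```yaml
# A semantic model layered on top of an existing dbt model,
# plus one metric defined against it.
semantic_models:
  - name: orders
    model: ref('fct_orders')        # points at an existing dbt model
    defaults:
      agg_time_dimension: ordered_at
    entities:
      - name: order_id
        type: primary
    dimensions:
      - name: ordered_at
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: order_total
        agg: sum

metrics:
  - name: revenue
    label: Revenue
    type: simple
    type_params:
      measure: order_total
```

From definitions like this, a query for `revenue` by day — whether it comes from Tableau, Google Sheets, or an API call — compiles to the same governed SQL, which is the whole point: the metric logic lives in one place instead of being re-implemented in every downstream tool.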
Because of the engagement we’ve had with the community over the past two years, we knew that this was going to be a launch that generated a ton of excitement. But we weren’t ready for just how much. On-site at Coalesce in San Diego, we hosted a ‘getting started with the dbt Semantic Layer’ workshop. The session was oversubscribed, so we had to schedule another one. Then, the backup session got oversubscribed, so we had to schedule a third! Hundreds of users got up-to-speed on the latest and greatest over the course of just a couple of days, and it came up in basically every customer conversation I had.
These were four big, new tent poles. They represent massive investments. Some of them we’ve been working on for a year, some of them for two or more years. It feels incredible to get them out into the world and I’m so proud of our team for the work they’ve put in.
Over the coming year, my expectation is that our focus will go from these huge, game-changing investments to many more rapid, iterative investments on top of what we just launched. Stay tuned for much more.
Another big topic at the event was our evolution as a business. This is a topic that I brought directly out into the spotlight with my focus on it in the keynote. I think it’s worth reiterating here, and going a little deeper in sharing some of my thinking.
Some open source projects are sponsored by big tech. They’re built to solve an internal need, then open sourced as branding within the software engineering ecosystem. Come work at Meta? Maybe. Come work alongside the team that built React? Hell yeah.
Some open source projects have sufficiently constrained surface area that they don’t need long-term maintainership, or don’t need much of it. Maybe these are maintained as passion projects, or maybe are funded through donations. (Polar, a company I am a huge fan of and angel investor in, helps scale this!)
But some open source projects, in order to achieve the thing that they want to do in the world, build companies around them. dbt is one of these, but there are many others: Databricks, Mongo, Hashicorp, Gitlab, Automattic… These are just the most well-known ones, but the list is long.
Each one of these companies started with an open source project. That open source project filled a need, and a community developed around it. The creators of the project saw a need to do more. That “more” can look like a lot of different things:
Solve bigger problems by building an end-to-end platform, not just a tool.
Solve complex infrastructure problems with cloud solutions.
Help take a fundamental innovation into parts of the market that couldn’t adopt it without additional features and support.
Make something hard simple.
In this process, these companies have to find a way to build a second product on top of their first—in Ali Ghodsi’s language, they have to “hit two grand slams instead of just one”. The first one is the original innovation that builds the community, the second one is the one that generates revenue.
The product story around open source commercialization is pretty widely understood, but what is less often talked about is the company and community story. Going through the open-source-to-commercial journey requires a company to be the host for many different tensions—both internally and externally. That company must find a way to stay true to its roots (its original users and technology) while growing beyond them. That company must keep its community-oriented culture, but layer in a focus on sales.
Going on this journey has been challenging over the past year. I don’t know if I’ve navigated it well or poorly, but I’ve tried to do so the only way I know how: transparently. In my keynote on Tuesday I shared my three priorities for us as a business right now:
Sustainability and stewardship
Open source and open standards
dbt Cloud
I want to pull a couple of quotes from my keynote here to share some additional context.
My goal is to build dbt Labs into a long-term sustainable business. That means we have to keep growing, and that means we have to be profitable. I care about this because it is the only way that we can continue to steward the dbt Community over the very long term.
dbt, as a product, is not done. Analytics engineering, as a practice, is not yet mature. There is a lot more work to do here, and I care about being in a position to continue to do that work.
I remain committed to the Apache 2.0 license. dbt Core will continue to be licensed under Apache 2.0, and we will continue to invest real dollars in maintaining and improving it for the benefit of everyone in the dbt Community.
I got asked repeatedly over the last month or so whether we plan on re-licensing dbt Core. We do not. The Apache 2.0 license is a critical part of dbt-as-a-standard. We already do, and will continue to, write proprietary code as well, but we have no intention of changing the license for dbt Core.
I want dbt Cloud to be the best way to write, run, and operate dbt.
As an industry, we’ve now seen roughly three decades of commercial open source. There are some problems that open source solves well, and there are some problems it doesn’t.
I’m excited to:
create a more seamless experience for all dbt developers
create products that understand the state of your dbt environment
enable teams to scale dbt to hundreds and thousands of dbt developers
provide SLAs and operational maturity that simply do not exist in data engineering today
All of this requires more than open source software—it requires cloud infrastructure and best-in-class usability. As we build towards the future of dbt, these are the types of problems we’re focused on solving with dbt Cloud.
I think the biggest problems in any open-source-to-commercial journey happen when there are misaligned expectations. That’s something that is ultimately fixable with better communication. I will continue to share updates on the business and our journey, because that is something that every member of the dbt community has a stake in. Whether or not you ever pay us a dime, if you care about dbt, this stuff matters.
The last thing I wanted to share were just some ruminations I’ve had as I’ve personally spoken to 100+ folks at length over the course of the week. In no particular order:
The future is not evenly distributed. Many, many forward-thinking companies are still in the very early stages of their journey with dbt and analytics engineering, and their current mechanisms for using cloud data platforms are not so different from what I observed in 2015. The stories on why this is true are always unique and bespoke. Leadership changes, a lack of urgency around data when things were going well, whatever—the point is that many of the things that you and I may have taken for granted for years are still very new for many, many companies. And not just 100-plus-year-old enterprises, either.
The name of the game right now is org change. dbt and modern data workflows are often in use by one part of an organization, and that part wants to see that usage go much, much broader. The challenges they face are primarily not technical; they are organizational. The number of conversations I had over the past week that had some element of “the problem is that that team rolls up to a completely different organization” was very high.
Data leaders want fewer vendor relationships. I’ve talked about this at length before and won’t belabor it here, but it came out very clearly this week. Relatedly, the boundaries between product categories like observability, quality, governance, cataloging, discovery, and lineage are becoming less and less clear. They were never that clear to begin with, and vendors are now increasingly overlapping with one another’s functionality.
We are in the deployment phase rather than the initial hype phase. This shifted maybe a year ago, and we’ve fully settled into it as an ecosystem. Folks who joined in because data was flashy and exciting have moved on to the next flashy and exciting thing. I am happy about this—there is real work to do, real problems to solve, as we try to make lasting progress on challenges that we’ve been talking about as an industry since the ’90s.
Related to the above—a lot of folks I spoke to are wrestling with migrations. Even getting a handle on what kind of investment it would take to decommission a legacy Informatica instance is quite challenging, much less actually doing it. These migrations are absolutely happening, but I don’t think there is a well-established playbook yet.
That’s all I got for now. Thanks, everyone, for an incredible week of Coalesce. We’re already back at work getting ready to raise the bar on the event next year—go ahead and register now if you want to get early bird pricing.
Now I’m gonna go get some sleep! :P