The evolution of databases (w/ Wolfram Schulte)
In the first episode of our season on developer experience, the co-founder and CTO of SDF Labs, now part of dbt Labs, discusses databases, compilers, and dev tools.
Summary
Welcome to our new season of The Analytics Engineering Podcast. This season, we’re focusing on developer experience. We’ll explore the developer experience by tracing the lineage of foundational software tools, platforms, and frameworks. From compilers to modern cloud infrastructure and data systems, we’ll unpack how each layer of the stack shapes the way developers build, collaborate, and innovate today. It’s a theme that lends itself to a lot of great conversations on where we’ve come from and where we’re headed.
In our first episode of the season, Tristan talks with Wolfram Schulte. Wolfram is a distinguished engineer at dbt Labs. He joined the company via the acquisition of SDF Labs, where he was co-founder and CTO. He spent close to two decades at Microsoft Research and several years at Meta building their data platform.
One of the amazing things about Wolfram is his love of teaching others the things that he's passionate about. In this episode, he discusses the internal workings of data systems. He and Tristan talk about SQL parsers, compilers, execution engines, composability, and the world of heterogeneous compute that we're all headed towards. While some of this might seem a little sci-fi, it’s likely right around the corner. And Wolfram is inventing some of the tech that's going to get us there.
Join Tristan May 28 at the 2025 dbt Launch Showcase for the latest features landing in dbt to empower the next era of analytics. We'll see you there.
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Chapters
01:35 Introduction to dbt Labs and SDF Labs collaboration
04:42 Wolfram's journey from monastery to tech innovator
07:55 The role of compilers in database technology
11:05 Building efficient engineering systems at Microsoft
14:13 Navigating data complexity at Facebook
18:51 Understanding database components and their importance
24:44 The shift from row-based to column-based storage
27:40 Emergence of modular databases
28:44 The rise of multimodal databases
30:45 The role of standards in data management
35:04 Balancing optimization and interoperability
36:38 Conceptual buckets for database engines
38:46 DataFusion compared to DuckDB
40:44 ClickHouse
44:20 Bridging the gap between SQL and new technologies
50:55 The future of developer experience
Key takeaways from this episode
From monastery to Microsoft: Wolfram’s journey
Tristan Handy: Can you walk us through the Wolfram Schulte origin story?
Wolfram Schulte: I was born in rural Germany—Sauerland—and ended up in a monastery boarding school after my father passed away. Their goal was to train monks and priests, but that didn’t stick for me.
Later I went to Berlin—back then you had to cross East Germany to get there—and began studying physics. But I realized everyone else understood physics better than I did! One day I walked past a lecture on data structures and algorithms, and I was hooked. I hadn’t written a line of code at that point, but I switched to computer science immediately.
After my PhD in compiler construction, I joined a startup, then landed at Microsoft Research in 1999 thanks to a chance encounter with the logician Yuri Gurevich.
Inside Microsoft Research and Cloud Build
At Microsoft Research, we were like Switzerland—neutral across teams like Office, Windows, and Bing. We’d invent tools and ideas, but often the business units didn’t trust them. That changed when I was asked to build an engineering org.
We created Cloud Build, a distributed build system like Google’s Bazel. It reduced build times from hours to minutes and had a huge impact on iteration speed, productivity, and even morale. People stayed in flow. Builds were faster, cheaper, and smarter—running mostly on spare capacity.
Janitorial work at Meta: cleaning up big data
You later joined Facebook (Meta). What was that like?
A different world. No titles for engineers. Egalitarian, fast-moving. I joined to clean up the data warehouse—what they called “janitorial work.” At Meta, each type of workload had its own engine: time-series, batch, streaming, etc. This made understanding lineage and dependencies across systems extremely hard.
We responded by building UPM, a SQL pre-processor that stitched metadata across engines. It became part of Meta’s privacy infrastructure and compliance tooling, especially after the fallout from Cambridge Analytica.
Databases as compilers
Let’s shift gears. Can you walk us through how analytical databases actually work—like a professor at a whiteboard?
Sure. Think of a database like a compiler:
Parsing & analysis: Is the SQL valid? Are the types correct?
Optimization: SQL is declarative, so the engine can reorder joins and push down filters, relying on algebraic laws like associativity and commutativity.
Execution: Often done in parallel, especially in modern warehouses.
Storage: Columnar vs. row-based; optimized formats like Parquet or ClickHouse’s custom format.
Historically, storage and compute were bundled. Now they’re decoupled. But when the engine understands the format deeply, performance is much better.
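To make that pipeline concrete, here is a minimal sketch (not from the episode) that uses DuckDB's Python API to watch those stages at work. The table and column names are invented for illustration; EXPLAIN prints the plan the optimizer chose, so you can see the filter pushed below the join before anything executes.

```python
# A hedged sketch, assuming the `duckdb` Python package (pip install duckdb).
import duckdb

con = duckdb.connect()
# Two toy tables; names and sizes are made up for the example.
con.sql("CREATE TABLE customers AS SELECT range AS id, 'c' || range::VARCHAR AS name FROM range(1000)")
con.sql("CREATE TABLE orders AS SELECT range AS id, range % 1000 AS customer_id, range * 1.5 AS amount FROM range(100000)")

# Parsing and analysis check that the SQL and its types are valid; the optimizer
# then rewrites the plan (here, pushing the amount filter below the join).
# EXPLAIN stops before execution and prints the chosen physical plan.
con.sql("""
    EXPLAIN
    SELECT c.name, sum(o.amount) AS total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE o.amount > 100
    GROUP BY c.name
""").show()
```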
The rise of modular and composable data platforms
How did we get from monolithic systems to the composable database architectures we have today?
It started with the rise of big data—Hadoop, HDFS, MapReduce. That decoupled compute from storage. Columnar formats like Parquet enabled analytical workloads. Then came Iceberg, Delta Lake, and similar standards that enabled multiple engines to share data.
Modern databases are modular. For example, Postgres is transactional, but you can bolt on an OLAP engine for analytical queries. You can mix and match based on your workload. The result is a data ecosystem that’s far more flexible—but also more complex.
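As a small, hedged illustration of that mix-and-match idea, the sketch below writes one open columnar file with Arrow and then queries it with DuckDB; any other engine that reads Parquet could work over the same bytes. The file and column names are invented for the example.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import duckdb

# The "storage" half: an engine-neutral columnar file.
events = pa.table({
    "user_id": [1, 1, 2, 3],
    "event":   ["view", "click", "view", "view"],
})
pq.write_table(events, "events.parquet")

# One of many possible "compute" halves: DuckDB querying the file in place.
print(duckdb.sql("SELECT event, count(*) AS n FROM 'events.parquet' GROUP BY event"))
```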
Engine families: Snowflake, DuckDB, ClickHouse
Can you help us bucket the different kinds of engines out there?
Totally. Here are three buckets:
Cloud-native engines: Snowflake, BigQuery. They’re optimized for massive scale, often with their own proprietary storage.
Embedded/single-node engines: DuckDB, DataFusion. Great for local dev or embedded analytics. DuckDB is for users; DataFusion is for database builders.
Real-time/high-throughput engines: ClickHouse, Druid. Tuned for streaming and extremely fast aggregations.
Each has its trade-offs. Increasingly, projects are combining these. For example, you can plug DuckDB or DataFusion into Spark to speed up leaf-node execution. The whole engine space is getting more composable—and more interchangeable.
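To show how interchangeable the single-node engines already feel, here is a hedged sketch that runs the same SQL over the same Parquet file with both DuckDB and DataFusion. It assumes the `duckdb` and `datafusion` Python packages and reuses the events.parquet file from the earlier sketch (any Parquet file with an `event` column would do).

```python
import duckdb
from datafusion import SessionContext

query = "SELECT event, count(*) AS n FROM events GROUP BY event ORDER BY n DESC"

# DuckDB: expose the file as a view, then run the query.
duckdb.sql("CREATE VIEW events AS SELECT * FROM 'events.parquet'")
print(duckdb.sql(query))

# DataFusion: same query text, different engine underneath.
ctx = SessionContext()
ctx.register_parquet("events", "events.parquet")
print(ctx.sql(query).to_pandas())
```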
The role of SDF in dbt’s future
If you think about the future where SDF is fully integrated into dbt Cloud, what does that enable?
Initially, it might feel the same—but faster, smarter. Longer-term, we can give developers superpowers.
Imagine your dev environment proactively surfaces:
“This data looks different than yesterday—want to investigate?”
“You’re missing a metric that’s often used alongside this one.”
“This join will behave differently on engine X—here’s what to change.”
That’s the kind of intelligent, predictive developer experience we’re building. We’re catching SQL up to what IDEs have done for code. And if we can make logical plans portable across engines, dbt becomes the consistent interface across heterogeneous compute.
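That last idea, one logical representation rendered for many engines, already exists in a simpler form in open source. The sketch below uses sqlglot (not what SDF or dbt ships, just an illustration of the concept) to parse one query into a dialect-neutral syntax tree and emit it for several engines; the query itself is made up.

```python
import sqlglot

sql = "SELECT user_id, CAST(created_at AS DATE) AS day FROM events LIMIT 10"

# Parse once, render for multiple target dialects.
for dialect in ["duckdb", "snowflake", "bigquery"]:
    print(dialect, "->", sqlglot.transpile(sql, read="duckdb", write=dialect)[0])
```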
This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.