The evolution of databases (w/ Wolfram Schulte)
In the first episode of our season on developer experience, the cofounder and CTO of SDF Labs, now a part of dbt Labs, discusses databases, compilers, and dev tools.
Summary
Welcome to our new season of The Analytics Engineering Podcast. This season, we're focusing on developer experience. We'll explore it by tracing the lineage of foundational software tools, platforms, and frameworks. From compilers to modern cloud infrastructure and data systems, we'll unpack how each layer of the stack shapes the way developers build, collaborate, and innovate today. It's a theme that lends itself to a lot of great conversations about where we've come from and where we're headed.
In our first episode of the season, Tristan talks with Wolfram Schulte. Wolfram is a distinguished engineer at dbt Labs. He joined the company via the acquisition of SDF Labs, where he was co-founder and CTO. Before that, he spent close to two decades at Microsoft Research and several years at Meta building its data platform.
One of the amazing things about Wolfram is his love of teaching others the things that he's passionate about. In this episode, he discusses the internal workings of data systems. He and Tristan talk about SQL parsers, compilers, execution engines, composability, and the world of heterogeneous compute that we're all headed towards. While some of this might seem a little sci-fi, it's likely right around the corner. And Wolfram is inventing some of the tech that's going to get us there.
Join Tristan on May 28 at the 2025 dbt Launch Showcase for the latest features landing in dbt to empower the next era of analytics. We'll see you there.
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Chapters
01:35 Introduction to dbt Labs and SDF Labs collaboration
04:42 Wolfram's journey from monastery to tech innovator
07:55 The role of compilers in database technology
11:05 Building efficient engineering systems at Microsoft
14:13 Navigating data complexity at Facebook
18:51 Understanding database components and their importance
24:44 The shift from row-based to column-based storage
27:40 Emergence of modular databases
28:44 The rise of multimodal databases
30:45 The role of standards in data management
35:04 Balancing optimization and interoperability
36:38 Conceptual buckets for database engines
38:46 DataFusion compared to DuckDB
40:44 ClickHouse
44:20 Bridging the gap between SQL and new technologies
50:55 The future of developer experience
Key takeaways from this episode
From monastery to Microsoft: Wolfram's journey
Tristan Handy: Can you walk us through the Wolfram Schulte origin story?
Wolfram Schulte: I was born in rural Germany, in the Sauerland, and ended up in a monastery boarding school after my father passed away. Their goal was to train monks and priests, but that didn't stick for me.
Later I went to Berlin (back then you had to cross East Germany to get there) and began studying physics. But I realized everyone else understood physics better than I did! One day I walked past a lecture on data structures and algorithms, and I was hooked. I hadn't written a line of code at that point, but I switched to computer science immediately.
After my PhD in compiler construction, I joined a startup, then landed at Microsoft Research in 1999 thanks to a chance encounter with the logician Yuri Gurevich.
Inside Microsoft Research and Cloud Build
At Microsoft Research, we were like Switzerland: neutral across teams like Office, Windows, and Bing. We'd invent tools and ideas, but often the business units didn't trust them. That changed when I was asked to build an engineering org.
We created Cloud Build, a distributed build system like Google's Bazel. It reduced build times from hours to minutes and had a huge impact on iteration speed, productivity, and even morale. People stayed in flow. Builds were faster, cheaper, and smarter, running mostly on spare capacity.
Janitorial work at Meta: cleaning up big data
You later joined Facebook (Meta). What was that like?
A different world. No titles for engineers. Egalitarian, fast-moving. I joined to clean up the data warehouse, what they called "janitorial work." At Meta, each type of workload had its own engine: time-series, batch, streaming, etc. This made understanding lineage and dependencies across systems extremely hard.
We responded by building UPM, a SQL pre-processor that stitched metadata across engines. It became part of Meta's privacy infrastructure and compliance tooling, especially after the fallout from Cambridge Analytica.
Databases as compilers
Let's shift gears. Can you walk us through how analytical databases actually work, like a professor at a whiteboard?
Sure. Think of a database like a compiler:
Parsing & analysis: Is the SQL valid? Are the types correct?
Optimization: SQL is declarative, so you can reorder joins and push down filters based on algebraic laws like associativity.
Execution: Often done in parallel, especially in modern warehouses.
Storage: Columnar vs. row-based; optimized formats like Parquet or ClickHouse's custom format.
Historically, storage and compute were bundled. Now they're decoupled. But when the engine understands the format deeply, performance is much better.
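To make the compiler analogy concrete, here is a minimal, hypothetical Python sketch of those stages: a hand-built logical plan for a simple query (standing in for parsing and analysis), one optimizer rule that pushes a filter down into the scan, and a naive executor over in-memory rows. None of this is how SDF, dbt, or any production engine is implemented; the names are invented here purely to show why rewriting a declarative plan before executing it pays off.

```python
# A toy "database as a compiler" pipeline: logical plan -> optimizer -> executor.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Scan:                      # read a table, optionally with a pushed-down predicate
    table: List[dict]
    predicate: Optional[Callable] = None

@dataclass
class Filter:                    # keep rows matching a predicate
    child: object
    predicate: Callable

@dataclass
class Project:                   # keep only some columns
    child: object
    columns: List[str]

def push_down_filters(plan):
    """Optimizer rule: a Filter sitting directly on a Scan becomes part of the Scan,
    so a real engine could skip whole chunks of data instead of reading everything."""
    if isinstance(plan, Project):
        return Project(push_down_filters(plan.child), plan.columns)
    if isinstance(plan, Filter):
        child = push_down_filters(plan.child)
        if isinstance(child, Scan) and child.predicate is None:
            return Scan(child.table, plan.predicate)
        return Filter(child, plan.predicate)
    return plan

def execute(plan):
    """Naive bottom-up interpreter over Python dicts."""
    if isinstance(plan, Scan):
        rows = plan.table
        return [r for r in rows if plan.predicate(r)] if plan.predicate else list(rows)
    if isinstance(plan, Filter):
        return [r for r in execute(plan.child) if plan.predicate(r)]
    if isinstance(plan, Project):
        return [{c: r[c] for c in plan.columns} for r in execute(plan.child)]
    raise TypeError(f"unknown plan node: {plan!r}")

users = [{"name": "Ada", "age": 36}, {"name": "Grace", "age": 28}]

# SELECT name FROM users WHERE age > 30, written directly as a logical plan
plan = Project(Filter(Scan(users), lambda r: r["age"] > 30), ["name"])
print(execute(push_down_filters(plan)))   # [{'name': 'Ada'}]
```

In a real columnar warehouse, that pushed-down predicate is what lets the scan skip entire column chunks or row groups rather than materializing every row first.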
The rise of modular and composable data platforms
How did we get from monolithic systems to the composable database architectures we have today?
It started with the rise of big data: Hadoop, HDFS, MapReduce. That decoupled compute from storage. Columnar formats like Parquet enabled analytical workloads. Then came Iceberg, Delta Lake, and similar standards that let multiple engines share the same data.
Modern databases are modular. For example, Postgres is transactional, but you can bolt on an OLAP engine for analytical queries. You can mix and match based on your workload. The result is a data ecosystem that's far more flexible, but also more complex.
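Here is a rough sketch of that decoupling, assuming the pyarrow and duckdb Python packages are installed (the file name and table are made up): one process writes data in an open columnar format, and a separate analytical engine queries the file in place, with no load step or proprietary storage in between.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Producer side: persist a small table as Parquet. Table formats like Iceberg
# or Delta Lake add a metadata/transaction layer on top of files like this.
orders = pa.table({
    "order_id": [1, 2, 3, 4],
    "region": ["EU", "US", "EU", "APAC"],
    "amount": [120.0, 80.5, 42.0, 310.0],
})
pq.write_table(orders, "orders.parquet")

# Consumer side: DuckDB scans the file where it sits; any other engine that
# reads Parquet (Spark, DataFusion, ClickHouse, ...) could use the same bytes.
print(duckdb.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM 'orders.parquet'
    GROUP BY region
    ORDER BY revenue DESC
"""))
```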
Engine families: Snowflake, DuckDB, ClickHouse
Can you help us bucket the different kinds of engines out there?
Totally. Here are three buckets:
Cloud-native engines: Snowflake, BigQuery. They're optimized for massive scale, often with their own proprietary storage.
Embedded/single-node engines: DuckDB, DataFusion. Great for local dev or embedded analytics. DuckDB is for users; DataFusion is for database builders.
Real-time/high-throughput engines: ClickHouse, Druid. Tuned for streaming and extremely fast aggregations.
Each has its trade-offs. Increasingly, projects are combining these. For example, you can plug DuckDB or DataFusion into Spark to speed up leaf-node execution. The whole engine space is getting more composable, and more interchangeable.
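To see how interchangeable those embedded engines are, the same hypothetical orders.parquet file from the earlier sketch can be queried with DataFusion's Python bindings instead of DuckDB. This assumes the datafusion package is installed; exact method names can vary between versions.

```python
from datafusion import SessionContext

# DataFusion is aimed at database builders, but its Python bindings also work
# as a standalone query engine: register a Parquet file and run SQL against it.
ctx = SessionContext()
ctx.register_parquet("orders", "orders.parquet")

revenue = ctx.sql("""
    SELECT region, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
    ORDER BY revenue DESC
""")
print(revenue.to_pandas())
```

Swapping engines while keeping the storage and the SQL essentially unchanged is exactly the composability trend Wolfram describes.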
The role of SDF in dbtâs future
If you think about the future where SDF is fully integrated into dbt Cloud, what does that enable?
Initially, it might feel the same, just faster and smarter. Longer term, we can give developers superpowers.
Imagine your dev environment proactively surfaces:
"This data looks different than yesterday. Want to investigate?"
"You're missing a metric that's often used alongside this one."
"This join will behave differently on engine X. Here's what to change."
That's the kind of intelligent, predictive developer experience we're building. We're catching SQL up to what IDEs have done for code. And if we can make logical plans portable across engines, dbt becomes the consistent interface across heterogeneous compute.
This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.