Why compilers matter (w/ Lukas Schulte)
We continue our season on developer experience by looking at compilers with the SDF Labs cofounder.
Tristan Handy dives deep into the world of compilers in this episode of The Analytics Engineering Podcast with Lukas Schulte, cofounder of SDF Labs (not to be confused with last episode’s guest—Lukas’ dad and fellow SDF cofounder Wolfram Schulte). Tristan and Lukas discuss what compilers are, how they work, and what they mean for the data ecosystem. SDF, which was recently acquired by dbt Labs, builds a world-class SQL compiler aimed at abstracting away the complexity of warehouse-specific SQL.
The conversation covers the evolution of compiler technology, what software engineering has gotten right over the past several decades, and why the data ecosystem is poised for similar transformation. Lukas and Tristan explore why SQL has lagged behind other programming ecosystems, and how new compiler infrastructure could lead to package management, interoperability, and greater innovation across data platforms. It’s a fascinating (and timely) episode: Get ready for the new dbt engine.
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Join Tristan May 28 at the 2025 dbt Launch Showcase for the latest features landing in dbt to empower the next era of analytics. We'll see you there.
Chapters
02:40 The vision behind SDF Labs
04:00 What is a compiler?
05:00 Components of a compiler: frontend, IR, backend
08:00 Syntax vs. semantics and the role of parsing
10:00 Logical vs. physical plans in SQL compilers
13:00 Historical context: mainframes to LLVM
16:00 Cross-architecture portability in Rust & other compilers
18:00 What is LLVM and why it matters
20:00 Bootstrapping and the self-recursive nature of compilers
21:00 Compilers in Java, TypeScript, and dbt
23:00 Why compilers are foundational to software ecosystems
26:00 The SQL dialect problem in data warehouses
29:00 Can SQL get its own LLVM?
31:00 How Substrait and DataFusion aim to standardize SQL
35:00 Package management and the path toward SQL abstractions
38:00 The future of the data ecosystem with a common SQL compiler
Key takeaways from this episode
What is a compiler?
Tristan Handy: What is a compiler?
Lukas Schulte: It's something that takes higher-level human-readable code and translates, compiles, rewrites it into lower-level machine code that is much harder for humans to understand and much easier for machines to understand.
Compilers typically have phases. They have a frontend that deals with the language you're working with, a middle component—usually called an IR or intermediate representation—and a backend that takes that IR and compiles it into machine code.
Compiler phases: frontend, IR, backend
Tristan Handy: How does it all come together?
Lukas Schulte: There’s a preprocessor that handles macros, removes comments, and prepares the text. Then a lexer converts it into tokens. These tokens get assembled into a tree that the compiler can understand. That’s where syntax validation and semantic analysis happen.
From there, we build a logical representation of the operations we want to perform. That transitions to a physical plan, which starts considering the hardware: how many cores, how much memory, which files we’re accessing. After that, optimizations are applied and it compiles to actual machine code using a toolchain like LLVM.
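The stages Lukas walks through can be sketched end to end with a toy compiler. This Python sketch (all names invented for illustration) uses a tiny arithmetic language in place of real source code: a lexer produces tokens, a parser assembles them into a tree, the tree is lowered to a flat stack-machine IR, and a stand-in backend "executes" the IR where a real toolchain like LLVM would emit machine code:

```python
import re

# Frontend: the lexer turns source text into tokens.
def lex(src):
    token_spec = [("NUM", r"\d+"), ("OP", r"[+*]"), ("SKIP", r"\s+")]
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in token_spec)
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, src)
            if m.lastgroup != "SKIP"]

# Frontend: the parser builds a tree; * binds tighter than +.
def parse(tokens):
    def parse_term(i):
        node, i = ("num", int(tokens[i][1])), i + 1
        while i < len(tokens) and tokens[i][1] == "*":
            node, i = ("mul", node, ("num", int(tokens[i + 1][1]))), i + 2
        return node, i
    node, i = parse_term(0)
    while i < len(tokens) and tokens[i][1] == "+":
        rhs, i = parse_term(i + 1)
        node = ("add", node, rhs)
    return node

# Middle: lower the tree into a flat IR of stack-machine instructions.
def lower(node):
    if node[0] == "num":
        return [("PUSH", node[1])]
    return lower(node[1]) + lower(node[2]) + [(node[0].upper(),)]

# Backend: interpret the IR; a real backend would emit machine code instead.
def run(ir):
    stack = []
    for op, *args in ir:
        if op == "PUSH": stack.append(args[0])
        elif op == "ADD": stack.append(stack.pop() + stack.pop())
        elif op == "MUL": stack.append(stack.pop() * stack.pop())
    return stack.pop()

print(run(lower(parse(lex("1 + 2 * 3")))))  # 7
```

The payoff of the IR in the middle is the same one Lukas describes: the lexer and parser know nothing about the backend, so either end can be swapped out independently.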
Syntax vs. semantics
Lukas Schulte: Let’s break down syntax vs. semantics.
Imagine the code x = x + 1. That has valid syntax. Its meaning—its semantics—is that we’re incrementing x by 1.
Now, you could also write x += 1. Different syntax, same semantics. So syntax defines structure, and semantics define meaning. That distinction is important when you’re analyzing or transforming code.
LLVM and portability
Tristan Handy: Have we been building abstraction layers like this for decades?
Lukas Schulte: Absolutely. That’s what LLVM does. It provides a consistent intermediate representation that compilers can use to target multiple backends—Intel, ARM, different OSes. Apple invested early in LLVM to support custom chips.
With Rust, for example, LLVM is what lets us build binaries that behave the same on macOS, Windows, and Linux with relatively little effort.
Bootstrapping compilers
Tristan Handy: So there’s this recursive loop—compilers being built with other compilers?
Lukas Schulte: Exactly. Rust wasn’t always written in Rust—it started in C++. Eventually, the compiler was rewritten in Rust itself. Now, Rust compiles Rust. It’s fully self-hosted. That’s common with mature languages—it shows the compiler ecosystem is stable and powerful enough to sustain itself.
Why compilers matter
Tristan Handy: You said once that compilers are the foundation of every software ecosystem. What did you mean?
Lukas Schulte: There are two big drivers in software: abstractions and standards. You want one way to interface with a USB device—not ten. Same for software. You want one standard way to express a Python program, a JavaScript app, etc.
Compilers enforce those standards and make sure the same code works across platforms. That consistency powers things like package managers, shared libraries, and open ecosystems.
SQL dialects and fragmentation
Tristan Handy: Are there ecosystems that are doing worse than others?
Lukas Schulte: SQL does a particularly bad job. Anyone who's used more than one data warehouse knows you can't take the same SQL statement and expect it to work the same way. Casting, case sensitivity, functions—every engine handles these things differently.
Toward a universal SQL compiler
Tristan Handy: Can you convince me this problem is solvable?
Lukas Schulte: Yes. That's what we're working on with SDF—creating a shared intermediate representation for SQL. If we can express SQL logic in a unified form, we can compile it to any dialect—BigQuery, Snowflake, Redshift, and so on.
That allows developers to build reusable libraries, just like in other languages. It also makes governance, validation, and testing easier.
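A minimal sketch of the idea, not SDF's actual IR: represent an expression once in a dialect-neutral form, then have per-dialect backends emit engine-specific syntax. Here the null-coalescing function, which ANSI SQL spells COALESCE, MySQL code often writes as IFNULL, and Oracle-style SQL writes as NVL, serves as the example:

```python
# A toy dialect-neutral representation of one SQL expression: coalesce(a, b).
# All names here are illustrative, standing in for a real shared IR.
IR = ("coalesce", "a", "b")

def emit(node, dialect):
    """Compile an IR node to a SQL string for the given dialect."""
    op, *args = node
    if op == "coalesce":
        fn = {"ansi": "COALESCE", "mysql": "IFNULL", "oracle": "NVL"}[dialect]
        return f"{fn}({', '.join(args)})"
    raise NotImplementedError(op)

for d in ("ansi", "mysql", "oracle"):
    print(emit(IR, d))
# COALESCE(a, b)
# IFNULL(a, b)
# NVL(a, b)
```

The point generalizes: once logic lives in the IR rather than in dialect-specific text, a library author writes the expression once and every supported engine gets a correct translation.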
Future of data ecosystems
Tristan Handy: What would that future look like for practitioners?
Lukas Schulte: One major change would be the emergence of robust SQL libraries. Today, there’s no import system for SQL. Everyone writes similar logic over and over.
A shared compiler abstraction would let us reuse components, collaborate across companies, and build an ecosystem of packages for transformations, metrics, and validations—similar to how we use NPM or PyPI.
This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.