<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Analytics Engineering Roundup: 🎧 🆕 The Analytics Engineering Podcast]]></title><description><![CDATA[Conversations with data practitioners inventing the future of analytics engineering.

The podcast is hosted by Tristan Handy and published biweekly along with each edition of the Roundup newsletter.]]></description><link>https://roundup.getdbt.com/s/the-analytics-engineering-podcast</link><image><url>https://substackcdn.com/image/fetch/$s_!9uGH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b4e3170-43ea-4f13-8662-f4b4e18cfe12_256x256.png</url><title>The Analytics Engineering Roundup: 🎧 🆕 The Analytics Engineering Podcast</title><link>https://roundup.getdbt.com/s/the-analytics-engineering-podcast</link></image><generator>Substack</generator><lastBuildDate>Wed, 13 May 2026 06:22:40 GMT</lastBuildDate><atom:link href="https://roundup.getdbt.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[dbt Labs Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[analyticsengineeringroundup@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[analyticsengineeringroundup@substack.com]]></itunes:email><itunes:name><![CDATA[Tristan Handy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tristan Handy]]></itunes:author><googleplay:owner><![CDATA[analyticsengineeringroundup@substack.com]]></googleplay:owner><googleplay:email><![CDATA[analyticsengineeringroundup@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tristan Handy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Iceberg ecosystem today (Anders Swanson)]]></title><description><![CDATA[What can data teams realistically expect when attempting to run on top of Iceberg in production?]]></description><link>https://roundup.getdbt.com/p/the-iceberg-ecosystem-today-anders</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-iceberg-ecosystem-today-anders</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Mar 2026 13:02:53 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/youtube/w_728,c_limit/K7PvwU5ulrA" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The data industry is moving towards open standards, and the shift is happening rapidly even as AI and agents suck most of the oxygen out of the room. </p><p>The dbt Labs data team is moving to an all-Iceberg lake with a mix of compute engines to power transformation, analytics, and agentic experiences. The team has been able to move quickly towards this architecture because the entire ecosystem has been laying the groundwork for years. All of that work is now coming together to make this new open world a reality. </p><p>In this episode, Tristan discusses the reality on the ground for data practitioners. Where&#8217;s the Iceberg ecosystem today? What can practitioners realistically expect when attempting to run on top of Iceberg in production?</p><p>Tristan is joined by Anders Swanson, a developer experience advocate at dbt Labs. Anders has spent a lot of time over the years navigating open-source data ecosystems and tracking their progress. </p><p>They unpack the open standards shift, define the core building blocks (query engines, object stores, catalogs), and dig into why external catalogs have become a fourth namespace tier across platforms. Anders outlines a pragmatic, phased adoption model for Iceberg integrations, explains why metadata performance and resiliency are hard requirements, and clarifies why vended credentials exist and what they solve.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>The call for papers is open for dbt Summit 2026.</strong> We invite data practitioners, platform leaders, and executives to share real stories of how data gets done at the world&#8217;s largest gathering of dbt community members. 
If you ship fast, reduce costs, improve trust, or bring governed AI to life, the dbt community wants to hear from you.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__&quot;,&quot;text&quot;:&quot;Submit a talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__"><span>Submit a talk</span></a></p><p>Coalesce is now dbt Summit. Join the world&#8217;s largest gathering of dbt users, where data leaders and practitioners come together to shape the future of data analytics and AI. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!shpb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!shpb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 
1272w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:906976,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/190147989?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!shpb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!shpb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!shpb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" 
data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-K7PvwU5ulrA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;K7PvwU5ulrA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/K7PvwU5ulrA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: I wanted to have you on because of work you&#8217;ve been doing internally to summarize the state of the Iceberg ecosystem. 
We&#8217;ve talked about Iceberg a bunch lately with folks deep in specific parts. Your work is more of an overview: where we&#8217;re at with platform integrations, what&#8217;s easier now than a year ago, and what&#8217;s still hard. Before we dive in, I want to define a few terms. When you say &#8220;query engine,&#8221; what do you mean?</h3><p><strong>Anders Swanson:</strong> It&#8217;s the thing that does your work. When you issue a CREATE TABLE or a SELECT statement, it&#8217;s what returns data or stores it somewhere for later.</p><h3>Object store.</h3><p>It&#8217;s the cloud service where you can store an object. An object is anything: a blob.</p><h3>Catalog.</h3><p>In this context, a catalog knows what tables and views exist, where they are, and how you can fetch or write to them.</p><h3>Let&#8217;s talk internal versus external catalogs.</h3><p>An internal catalog is what you get by default in a system like Snowflake or SQL Server. An external catalog is more like another directory, often managed by a different system. As you connect more disparate platforms, you can&#8217;t assume one system controls everything.</p><h3>The complexity comes from duplication. How do you make namespaces unique? Can you plug in many external catalogs?</h3><p>Abstraction matters. A common emerging pattern is a one&#8209;to&#8209;one mapping of an external catalog into a database. That pushes platforms toward a four&#8209;part namespace: catalog, database, schema, identifier. Spark moved toward this; Databricks Unity Catalog and Snowflake&#8209;style catalog link approaches are in this family.</p><h3>So the downside?</h3><p>The devil is in the details, especially metadata performance and resiliency. Take information schema listing as an example. Users expect listing tables to be fast and reliable. In a federated world, if listing tables takes five seconds, users blame the vendor they&#8217;re using&#8212;even if the external system is slow. 
DuckDB draws a line by not mixing external catalog tables into information schema listing today. Snowflake&#8217;s catalog link databases appear to cache or mirror metadata so they feel as performant as native tables.</p><h3>With catalog link databases, Snowflake is doing mirroring.</h3><p>Yes. Mirroring exists in different flavors across platforms. Delta is sometimes seen as &#8220;simpler&#8221; because metadata can live in the object store, but as soon as you want multiple engines writing, you still need a real catalog.</p><h3>Sharing across multiple platforms adds another layer. What&#8217;s the state of platforms reading and writing to the same Iceberg catalog?</h3><p>There are phases of integration.</p><p>Phase one is the naive approach: you have Parquet and JSON in object storage, and an engine reads it. Reading is easier than writing. You can get a toy example working.</p><p>Then you run into versioning and &#8220;what&#8217;s latest.&#8221; Phase two is connecting to an Iceberg REST catalog so engines can ask for the latest table version without users thinking about paths.</p><p>Phase three is schema&#8209;scale: it&#8217;s never just one table. You need discovery of new tables, keeping schemas up to date, and eventually things like multi&#8209;table transactions.</p><h3>This maps to dbt Mesh and cross&#8209;platform mesh. Producer vs consumer.</h3><p>A consumer&#8209;led model requires the downstream team to create pointers (DDL) to external tables. It&#8217;s operationally messy. Producer&#8209;led is cleaner: the producer writes to the catalog and it&#8217;s just there, immediately queryable downstream.</p><h3>Are platforms there yet?</h3><p>Some support writing directly to external catalogs. When it works, it&#8217;s great, but there are still kinks. We&#8217;re retrofitting race cars designed for isolation to be interoperable without losing performance.</p><h3>Identity is one of the hairiest issues. 
Vended credentials.</h3><p>Vended credentials solve the &#8220;two keys&#8221; problem. You authenticate to the catalog and it tells you where data lives, but you then need separate object store credentials to read the files. With vended credentials, the catalog vends short&#8209;lived credentials so you can access the object store location without managing separate keys.</p><h3>That doesn&#8217;t solve user identity and grants.</h3><p>Correct. Vended credentials aren&#8217;t global authorization. Identity and access across platforms are still hard. Ideally you grant access once and it works everywhere, but enterprises have different identity providers and platforms have different permission models. Today, admins often have to configure grants separately in each platform.</p><h3>Is this mission creep?</h3><p>The goal is to reduce how many people have to think about storage details. Big tech had whole data platform teams solving reliability problems in Hive&#8209;era lakes. Iceberg reduces that toil dramatically, but the long tail is still auth, mirroring, and cross&#8209;platform governance.</p><h3>How does this reshape data teams?</h3><p>Analytics engineering abstracted a lot of work. Data engineering has also been simplified by replication/orchestration vendors. What remains is the open ecosystem complexity: identity, object store policies, and cross&#8209;platform connections. Many enterprises already have teams with these skills (infrastructure as code, Terraform, Snowflake management), but others will need to grow into them.</p><h3>Are vendors embracing Iceberg in good faith?</h3><p>The goodwill and collaboration of the past 18 months feel unprecedented. We&#8217;re getting &#8220;more problems&#8221; because we solved prior ones. The industry aligning on standards feels like F1 teams standardizing components so they can innovate elsewhere.</p><h3>In your internal writeup about Iceberg, you quoted Wolf Hall: &#8220;The making of a treaty is the treaty. 
It doesn&#8217;t matter what the terms are, just that there are terms, it&#8217;s the goodwill that matters. When that runs out, the treaty is broken, whatever the terms say.&#8221; Explain the relevance here. </h3><p>When I joined dbt, it was taboo to mention one partner to another. Now vendors openly acknowledge mutual customers and invest in interoperability. On the Iceberg repo you see competitors collaborating on proposals. The goodwill is the standard.</p><h3>Wrap us up with three things you&#8217;re excited for next year.</h3><p>Push&#8209;based catalog updates so platforms can subscribe to changes rather than repeatedly listing and polling. Progress on the small files problem so Iceberg works better for smaller data too. And more platforms supporting writing directly to external catalogs, unlocking producer&#8209;led sharing and cross&#8209;platform mesh.</p><h2>Chapters</h2><p>00:00:00 &#8212; Intro: why open standards are accelerating</p><p>00:01:20 &#8212; What practitioners can expect from Iceberg in production</p><p>00:05:00 &#8212; Lightning round: query engine, object store, catalog</p><p>00:06:20 &#8212; Internal vs external catalogs</p><p>00:09:30 &#8212; The &#8220;four-part namespace&#8221; and catalog-link style abstractions</p><p>00:11:30 &#8212; The downside: metadata performance, resiliency, and caching</p><p>00:17:10 &#8212; Sharing across multiple platforms: reality and tradeoffs</p><p>00:19:10 &#8212; Iceberg integration phases (1: naive table, 2: REST catalog, 3: schema-scale)</p><p>00:24:10 &#8212; Producer vs consumer model and cross-platform mesh</p><p>00:29:10 &#8212; Identity and &#8220;vended credentials&#8221;: what it is and what it isn&#8217;t</p><p>00:33:30 &#8212; The hard unsolved part: grants and global identity across platforms</p><p>00:37:00 &#8212; Is this mission creep? 
What Iceberg is optimizing for</p><p>00:39:50 &#8212; How roles on data teams evolve in an open ecosystem</p><p>00:43:40 &#8212; Are vendors genuinely aligned? Why Anders is optimistic</p><p>00:46:50 &#8212; &#8220;The making of a treaty is the treaty&#8221;: goodwill as the standard</p><p>00:51:50 &#8212; Three things Anders is excited for next year</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 80,000 data teams use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Iceberg and the catalog layer (w/ Russell Spitzer)]]></title><description><![CDATA[Everything you ever wanted to know about open table formats with a member of Apache Iceberg and Apache Polaris]]></description><link>https://roundup.getdbt.com/p/apache-iceberg-and-the-catalog-layer</link><guid isPermaLink="false">https://roundup.getdbt.com/p/apache-iceberg-and-the-catalog-layer</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 25 Jan 2026 13:59:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/wLH-vADSwaw" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this episode of The Analytics Engineering Podcast, Tristan talks with Russell Spitzer, a PMC member of Apache Iceberg and Apache Polaris and principal engineer at Snowflake. They discuss the evolution of open table formats and the catalog layer. They dig into how the Apache Software Foundation operates. And they explore where Iceberg and Polaris are headed. If you want to go deep on the tech behind open table formats, this is the conversation for you.</p><div><hr></div><p>A lot has changed in how data teams work over the past year. We&#8217;re collecting input for the <a href="https://forms.gle/KBU9smukSfiK1g4W7">2026 State of Analytics Engineering Report</a> to better understand what&#8217;s working, what&#8217;s hard, and what&#8217;s changing. 
If you&#8217;re in the middle of this work, your perspective would be valuable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://forms.gle/Jc54NuP96qekHU9j7&quot;,&quot;text&quot;:&quot;Take the survey&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://forms.gle/Jc54NuP96qekHU9j7"><span>Take the survey</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://forms.gle/DPtgXva549hevZeH7" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png" width="728" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://forms.gle/DPtgXva549hevZeH7&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a 
href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-wLH-vADSwaw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wLH-vADSwaw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wLH-vADSwaw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: You spend a lot of your time thinking about Iceberg and Polaris. Give the audience background on how you found yourself in this niche of high&#8209;volume analytic data file formats.</h3><p><strong>Russell Spitzer:</strong> It&#8217;s a bit random. I started at DataStax on Apache Cassandra as a test engineer and quickly got drawn into analytics. I saw big compute clusters and wanted to be involved. A coworker, Piotr, noticed Spark 0.9 and began a Spark&#8211;Cassandra connector. That got me into Spark. Over six to seven years I focused on moving data between Cassandra and Spark and into other systems. The interoperability problem across distributed compute frameworks was compelling.</p><p>This was pre&#8209;Apache Arrow and pre&#8209;table formats. We were just putting Parquet files everywhere and no one quite knew what they were doing. 
Pre&#8209;Spark, people explored DSLs like Apache Pig. Eventually the industry converged on SQL for end&#8209;user interfaces.</p><p>I later applied to Apple for the Spark team.</p><h3>Helping build Apple&#8217;s Spark infra, or working directly on Spark?</h3><p>Apple has an open-source Spark team and a Spark&#8209;as&#8209;infra team. I was trying to join the open source team, pushing Apple&#8217;s priorities into the project and supporting Spark as a service. During interviews, Anton&#8212;another Iceberg PMC&#8212;convinced the hiring manager I should join the data tables team, essentially Apple&#8217;s Apache Iceberg team.</p><p>They ambitiously planned to replace lots of internal systems with Iceberg. Iceberg existed but was early (Netflix started it around 2018/2019; I joined Apple in 2020). At Apple it was Iceberg all the time; convincing teams to move off older stacks, adopting open&#8209;source&#8209;as&#8209;a&#8209;service to save money, and getting onto ACID&#8209;capable foundations. We were successful.</p><h3>Migrations are hard. How did you make it accessible?</h3><p>We replaced complicated bespoke reliability fixes with Iceberg. In Hive/HDFS, small&#8209;file problems lead teams to write custom compaction and locking. Removing that toil is a big win. For big orgs, migration is a long&#8209;term investment with ongoing engineering cost. For smaller companies, the key is offloading runtime responsibilities&#8212;ideally to SaaS&#8212;so engineers aren&#8217;t in the loop. Open source limits lock&#8209;in so you can move between systems. Most companies are paid to deliver business value, not to build data infra. dbt is a great example of avoiding hand&#8209;rolled pipeline code. Same logic applies to table/file formats.</p><h3>Let&#8217;s talk Apache governance. What&#8217;s a PMC? How do projects run?</h3><p>Apache projects aren&#8217;t owned by one company. Influence is earned by contributing to the community. 
The PMC governs merges, releases, membership. People move companies; the project stays with them. The goal is to make the project broadly useful. There&#8217;s no CEO dictating roadmap and no company can change the license.</p><p>Most big projects&#8212;Spark, Kafka, Iceberg, Flink&#8212;are maintained by employees of companies with vested interests, but governance is consensus&#8209;driven. Vetoes are for technical issues (security, future&#8209;limiting design), not ideology.</p><h3>Is Iceberg for the top 20 tech companies or for everyone?</h3><p>Not everyone needs Iceberg. OLTP belongs elsewhere. But for analytics, we should move past raw Parquet partition trees with folder&#8209;name partitioning. In the Hadoop era, lakes were dumping grounds; schema evolution was painful. Many are still moving from CSV to Parquet. Over time, better encodings and table formats become default.</p><p>Decoupling compute and storage changes everything versus co&#8209;located HDFS. Defaults tuned for HDFS (like 128MB Parquet files) don&#8217;t always hold for S3. We want elastic storage and compute; no one wants to pay for compute because storage grew.</p><h3>Walk us through Iceberg versions.</h3><p>v1: transactional analytics&#8212;ACID commits instead of fragile Hive/HDFS patterns. v2: row&#8209;level operations&#8212;logical deletes via delete files so you don&#8217;t rewrite 10M&#8209;row data files to remove one row; later compaction physically purges (key for GDPR). v3: expanded types&#8212;geospatial and variant for semi&#8209;structured data; Variant was standardized across vendors and Parquet so everyone can write/read consistently.</p><p>v4: two thrusts&#8212;streaming and AI. Reduce commit latency, make retries faster under contention. Historically writes took 10&#8211;20 minutes, so commit latency didn&#8217;t matter. For streaming (writes every minute/five), it does. 
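Iceberg commits are optimistic: a writer stages new files against the snapshot it last read, and the catalog rejects the swap if another writer committed first, forcing a re-read and retry. A toy sketch of that loop follows; all names here (Catalog, append) are illustrative only, not the actual Iceberg API:

```python
import threading

class Catalog:
    """Toy catalog: a commit succeeds only if the caller saw the latest snapshot."""
    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0
        self.files = set()

    def commit(self, parent_id, add_files):
        with self._lock:
            if parent_id != self.snapshot_id:
                return False  # another writer committed first; caller must retry
            self.files |= set(add_files)
            self.snapshot_id += 1
            return True

def append(catalog, new_files, max_retries=5):
    """Optimistic append: re-read the fresh snapshot and retry on conflict."""
    for _ in range(max_retries):
        parent = catalog.snapshot_id
        if catalog.commit(parent, new_files):
            return True
    return False
```

Under infrequent batch writes, an occasional retry is cheap; with writes every minute across many streams, retries compound, which is why the protocol work focuses on cheaper conflict resolution.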
We&#8217;re evolving commit and REST catalog protocols so clients can specify intent (add these files, ensure these exist, then delete those) and let the catalog resolve conflicts server&#8209;side.</p><p>On AI: Iceberg doesn&#8217;t yet serve some vector/image&#8209;heavy patterns well. We&#8217;re exploring changes in Iceberg, Parquet, or both, without breaking existing tables.</p><h3>Talk about Polaris and the catalog layer.</h3><p>Polaris is an Apache incubator project (PPMC). Incubation proves we operate like an Apache project (community&#8209;driven, trademarks donated). Iceberg defines the REST catalog spec/client; Polaris implements a catalog that speaks that spec. Many of us work across projects (Parquet, Iceberg, Polaris), which helps align boundaries.</p><h3>Horizon, Polaris, external catalogs&#8212;what&#8217;s the story?</h3><p>We&#8217;re simplifying: Snowflake can act as an Iceberg REST catalog, or you can use an external REST catalog. External can be Polaris (managed by Snowflake or self&#8209;hosted) or another REST implementation. Interoperability means everything talks the same REST.</p><h3>What is Polaris trying to be best at?</h3><p>A broad, interoperable lakehouse catalog. It can act as a generic Spark catalog (HMS replacement) and aims to support multiple table/file formats. Architectural choices differ (KV vs. relational storage, where transactions live, policy enforcement vs. recording, identity integration). Polaris aims for base implementations that are pluggable&#8212;e.g., AWS/GCP/Microsoft identity.</p><h3>Identity and scope&#8212;where does the catalog stop?</h3><p>There&#8217;s a &#8220;business catalog&#8221; for discovery/listing versus a &#8220;system catalog&#8221; that must know table layout to govern access. Polaris can vend short&#8209;lived credentials for the exact directory of a table&#8217;s files for a load operation; that requires understanding layout. 
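The credential-vending idea can be sketched in a few lines: because the catalog knows which storage prefix holds a table's files, it can issue a short-lived token scoped to exactly that prefix. Everything below (table names, prefixes, the token shape) is hypothetical, not the Polaris wire format:

```python
import secrets
import time

# Hypothetical layout mapping the catalog maintains: table -> storage prefix.
TABLE_LAYOUT = {"sales.orders": "s3://lake/sales/orders/"}

def vend_credential(table, ttl_seconds=300):
    """Vend a short-lived credential scoped to one table's file prefix."""
    prefix = TABLE_LAYOUT[table]  # this step requires knowing the table layout
    return {
        "token": secrets.token_hex(8),
        "prefix": prefix,
        "expires_at": time.time() + ttl_seconds,
    }

def authorize(cred, path):
    """Allow access only to files under the vended prefix, before expiry."""
    return time.time() < cred["expires_at"] and path.startswith(cred["prefix"])
```

A catalog that stores only relational metadata, with no notion of file layout, cannot compute that prefix itself and has to delegate the scoping decision to another system.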
Purely relational metadata often needs to delegate that decision.</p><h3>Will identity/grants slow broad adoption?</h3><p>Possibly. But many once&#8209;complex things become default&#8212;compressed files, columnar formats, soon encryption. With collaboration (like Variant), we&#8217;ll land broadly accepted patterns.</p><h2>Chapters</h2><p>00:01:30 &#8212; Guest welcome and interview start</p><p>00:02:00 &#8212; Russell&#8217;s path: DataStax Cassandra, Spark connector, interoperability</p><p>00:05:20 &#8212; Joining Apple&#8217;s Iceberg team and early Iceberg momentum</p><p>00:06:20 &#8212; Why migrations resonated: replacing bespoke Hive/HDFS compaction/locking</p><p>00:09:10 &#8212; Apache governance 101: PMCs, consensus, and corporate influence</p><p>00:15:40 &#8212; How decisions land without votes; when vetoes apply</p><p>00:18:30 &#8212; Who needs Iceberg and where it fits</p><p>00:22:20 &#8212; Lake &#8594; lakehouse and warehouse &#8594; lakehouse in the cloud era</p><p>00:25:20 &#8212; Iceberg versions: v1 transactions, v2 row&#8209;level ops (GDPR), v3 types</p><p>00:28:10 &#8212; Standardizing Variant across vendors and Parquet</p><p>00:31:10 &#8212; Iceberg v4 goals: streaming commit/retry improvements and AI use cases</p><p>00:33:40 &#8212; Commit latency and server&#8209;side conflict resolution</p><p>00:37:20 &#8212; Polaris as an Apache incubating project (PPMC)</p><p>00:39:30 &#8212; Iceberg REST catalog spec and Polaris implementation</p><p>00:42:30 &#8212; Clarifying Snowflake Horizon, Polaris, and external REST catalogs</p><p>00:45:10 &#8212; What Polaris aims to be best at; pluggable identity providers</p><p>00:48:00 &#8212; Identity scope: business vs. system catalogs and credential vending</p><p>00:51:00 &#8212; Will identity/grants slow mass adoption?</p><p>00:52:50 &#8212; Wrap&#8209;up</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI agents and the data lake (w/ Lauren Anderson)]]></title><description><![CDATA[The head of Okta's enterprise data platform on why central governance and the semantic layer are so essential]]></description><link>https://roundup.getdbt.com/p/ai-agents-and-the-data-lake-w-lauren</link><guid isPermaLink="false">https://roundup.getdbt.com/p/ai-agents-and-the-data-lake-w-lauren</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 11 Jan 2026 14:03:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/sa-BJkM75TQ" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the interesting commonalities of AI and the data lake is that they both require new thinking around how we manage identity. For AI, the big question is how do agents interact with underlying data? For the data lake, the big question is how do we make open data stored outside the purview of any given data platform act like you&#8217;d expect?</p><p>In this episode of The Analytics Engineering Podcast, Tristan talks with Lauren Anderson, who leads the enterprise data platform at identity company Okta. Lauren discusses how identity sits at the center of two seismic shifts in data&#8212;AI agents and the open data lake&#8212;and why central governance and a shared semantic layer are critical. She lays out how analytics engineers and data engineers should divide responsibilities as agents begin to write a growing share of analytical queries. </p><div><hr></div><p>A lot has changed in how data teams work over the past year. 
We&#8217;re collecting input for the <a href="https://forms.gle/KBU9smukSfiK1g4W7">2026 State of Analytics Engineering Report</a> to better understand what&#8217;s working, what&#8217;s hard, and what&#8217;s changing. If you&#8217;re in the middle of this work, your perspective would be valuable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://forms.gle/Jc54NuP96qekHU9j7&quot;,&quot;text&quot;:&quot;Take the survey&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://forms.gle/Jc54NuP96qekHU9j7"><span>Take the survey</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://forms.gle/DPtgXva549hevZeH7" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://forms.gle/DPtgXva549hevZeH7&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" 
allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-sa-BJkM75TQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;sa-BJkM75TQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/sa-BJkM75TQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: Before we dive into the current day, can you share a little bit about your background and how you came to the role that you&#8217;re in today.</h3><p><strong>Lauren Anderson:</strong> I&#8217;ve had a 20&#8209;something year career at this point. I have basically spent my entire career in analytics some way, but my first data job was at a big bank. I won&#8217;t name it. There&#8217;s only a few big banks you could probably guess. I worked for the finance org and I did compensation planning and administration, with a side of sales tracking and analytics. 
I was part database analyst, part customer support for people who made a lot more money than I did.</p><p>I was there for seven, seven and a half, eight years. Towards the end of it, I became the owner and creator and almost business architect for our brand&#8209;new sales tracking data warehouse. At a very young age, I got to think about how relational databases should come together for the outcome of both analytics and reporting&#8212;dashboards and whatnot&#8212;but also operations, which was paying compensation every month. It got me super excited about this world of data and being able to architect pipelines and the end&#8209;to&#8209;end flow for real&#8209;world outcomes.</p><h3>What do you think allowed you to be successful in that era? I often think the things that enabled success then aren&#8217;t the same as what makes data folks successful today.</h3><p>When I took it over, we ran compensation out of an Access database. I was new, the person who designed it left, and there wasn&#8217;t much documentation. It worked the first month, then broke the second&#8212;right before a payroll deadline. I rebuilt it as a long series of SQL queries with inline comments and step&#8209;by&#8209;step checks that produced a clean file. That willingness to throw away the brittle thing and rebuild with clarity and documentation gave me early success. The meta&#8209;skills&#8212;the ability to learn, take chances, and figure out the best path&#8212;still apply, but the technology is completely different now.</p><h3>You&#8217;ve split time at Okta into two stints. How would you characterize the work?</h3><p>Okta was my first truly B2B company. I realized quickly B2B data is my sweet spot. I love thinking about customers as businesses and how business users interact with our products and features. Okta data is complex&#8212;many products, features, and highly configurable use cases&#8212;especially with large customers. That variety is exciting. 
In simpler retail flows you see a lot of the same patterns; in B2B, the variety is the appeal.</p><h3>What&#8217;s your current role?</h3><p>I lead our enterprise data platform, engineering, and architecture function. For enterprise data used to make business decisions, we own ingestion into the warehouse, transformations, and delivery&#8212;dashboards, reverse ETL to third&#8209;party applications, other data stores, and internal apps.</p><h3>How big is the central function and how do you engage with the business?</h3><p>We&#8217;re about 50 people across data engineering and analytics/data science in a company south of 7,000 employees. We support every business unit. Engagement spans a maturity curve. One end is platform self&#8209;service: teams land data via approved connectors, build transformations in dbt on our implementation, and build dashboards in Tableau we administer. Governance and roles are defined centrally, and teams assign people to those roles. The other end is a white&#8209;glove model where we partner through the full lifecycle&#8212;question, discover existing assets, requirements, data work, build, interpretation, validation, and end&#8209;of&#8209;life of the data product. Our sweet spot is the middle: we own enterprise &#8220;gold&#8221; pipelines for company&#8209;level metrics&#8212;monitored and governed&#8212;while domains build and later graduate via a path&#8209;to&#8209;production under stronger governance.</p><h3>Okta is known for identity and security. How does security&#8209;first actually work in practice?</h3><p>Reinventing controls every time slows you down. We invest in repeatable frameworks. Any new source goes through third&#8209;party risk review, classification, and decisions on masking or exclusions. We help teams through that; after a couple times, they can engage directly with risk while we stay in the loop and monitor. As our classifications and expectations got clearer, review cycles shrank from weeks to days. 
It&#8217;s not all roses&#8212;it takes time&#8212;but we all operate as security practitioners. That shared mindset builds trust and reduces corner&#8209;cutting.</p><h3>How much do users need to know?</h3><p>We don&#8217;t expect everyone to know everything. We provide dbt frameworks and minimum testing standards, plus SMEs to guide teams. The culture is to ask when unsure.</p><h3>Will agents write more analytical queries than humans in the next 12&#8211;24 months?</h3><p>Macro, yes. For us, more like 24&#8211;36 months because we&#8217;re careful. The key is safe, ethical AI consistent with being a security company.</p><h3>How are you thinking about agent access?</h3><p>Central governance. Ideally, agents query centralized, agent&#8209;ready stores. Run governance once: policies, roles for users and for data, tracking and logging on a central plane. The semantic layer is essential. Creating semantic views must get easier and more automated, and semantics should inform policy application.</p><h3>Why are agents different from humans in access patterns?</h3><p>Row&#8209;level security to the extreme. Conversational intelligence data should be limited to what the requesting user can access. Aggregations could be broadly accessible with anonymization, but detailed content should remain constrained. You might also limit allowed functions on large unstructured objects. Identity for agents matters&#8212;Okta Secures AI looks at distinct identity patterns to secure agents across applications.</p><h3>Where are you with MCP and agent building?</h3><p>Early, building support and insight use cases. Progress is fast, but nothing broad in production yet.</p><h3>How should analytics engineers and data engineers participate?</h3><p>Analytics engineers should own semantics&#8212;tooling, vendor choices, onboarding use cases, and the shared business language. 
Data engineers should optimize for consistency and scale, notice overlap across agents, and provide a platform others can build on with confidence in governance and security.</p><h3>Will you standardize an agent development platform?</h3><p>Yes, in partnership with engineering and shared services. Our current pull skews to the business, so we&#8217;re leaning toward accessible, governed platforms that serve both business and engineering with central governance.</p><h3>Any assumptions you&#8217;re rethinking?</h3><p>Treating everything like a relational model. Many initial agent questions are intentionally simple, where speed and reasonable accuracy trump perfect sophistication. The important thing is to start, observe, and mature.</p><h2>Chapters</h2><p>00:02:28 &#8212; From bank analytics to owning a sales DW</p><p>00:05:00 &#8212; Rebuilding brittle Access &#8594; SQL with documented checks</p><p>00:08:30 &#8212; Ops accountability then vs. optimization today</p><p>00:11:00 &#8212; TripIt, marketing analytics, and moving into tech</p><p>00:13:14 &#8212; Why B2B data became Lauren&#8217;s sweet spot</p><p>00:16:00 &#8212; Current role: ingestion &#8594; transform &#8594; delivery at Okta</p><p>00:18:10 &#8212; Operating models across business units and the path to production</p><p>00:22:20 &#8212; Security-first in practice: repeatable frameworks over friction</p><p>00:24:23 &#8212; Third&#8209;party risk, classification, and shrinking review cycles</p><p>00:28:00 &#8212; Policies, masking, and the need for a central governance plane</p><p>00:30:20 &#8212; Frameworks for dbt, testing, and SME guidance</p><p>00:32:11 &#8212; Will agents outwrite humans? 
Macro yes; Okta timeline nuance</p><p>00:33:48 &#8212; Central governance and agent access patterns</p><p>00:37:19 &#8212; Semantic layer as bridge and policy carrier</p><p>00:41:00 &#8212; Function limits on unstructured data and Okta Secures AI</p><p>00:42:35 &#8212; Early MCP experimentation and support use cases</p><p>00:43:03 &#8212; Roles: analytics engineers (semantics) and data engineers (scale)</p><p>00:46:10 &#8212; Enabling an org-wide agent platform with shared governance</p><p>00:47:43 &#8212; Solve governance once, serve business and engineering</p><p>00:49:30 &#8212; Simpler questions first; rethinking relational assumptions</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Inside Snowflake’s AI roadmap (w/ Chris Child)]]></title><description><![CDATA[Snowflake's VP of Product Management on the vision for open table formats, governed agents, and the future of the data engineer]]></description><link>https://roundup.getdbt.com/p/inside-snowflakes-ai-roadmap-w-chris</link><guid isPermaLink="false">https://roundup.getdbt.com/p/inside-snowflakes-ai-roadmap-w-chris</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 14 Dec 2025 14:06:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/5Yo0chBWt2c" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This season of The Analytics Engineering Podcast is focused on how the current data landscape is impacting the developer experience. Snowflake plays a major role in what that developer experience looks like. </p><p>In this episode, Snowflake VP of Product Management Chris Child joins Tristan to unpack Snowflake&#8217;s AI roadmap and what it means for data teams. 
They discuss the evolution from Snowpark to <a href="https://docs.getdbt.com/blog/semantic-layer-cortex">Cortex</a> and <a href="https://www.getdbt.com/blog/what-is-snowflake-intelligence-anyway">Snowflake Intelligence</a>, how to <a href="https://www.getdbt.com/blog/bring-structured-context-to-agentic-data-development-with-dbt">govern agents </a>with row- and column-level controls, and why Snowflake is investing in <a href="https://www.getdbt.com/blog/iceberg-give-it-a-rest">Apache Iceberg</a> and the <a href="https://www.snowflake.com/en/blog/open-semantic-interchange-ai-standard/">Open Semantic Interchange initiative</a>. dbt Labs recently open sourced <a href="https://www.getdbt.com/blog/open-source-metricflow-governed-metrics">MetricsFlow</a>, the technology that powers the dbt Semantic Layer, to align with the goals of OSI. </p><p>Chris also shares a vision for the next five years of data engineering: fewer bespoke pipelines, more standardization and semantics, and a bigger focus on business context and data products.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://docs.getdbt.com/docs/install-dbt-extension&quot;,&quot;text&quot;:&quot;Check out the dbt VS Code extension&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://docs.getdbt.com/docs/install-dbt-extension"><span>Check out the dbt VS Code extension</span></a></p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, 
Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-5Yo0chBWt2c" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;5Yo0chBWt2c&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/5Yo0chBWt2c?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: Where have you spent your time professionally?</h3><p><strong>Chris Child:</strong> I didn&#8217;t end up in data on purpose. I found myself here through a series of hops. I was working at Redpoint Ventures and got excited by a company we invested in, RelateIQ. I left to join RelateIQ, building an intelligent CRM. 
We captured emails and meetings and built profiles of everyone you interacted with. We were acquired by Salesforce. Looking at what sales teams needed, I realized they also needed product usage data, marketing data, and campaign data, with a platform to pull it all together. That led me to Segment. I joined when it was about 50 people. Segment was mostly analytics.js then, loading different JavaScript on your webpage for tracking. We had just built the first warehouse connector to Redshift and got huge usage sending click and user data to Redshift.</p><h3>The original Redshift connector was a nightmare to work with.</h3><p>Like many startup things, one engineer built it in a week. Suddenly a ton of people used it, and enterprise customers depended on it. We had to rebuild it several times. You could see the future there. Folks I worked with went on to start companies like Census and Hightouch, thinking the CDP should be built on top of the warehouse, which Segment evolved toward. We also built a Snowflake connector because customers demanded it in addition to Redshift.</p><h3>It&#8217;s funny to think back a decade to how small Snowflake was.</h3><p>A couple customers demanded it; we built it, and we were sending a ton of data. That led to the realization that a customer data platform is one instance of a data warehouse, and there are others you need. Seeing how fast Snowflake was growing, I wanted to build the next layer of infrastructure. </p><p>I joined Snowflake seven and a half years ago. I&#8217;ve had three key roles. First, I built areas of the product: the UI, billing, product-led growth engines and free trial infrastructure, and application capabilities for connecting into and building on Snowflake. After Sridhar became CEO, he asked me to reconnect product and sales by leading solutions engineering, reporting to the CRO. Leading a global technical seller org was very different for a product person, but it helped align teams at scale. 
</p><p>About eight months ago, I returned to lead data engineering: how people bring data into Snowflake, how they transform it&#8212;spending a lot of time with dbt&#8212;and work around Iceberg and interoperability for worlds where not all data sits in Snowflake.</p><h3>I didn&#8217;t realize the path started in investing. Are you a finance person way back?</h3><p>My undergrad is in computer science. I started programming in fifth grade on an Apple IIe, learned C before high school, and followed that thread. In college I noticed business folks often made the decisions. I wanted to learn that side. After college I joined a consulting firm, then private equity, then an MBA. I realized I didn&#8217;t want to be a finance person. I moved to venture as a bridge to building products, but I wanted to build, so I jumped into operating roles.</p><h3>Tell the story of Snowflake and AI. In the 2010s there was huge demand for easier, scalable, cloud-oriented data solutions. Then 2022 happened, ChatGPT launched, and the world changed. How did Snowflake respond, and where are you today?</h3><p>Even pre&#8209;2022 we saw customers putting their most important business data into Snowflake, then pulling data out for things they couldn&#8217;t do inside: training ML models and other analyses that SQL wasn&#8217;t a great fit for. Customers told us they didn&#8217;t like losing governance and lineage when data left. We invested in ways to bring more of that work to Snowflake. </p><p>Snowpark was the first big step: a runtime for non&#8209;SQL code (Python, Java, Scala) with APIs inspired by Spark, plus capabilities like forecasting. It&#8217;s great for some workloads, but most customers don&#8217;t train most ML models inside Snowflake yet. We also acquired Applica for document extraction using early LLM techniques, and Neeva for web search based on LLM approaches. </p><p>When ChatGPT arrived, we saw two major influences. 
First, people wanted to chat with data they&#8217;d brought into Snowflake and transformed with dbt. That&#8217;s hard because LLMs are great with unstructured data and less great at turning business questions into correct SQL. Second, LLMs are very good at writing code, including Python and even dbt code. They&#8217;re not perfect for data engineering code yet, but they help. </p><p>Our goal is to help customers activate important enterprise data safely in AI models, deploy agents at scale under existing governance, and keep up with exploding data volumes without 10x headcount.</p><h3>What are the key product pieces&#8212;Cortex, Snowflake Intelligence, etc.&#8212;in the Snowflake AI stack?</h3><p>First, you need a great data foundation. That isn&#8217;t new: get the data in one place, apply good governance and permissions, know your data, tag PII, and raise the standard of care. </p><p>AI raises the bar because agents can expose sensitive data faster than dashboards. OSI (Open Semantic Interchange) work is part of this; LLMs need explicit semantics and cataloging they can consume, not tacit knowledge hidden in downstream tools. </p><p>Companies with strong hygiene move faster with AI. Roles matter; if a product manager role has access to certain rows and columns, an agent acting within that role can safely answer questions. Agents can run inside or outside Snowflake, but should assume appropriate roles when querying.</p><p>On the AI stack, after the data foundation, Cortex provides higher&#8209;level APIs for unstructured processing, RAG, and structured processing. You can choose models (OpenAI, Anthropic, Mistral, Gemini, Llama, etc.), but most folks don&#8217;t want to manage prompts and GPUs. Cortex AI SQL lets you express intent like sentiment filters or fuzzy joins. It&#8217;s powerful for exploration but non&#8209;deterministic, so you need care in production. 
Costs map to tokens at higher abstractions, with budgets and guardrails similar to variable compute in the cloud.</p><p>At the top, Snowflake Intelligence is a UI and agent framework. You define agents with access to specific datasets and semantic models, plus gold queries and usage guidance. It looks like a chat interface over your governed data. Inside Snowflake, we&#8217;ve deployed a GTM assistant that blends product usage, Salesforce, notes, docs, and content&#8212;structured and unstructured&#8212;respecting row&#8209;level security for every seller while giving leaders broader access.</p><h3>Let&#8217;s talk open formats and Iceberg. Why lean in when it opens up the data?</h3><p>Our aim isn&#8217;t to lock up data, it&#8217;s to help customers get value. Snowflake began as a reaction to Hadoop&#8212;betting on SQL at cloud scale with our own formats and catalog because they didn&#8217;t exist then. Those proprietary pieces let us evolve quickly. Iceberg is now almost as good, and we&#8217;re contributing to make it better. </p><p>Openness is a win for customers and expands the universe of data Snowflake can query, run Cortex on, and power Intelligence with. The tradeoff is standards move slower. Variant type support is a good example&#8212;we contributed our approach and shepherded it into the v3 spec. </p><p>Next up, the community is wrestling with fine&#8209;grained access control beyond table&#8209;level policies. It&#8217;s hard and will take time, but the outcome should be better for everyone.</p><h3>Give us your view on the future of data engineering.</h3><p>Data volume is exploding, including unstructured data that&#8217;s now usable. You can&#8217;t hand&#8209;build every pipeline. Demand is also exploding as agents query more things in more ways. Teams must operate at a higher level: automate, standardize, and reduce bespoke pipelines. </p><p>Expect more shared semantic models across consumers and packaged semantics coming from systems like SAP. 
You&#8217;ll also build data&#8209;engineering agents to do work and monitor pipelines. The role looks more like architect and manager, allocating budgets, deduplicating work, and&#8212;most importantly&#8212;deeply understanding the business. The best data engineers shift from code output to data products, with clear semantics and context.</p><h3>Talk more about context.</h3><p>The day&#8209;to&#8209;day activity shifts, but the output is still data products. Great data products come with instructions, definitions, lineage, quality expectations, and how to get correct answers to common questions. </p><p>We need that context captured where work happens&#8212;models, visualization, quality systems&#8212;and made available everywhere: catalogs, agents, and UIs. As you build, you should also document, and those semantics should flow consistently into tools like Snowflake Intelligence so agents can reason correctly. </p><p>A big part of the challenge is selecting just&#8209;enough context per question.</p><h2>Chapters</h2><ul><li><p>00:01:50 &#8212; Chris&#8217;s path: RelateIQ, Segment, Snowflake</p></li><li><p>00:05:40 &#8212; Roles at Snowflake: product, solutions engineering, data engineering</p></li><li><p>00:09:00 &#8212; Snowflake and AI: foundations before ChatGPT</p></li><li><p>00:11:40 &#8212; Why keep ML and non-SQL work closer to governed data</p></li><li><p>00:13:40 &#8212; Applica and Neeva acquisitions, enterprise search context</p></li><li><p>00:14:50 &#8212; Two big AI influences: chat with data and code generation</p></li><li><p>00:16:50 &#8212; Scaling agents while preserving governance and cost controls</p></li><li><p>00:18:40 &#8212; Why governance must live at the data layer (roles, rows, columns)</p></li><li><p>00:22:00 &#8212; Inside vs. 
outside Snowflake: how agents assume roles</p></li><li><p>00:23:02 &#8212; Cortex: higher-level APIs over many LLMs</p></li><li><p>00:24:06 &#8212; AI SQL: joins/where by intent and the non-determinism tradeoff</p></li><li><p>00:27:40 &#8212; Cost models, tokens, and guardrails</p></li><li><p>00:29:10 &#8212; Snowflake Intelligence: agents over a governed foundation</p></li><li><p>00:32:10 &#8212; Open formats and Iceberg: Why Snowflake leaned in</p></li><li><p>00:36:00 &#8212; Standards tradeoffs: variant type and community progress</p></li><li><p>00:38:40 &#8212; Fine-grained access control for Iceberg: thorny but necessary</p></li><li><p>00:40:40 &#8212; The future of data engineering: scale, unstructured data, agents</p></li><li><p>00:43:20 &#8212; No more bespoke pipelines; standardized models, and semantics</p></li><li><p>00:44:50 &#8212; Data engineers as architects and business partners</p></li><li><p>00:50:00 &#8212; Code vs. context: data products and shared semantics</p></li><li><p>00:53:10 &#8212; Capturing context where work happens (models, viz, quality)</p></li><li><p>00:55:00 &#8212; Selecting just enough context for agent reasoning</p></li><li><p>00:56:30 &#8212; Closing</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a multimodal lakehouse for AI (w/ Chang She)]]></title><description><![CDATA[The CEO of LanceDB and Tristan go deep into the bridge between analytics and AI engineering]]></description><link>https://roundup.getdbt.com/p/building-a-multimodal-lakehouse-for</link><guid isPermaLink="false">https://roundup.getdbt.com/p/building-a-multimodal-lakehouse-for</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 23 Nov 2025 14:03:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/R5RW3LZIAO8" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back to The Analytics Engineering Podcast! Last season, we explored a host of topics on the developer experience (<a href="https://www.youtube.com/watch?v=WidQLYon2_I&amp;t=5s">something the dbt Labs crew has been pretty vocal on recently</a>). This season, we&#8217;re expanding that theme to look at how the current data landscape is impacting the developer experience. 
<a href="https://www.getdbt.com/blog/what-is-open-data-infrastructure">Open data infrastructure</a> is on the rise; AI is pushing teams to rethink how data is modeled, governed, and scaled; and the developer experience is evolving.</p><p>In this episode, Tristan Handy sits down with Chang She&#8212;a co-creator of pandas and now CEO of LanceDB&#8212;to explore the convergence of analytics and AI engineering.</p><p>The team at LanceDB is rebuilding the data lake from the ground up with AI as a first principle, starting with a new AI-native file format called Lance and building upward from there.</p><p>Tristan traces Chang&#8217;s journey from one of the original contributors to the pandas library to building a new infrastructure layer for AI-native data. Learn why vector databases alone aren&#8217;t enough, why agents require new architecture, and how LanceDB is building an AI lakehouse for the future.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://docs.getdbt.com/docs/install-dbt-extension&quot;,&quot;text&quot;:&quot;Check out the dbt VS Code extension&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://docs.getdbt.com/docs/install-dbt-extension"><span>Check out the dbt VS Code extension</span></a></p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" 
src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-R5RW3LZIAO8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;R5RW3LZIAO8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/R5RW3LZIAO8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: You&#8217;re the founder and creator of the Lance file format and LanceDB. Before diving into vector search and vector databases, tell us about your background. </h3><p><strong>Chang She:</strong> I love talking to analytics engineers because that&#8217;s my background. I started about 20 years ago in quantitative finance. As a junior analyst, you do a lot of data engineering and analytics, which got me into open-source Python. 
I became one of the co-authors of the pandas library&#8212;initially to solve my own problem of not wanting to do analytics engineering in Java or VBScript.</p><h3>You worked for a hedge fund?</h3><p>Yes, AQR.</p><h3>Did they know you were contributing to pandas? Hedge funds aren&#8217;t known for open source.</h3><p>My roommate and colleague at the time was Wes McKinney. He showed me a proprietary Python library he was working on. It was life-changing. I started using and contributing. He spent about six months convincing the fund to open-source it. This was around 2010, and they were ahead of the industry in that respect.</p><h3> I didn&#8217;t know pandas started at AQR. That&#8217;s fascinating. So much of your circa-2010 analytics work was done in early pandas?</h3><p>Exactly. We went through several iterations, even debated the name. Because it was a hedge fund, there was a lot of econometrics and &#8220;panel data,&#8221; so Wes named it &#8220;pandas&#8221; for panel data analysis.</p><h3>That origin story isn&#8217;t widely known. You then founded two companies, sold one to Cloudera, and were there during an interesting time.</h3><p>Wes and I created DataPad&#8212;cloud BI before cloud BI really took off&#8212;and sold it to Cloudera. I spent about four and a half years in the Hadoop &#8220;big data&#8221; world, where I met my co-founder. He worked on HDFS at Cloudera, and several ex-Cloudera folks are at LanceDB today. After that I moved into machine learning at Tubi TV, working on recommender systems, ML serving, and experimentation/AB testing. That exposed me to embeddings. We dealt with videos, poster art images, and synopses&#8212;data that doesn&#8217;t fit neatly into pandas or even Spark data frames. That inspired me to build better infrastructure for these data types&#8212;what we now call &#8220;classical&#8221; machine learning&#8212;which led to LanceDB.</p><h3>So that&#8217;s our bridge to vectors. 
You experienced these problems at Tubi, then founded the company. And Tubi used dbt?</h3><p>Heavily. Thank you for creating it&#8212;it was critical to our stack.</p><h3>Give us a non-technical intro: what are vectors used for?</h3><p>Many people focus on the latest models and techniques. My perspective: everyone has access to similar models&#8212;your differentiation comes from your data and how effectively you connect data to AI. Vectors are a way to represent any kind of data in a form models understand: high-dimensional arrays of floating-point numbers&#8212;1,500, 3,000 dimensions, etc. Early statistical models might have a few interpretable dimensions; now you can have thousands where individual dimensions aren&#8217;t necessarily interpretable, but the space captures semantics.</p><p>Beyond RAG, vectors power internal model representations, recommender systems, and personalization&#8212;the original mainstream use case.</p><h3>Search is also a good use case. How is vector search different from full-text search or Command-F?</h3><p>Full-text search (e.g., Elasticsearch) returns documents containing the exact terms you searched. If you search for &#8220;customer,&#8221; it finds &#8220;customer/customers,&#8221; but might miss &#8220;user,&#8221; &#8220;adopter,&#8221; &#8220;organization,&#8221; etc. Vector search uses dense representations where semantically similar words and documents live near each other in high-dimensional space. Search for &#8220;customer,&#8221; and you get results that include semantically related terms.</p><h3>Would you combine vector and full-text search?</h3><p>Yes&#8212;hybrid search. Early RAG demos often used pure vector search for speed. Now enterprises need production-grade relevance. Many combine keyword and vector search with a re-ranking step to reach higher precision/recall.</p><h3>Early RAG pipelines often chunk text, embed, and call it done. 
But more thoughtful pipelines do something closer to feature engineering, right?</h3><p>Absolutely. Thought goes into what you feed the embedding model. For example: add a document- or section-level summary alongside each chunk before embedding; include multimodal features&#8212;artistic descriptions, literal captions, tags; create multiple embedding columns (e.g., different prompts/modalities) and search across them with re-ranking. High-quality retrieval requires feature-engineering-like decisions before embedding.</p><h3>Let&#8217;s talk vector file formats (Lance) and vector databases (LanceDB). My crude belief: a vector database is a standard database with additional indexes. True?</h3><p>Not wrong, but my hot take: with Lance and LanceDB, we&#8217;re building a lakehouse for multimodal data that includes vectors. Many &#8220;vector databases&#8221; are optimized only for vectors and struggle with other data types and workloads. The category needs to evolve&#8212;either toward new-generation search engines or new-generation lakehouses. We set out from day one to build the broader lakehouse, not just a vector index.</p><h3>Outline your AI-enabled data lake vision. I&#8217;m familiar with Snowflake and Databricks&#8217; lakehouse. How do you see the world differently?</h3><p>We assumed everyone would use Parquet and tried for months to support AI workloads&#8212;search, training, preprocessing&#8212;on it. We couldn&#8217;t make it work well. Talking to computer-vision and ML practitioners, no one had something effective. That gave us confidence to build a new format.</p><p>In AI you manage vectors, long documents, images, and videos. The first problem is storage. With Parquet, mixing wide blob columns with narrow metadata columns leads to out-of-memory issues due to row-group design. If you shrink row groups to fit blobs, read performance tanks.</p><p>Even once data is in Parquet, AI needs random access and secondary indexes. 
Parquet doesn&#8217;t support efficient random row access: retrieving scattered rows forces reading entire row groups. With media, that&#8217;s prohibitively expensive&#8212;both for search and for training (e.g., global shuffle). Data evolution is also hard: with table formats like Iceberg, backfills often mean copying entire datasets. Copying petabytes of media is a non-starter. These issues motivated Lance.</p><h3>I have a good mental model of Parquet with structured data. With images or video, do you put them in blob columns?</h3><p>Yes. We use Apache Arrow types. Images/audio/video are large binary columns. Vectors are fixed-width list columns (e.g., 1,536-dimensional). But Parquet&#8217;s row-group mechanics and lack of random access make these workloads painful.</p><h3>So Lance was the first thing you built. It has solid traction on GitHub. Who uses a file format&#8212;users or vendors?</h3><p>Both. Frontier labs use Lance to store training data&#8212;e.g., for image/video generation&#8212;replacing stacks like TFRecords, WebDataset, Parquet, and BigQuery. Large tech companies and vendors also build on Lance: Databricks, Tencent, Alibaba, Netflix, NVIDIA, Uber, among others.</p><h3>Databricks uses Lance?</h3><p>For parts of their AI-specific offerings.</p><h3>You&#8217;ve raised several rounds&#8212;the format is Apache-2 licensed. How do you commercialize?</h3><p>Our commercial offering is a data platform for large-scale AI production: vector search, data preprocessing, training/serving cache, and an analytics engine for curation and exploration. It supports ML training workflows and AI application development, solving the hard distributed-systems problems along the path. We partner closely with big vendors; we&#8217;re generally not competitive because goals and customer bases differ. 
Cloud providers seek platform consumption; we focus on an AI-optimized data platform for specific workloads and users.</p><h3>The commercial product is called LanceDB, but you prefer to position it not just as a database.</h3><p>Right&#8212;we&#8217;re an AI-native data platform/lakehouse for multimodal data, with Lance as the common format.</p><h3>How does this space play out over the next two to three years?</h3><p>Two big predictions. First, multimodal will be 100&#215; bigger&#8212;more usage and more data. Audio is exploding; video generation is resurging; robotics is next. Second, our data infrastructure isn&#8217;t ready for agents driving search and retrieval.</p><h3>Let&#8217;s unpack both. On multimodal: unlike structured analytics, where every company needs it, multimodal workloads seem concentrated. Do all enterprises really need this?</h3><p>I think every enterprise becomes multimodal. Take insurance: tons of documents to digitize, extract, search, and analyze; drones capturing images/video to assess risk and improvements over time. Existing businesses become more efficient; AI-native entrants gain structural advantages. Multimodal data underpins both.</p><h3>It&#8217;s a heavy lift. Will every Fortune 500 insurer build these capabilities in-house, or will vendors package them?</h3><p>Likely both&#8212;just like analytics engineering emerged as a role, with adjacent talent re-skilling. We see the same with AI engineering.</p><h3>What titles are hands-on with your product?</h3><p>AI researchers and AI engineers. Many app developers building AI features now carry the &#8220;AI engineer&#8221; title.</p><h3>On agents: how do their access patterns change platform requirements?</h3><p>RAG was one-shot: ask, retrieve, answer. Agents iterate: they decompose problems into sub-questions, refine queries and results, and run many steps in parallel. Load skyrockets&#8212;humans type slowly; agents can issue hundreds of queries simultaneously. 
Queries are more varied and selective, and agents are creative in combining modalities and sources: schemas, SQL over structured data, prior analyses and charts, document stores, image/video metadata, etc.</p><p>Traditional vector databases aren&#8217;t designed for this breadth and scale. If you bolt together multiple specialized systems, your &#8220;agent stack&#8221; balloons into a maintenance nightmare. Our approach: put all data in one place with a single system that supports vector search, keyword search, filters, key-value lookups, re-ranking, analytics, and efficient random access&#8212;on top of an AI-native file format (Lance).</p><h3>For listeners whose curiosity is piqued, any resources you recommend?</h3><p><strong>Chang She:</strong> Yes&#8212;our blog series by Weston Pace, the tech lead for Lance format. It dives into encodings, I/O, and has great reads for analytics engineers: <a href="http://lancedb.com/blog">lancedb.com/blog</a> .</p><h2>Chapters</h2><ul><li><p>00:00 &#8211; Intro: Analytics meets AI</p></li><li><p>03:20 &#8211; Chang&#8217;s background and how Pandas began</p></li><li><p>06:40 &#8211; Lessons from Cloudera and metadata</p></li><li><p>08:30 &#8211; Multimodal data and LanceDB&#8217;s origin story</p></li><li><p>10:00 &#8211; Why vector search matters (beyond RAG)</p></li><li><p>12:00 &#8211; What are vectors and why do we use them?</p></li><li><p>15:00 &#8211; Full-text vs vector search</p></li><li><p>18:00 &#8211; Feature engineering in AI use cases</p></li><li><p>21:15 &#8211; Lance format</p></li><li><p>28:00 &#8211; Storage, scale, and the problem with Parquet</p></li><li><p>35:30 &#8211; Building a business on open source</p></li><li><p>41:00 &#8211; Two big bets: multimodal data and agents</p></li><li><p>46:00 &#8211; Every company will become multimodal</p></li><li><p>50:00 &#8211; Agent access patterns will redefine data</p></li><li><p>54:00 &#8211; Why dbt-style workflows matter now more than 
ever</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Agentic coding in analytics engineering (w/ Mikkel Dengsøe)]]></title><description><![CDATA[The cofounder of SYNQ discusses his tests (and tips) with agentic coding tools]]></description><link>https://roundup.getdbt.com/p/agentic-coding-in-analytics-engineering</link><guid isPermaLink="false">https://roundup.getdbt.com/p/agentic-coding-in-analytics-engineering</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 07 Sep 2025 12:01:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/555761fd-daa8-47e7-a907-79541a9e3860_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/171688472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 
424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>What does agentic coding look like in analytics engineering? Mikkel Dengs&#248;e, co-founder at SYNQ, recently <a href="https://medium.com/@mikldd/using-ai-for-data-modeling-in-dbt-975838054cb1">wrote</a> a <a href="https://medium.com/@mikldd/using-ai-to-build-a-robust-testing-framework-4e034dfd014f">series</a> of <a href="https://medium.com/@mikldd/using-omnis-ai-assistant-on-the-semantic-layer-0572f997451d">posts</a> on his experiences as an analytics engineer with agentic coding tools. In this episode of The Analytics Engineering Podcast, he walks through a hands-on project using Cursor, the <a href="https://www.getdbt.com/product/fusion">dbt Fusion engine</a>, the <a href="https://www.getdbt.com/blog/mcp">dbt MCP server</a>, Omni&#8217;s AI assistant, and Snowflake.</p><p>Tristan and Mikkel cover where agents shine (staging, unit tests, lineage-aware checks), where they&#8217;re risky (BI chat for non-experts), and how observability is shifting from dashboards to root-cause explanations delivered to the right person at the right time. 
Along the way: practical prompts, why &#8220;one model at a time&#8221; keeps you in control, and a testing philosophy that avoids alert fatigue while catching what matters.</p><p><strong><a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__">To see real-world use cases of agentic coding and to learn directly from data and AI leaders, join us at Coalesce 2025 in Las Vegas, Oct. 13-16</a></strong>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Can 
you talk a little bit about your background?</h3><p><strong>Mikkel Dengs&#248;e:</strong> Yeah, so I can start from the beginning. I've been in data for, I think it's coming up to 15 years now, and started my career in data at a Danish shipping company, which was very much zero to one. When I came in, there was no data warehouse, and the only way we could know how many containers were shipped was by an IT guy pulling that out of the system every six months. I then spent two years there building up their data warehouse on SQL Server, which was super fun. After that, I spent five years at Google, which was a very different gear.</p><h3>That's a natural transition. Just global shipping company straight to Google.</h3><p>Exactly. And that was very much the hundred-and-beyond end of the spectrum where, in my case, I worked with the ads data and you get a perfectly curated data table that you can work with and everything kind of works. Then after that I joined a company called Monzo. For those who are not familiar, it's a scaling fintech out of the UK and that was very much the one to a hundred. When I joined we were 30 data people, but scaled to a hundred over two years. We had 10,000 dbt models and we built every internal tool under the sun for dbt. Super interesting. And then three and a half years ago I went on to found SYNQ alongside Peter and Steve, which is a data observability platform.</p><h3>Tell us a little bit more about SYNQ.</h3><p>We are a data platform that primarily works with companies that already use tools like dbt but have trouble going from important data to business-critical data. That might be customer-facing dashboards, machine learning models, or something else. They want better monitoring&#8212;we often deploy anomaly monitors&#8212;and they also want workflows such as incident management for when things go wrong. We were founded in 2022; we began by working with scale-ups and startups, and we're now also onboarding enterprises and larger companies.
It's been a fun journey.</p><h3>In your series of blog posts, you went through the modern data stack and said, &#8220;What's the most current version of this tool and how effectively can I AI-ify that?&#8221; Whether that's using Cursor to build dbt models or using the agent experience inside of Omni&#8212;what made you decide to get into this and write about it?</h3><p>The first part of it is just: it's super fun to tinker with these tools and try them out. It's magic. And we were also building an MCP server at SYNQ, so I had a lot of interest in seeing how it works with others and what we can learn. It was also driven by wanting to have better conversations with our customers: when they ask about it, I can speak from the point of view of having actually tried this and seen what works and what doesn't.</p><h3>The early days of using Redshift were such a visceral experience relative to what came before. If I hadn't interacted with it directly, I wouldn't have understood how big a step change cloud data was. This feels like another one of those moments: if you don't have hands-on experience, you're not going to really get it. Fair?</h3><p>Spot on. And I think pretty much every data team should be doing this unless they have a very good reason not to. The risk and the stakes can be pretty low if you use it for internal workflows like data modeling and writing tests. You're still in control. I recommend everybody do it.</p><h3>What tasks did you try to accomplish?</h3><p>It's three different blog posts: the data modeling part, the testing part, and then exposing it in Omni's AI agent where people can ask questions about the data. There's a fourth post: once the data is live, how can you use the SYNQ MCP to do things like root-cause analysis and planning changes. I started with data modeling.
I had raw data from different JSON sources, some XMLs, some profiles&#8212;extracted and put into Snowflake&#8212;and then did the data model.</p><h3>So the data was already loaded into Snowflake?</h3><p>Yeah, exactly. For the data modeling, I started from the sources and then worked through staging, marts, and finally metrics using the semantic layer. Each step looks a little different when you use AI tools because the behavior differs. In terms of tooling, I used Cursor with the dbt-MCP plugged in. If you're not familiar, dbt-MCP lets you, via prompt, interact with dbt tools&#8212;execute <code>dbt build</code>, get models, or get everything upstream of a given model&#8212;so you can chain work without explicitly doing it.</p><h3>Cursor + dbt-MCP. What model did you use?</h3><p>I just used the default in Cursor, which I believe is Claude. There's an important distinction: Cursor is really good at writing code, but it can't execute queries on your behalf. If you want to extract raw data and query Snowflake to get rows out, you have to do that in Claude Desktop. That became key. Early on, as I built models, the first thing I did was get a snapshot of sample data from Snowflake&#8212;10,000 rows of a source. I fed that into Cursor and said, &#8220;These are examples of what this data looks like.&#8221; Using that data, Cursor could model in a clever way. For example, a column called <code>quarter</code> like &#8220;2025 Q1&#8221;&#8212;Cursor understood to translate it into a datetime and do the transformations.</p><h3>I've used the dbt MCP server a decent amount&#8212;less in Cursor, more in Claude Desktop. Your stack was Cursor + Claude models + Claude Desktop. And Cursor cannot directly execute queries in Snowflake, but Claude Desktop can. Is that because there&#8217;s tool use Claude has that Cursor doesn't?</h3><p>I believe so. In Claude Desktop, if you write queries against dbt-MCP, Claude can visualize a graph, show outputs of a SQL statement, etc. 
Cursor, as far as I know, couldn't. My middle ground was to take sample data out of Snowflake, put it into a CSV, and feed that back into Cursor so it could look at raw data.</p><h3>As part of its own context window?</h3><p>Exactly. That was key for my workflow. Then when I wanted to write unit tests, I could use real data examples from the sample. Or when automatically documenting the data, I asked Cursor to specify examples in the docs based on the most common occurrences within a column. Letting Cursor peek at raw data was a core pillar.</p><h3>It's a little hacky, right? Cursor should really be able to interact directly with Snowflake or Databricks to investigate the shape of the data. Agents should be empowered to do that.</h3><p>I would say so. There might be a way I didn&#8217;t know about, but I patched the gaps by uploading into the context window.</p><h3>So that's the state of the art today.</h3><p>Seems so. To be clear, I think the limitation is IDE differences&#8212;Cursor vs. Claude Desktop&#8212;rather than dbt-MCP itself.</p><h3>Once you had sample data in context, did you have to suggest conversions, or did it naturally do them?</h3><p>It got the defaults pretty right, but I guided it on what I wanted from the source data. I wanted control over everything, so I asked it to do one model at a time rather than auto-generate a whole stack. That way I could review each step and stay in control.</p><h3>Your prompt workflow was &#8220;Build me a model with this name that stages the data from this table,&#8221; basically?</h3><p>Yeah. When it proposed code I didn't like, upstream it was usually simple (regex to parse dates, etc.). Downstream, in marts and metrics, I started describing my ideal data product: user jobs-to-be-done and the final output. 
That&#8217;s when Cursor got creative and invented metrics I hadn&#8217;t anticipated&#8212;like &#8220;apartment price relative to time on market.&#8221; I pruned ones I didn&#8217;t want, but some were good surprises.</p><h3>Which layer did it help most?</h3><p>Testing. Modeling was good&#8212;especially staging&#8212;but testing accelerated significantly. SQL is a bit like English; for simple datasets you can express intent easily. Testing can be much harder and more verbose.</p><h3>Roughly how much more effective did you feel?</h3><p>Modeling: multiples faster. It nailed the tedious parts&#8212;regex, casting, pass-throughs&#8212;so staging/intermediate layers flew. In marts/semantic metrics, the benefit was brainstorming. It helped me think of metrics I wouldn't have come up with on my own.</p><h3>Did the dbt Fusion engine help?</h3><p>Yes. Fusion shows lineage and whether a column is pass-through. For example, if a column is pass-through with no transforms, don't add another <code>not_null</code> or <code>unique</code> if there's one upstream. I bounced over to the IDE to check this and codified it as a testing strategy. That's already top-10% testing hygiene.</p><h3>Did any MCP feature requests surface?</h3><p>The more context and tools the agent has, the more it can do. In the fourth post, for root cause analysis, we used the SYNQ MCP. We collect all your Git commits and have history, so the agent could correlate recent code changes with incidents. Requests depend on the job at hand.</p><h3>Let's move to testing&#8212;why was it the most additive?</h3><p>Testing is hard; many teams don't know how to do it and alert fatigue is common. A huge share of tests we see are <code>not_null</code>/<code>unique</code>, which doesn't reflect real data risks. The first thing I did in Cursor for testing was provide our internal testing philosophy as guidelines: test heavily at the source, don't retest pass-through columns, focus on business and metric anomalies in marts. That worked really well.
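Guidelines like these can be sketched in ordinary dbt YAML. A minimal, hypothetical schema.yml (source, model, and column names invented for illustration): heavy checks at the source, nothing repeated on pass-through columns downstream.

```yaml
# Illustrative only — names are hypothetical, not from the project discussed.
# Philosophy: test heavily at the source; don't re-test pass-through columns.
sources:
  - name: raw_listings
    tables:
      - name: listings
        columns:
          - name: listing_id
            tests:
              - not_null
              - unique
models:
  - name: stg_listings
    columns:
      - name: listing_id
        description: >
          Pass-through from the source. Already covered by not_null/unique
          upstream, so no tests are repeated here.
```

With a layout like this, dbt runs the integrity checks once at the source on every build, and the staging layer stays free of redundant assertions.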
For sources and staging, it generated relevant tests. Then for marts, I asked for unit tests and gave it a thousand sample rows from Snowflake. It wrote very relevant unit tests I&#8217;d otherwise spend a lot of time on.</p><h3>Examples?</h3><p>Simple ones like: when you pass a string value in the date column, does it transform correctly to datetime and match the expected format? These just worked. Then at the metric level, it looked at raw data and proposed assumptions&#8212;like square-meter price should be between X and Y&#8212;sometimes segmenting by postcode. Very thoughtful, though I'd replace static thresholds with anomaly monitors so they don't go stale as prices move.</p><h3>So at least 5&#215; on testing?</h3><p>At least. Apart from swapping static thresholds for anomaly detection, it nailed testing and did so in a lineage-aware, layer-appropriate way.</p><h3>Tell me about the BI layer.</h3><p>Many teams start at the BI layer with a chat interface. I think that's risky because it's used by business users and you only get so many chances before trust drops. I moved into Omni. You create a &#8220;topic&#8221; (a data model you can join with others) and then specify an AI context: instructions for how the LLM should behave. For example: if a user asks about price, always return square-meter price; never make up fields not present in the mart; if asked about provenance, mention the source. Writing AI context is a new skill for our industry.</p><h3>Were you using Omni&#8217;s AI assistant to create assets faster, or to let users self-serve?</h3><p>The latter&#8212;so users could ask questions instead of going to a dashboard. It could have been any BI tool with similar functionality; we just use Omni internally.</p><h3>And how was the experience as a consumer?</h3><p>Amazing when it works, but I'd hesitate to give my VP of Marketing access. It gets things wrong maybe one in five times, and it's not obvious why if you're not a data person. 
For analysts doing exploratory work, it's great&#8212;they can inspect and dig in. I wouldn't replace company-wide dashboards with a chat bot yet. Omni does log freeform queries and feedback, so there's a path to iterate the AI context over time.</p><h3>The last thing you did was use AI plus SYNQ to monitor production infrastructure. What does observability look like in the future? Historically it's looked like dashboards&#8212;Datadog for data pipelines. Is it just more effective monitors, or fundamentally different?</h3><p>Fundamentally different. We&#8217;re heading to a place where observability tools can tell you what's wrong at the right time, with just the right context, delivered to the right person&#8212;inside or outside the data team. Done well, there may be few dashboards; instead you get an LLM-summarized root cause delivered from a monitor that might be auto-created. Less &#8220;active tool you poke at,&#8221; more &#8220;proactive explanation.&#8221;</p><h3>Still technical observability (pipelines/data issues), or business observability?</h3><p>More the former. Teams at the edges&#8212;Sales Ops managing Salesforce, engineering teams creating web events&#8212;often need to be notified about data issues. Business KPI movements require a different experience for marketers, etc.</p><h3>Automated remediation?</h3><p>Gradual. You can imagine an issue occurs without a dedicated test; the system proposes a new test. But 80% of issues come from root systems elsewhere (someone typing in Salesforce), and closing that loop is still hard. In the article&#8217;s fourth part, we had a data issue and I asked the SYNQ MCP through Claude Desktop to do root cause analysis. It walked the same steps a data person would: inspect the model, check errors, examine lineage and upstreams, review recent commits, and documented each step to the root cause. That works now.</p><h3>At the beginning you said there&#8217;s no good reason not to use these tools today. 
What reasons do you hear for not trying?</h3><p>People are busy. But if you look at a risk curve, lowest risk is modeling and testing&#8212;you're in the driver's seat and it boosts productivity. Higher risk is replacing your BI tool with a chat bot; higher still is customer-facing experiences. The first two are hard to argue against.</p><h3>Enterprise IT approvals might be one blocker&#8212;approved models, data access, etc.</h3><p>True. For example, our MCP can query raw data to detect if an issue happens in a segment, and enterprises might hesitate there. Also, &#8220;MCP&#8221; as a term can be confusing. But it's actually simple and explainable, not a black box. Setting up dbt-MCP can still feel hacky in enterprises; if it lived natively in cloud environments, it&#8217;d be easier to adopt.</p><h3>You can set it up locally&#8212;no permissions/procurement&#8212;and just play. We also shipped the MCP server as a remote MCP in cloud, though that introduces auth/permissions considerations.</h3><p>If I had to pick a persona, it's the analyst. Analysts have had a tough decade: more tools, harder workflows, less time to tinker. MCPs and AI workflows are a turning point. At Monzo, we had a philosophy that you should be able to have an idea on your commute and have it implemented by midday. As we grew to 10,000 dbt models and long CI checks, that faded. I can see a world where this returns. MCPs can help. I'm excited.</p><h3>I love that. Analytics engineers think &#8220;infrastructure, correctness.&#8221; Analysts think &#8220;idea to validation fast.&#8221; Excel was always the analyst&#8217;s best friend because it's fast and flexible. MCPs make it easy to plug tools together and get answers quickly again.</h3><p>One company we work with&#8212;Voi, a scooter company out of Sweden&#8212;has a strong data leader, Magnus, who is very bought into metrics. Their data team doesn't produce dashboards; they produce metrics. 
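In dbt's semantic layer, &#8220;the product is a metric&#8221; can be made concrete. A rough, hypothetical sketch (the model, entity, dimension, and measure names are invented for illustration):

```yaml
# Illustrative sketch — names are hypothetical.
semantic_models:
  - name: rides
    model: ref('fct_rides')
    defaults:
      agg_time_dimension: ride_date
    entities:
      - name: ride_id
        type: primary
    dimensions:
      - name: ride_date
        type: time
        type_params:
          time_granularity: day
    measures:
      - name: ride_count
        agg: count
        expr: ride_id

metrics:
  - name: daily_rides
    label: Daily rides
    type: simple
    type_params:
      measure: ride_count
```

Consumers&#8212;dashboards, chat interfaces, MCP clients&#8212;then query daily_rides rather than re-deriving the number themselves.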
In an AI world with MCPs, flows, and curves, that's a clear decision.</p><h3>I believe there's no such thing as the wrong BI tool&#8212;different tools have different trade-offs. Probably true for models/IDEs too: Claude Desktop vs. Claude Code vs. Cursor&#8212;no single &#8220;right answer&#8221; as long as the underlying context and metric definitions are shared.</h3><p>Agreed. What really matters across workflows: consistent metric definitions, documentation for columns and fields, and high-quality data. Those foundations matter even more when an LLM is in the loop; you may not have a human sanity-checking every result.</p><h2>Chapters</h2><ul><li><p><strong>00:00</strong> &#8212; Tristan&#8217;s intro</p></li><li><p><strong>01:10</strong> &#8212; Mikkel&#8217;s background: shipping &#8594; Google &#8594; Monzo &#8594; SYNQ</p></li><li><p><strong>03:08</strong> &#8212; What SYNQ does (data observability for business-critical data)</p></li><li><p><strong>04:15</strong> &#8212; Running the experiment</p></li><li><p><strong>06:23</strong> &#8212; Scope: modeling, testing, BI agent, observability</p></li><li><p><strong>07:17</strong> &#8212; Tooling: Cursor + dbt MCP server + Snowflake + Omni</p></li><li><p><strong>09:38</strong> &#8212; Sampling real data into the agent&#8217;s context</p></li><li><p><strong>13:14</strong> &#8212; Modeling workflow: one model at a time</p></li><li><p><strong>15:14</strong> &#8212; Where agents help most: testing &gt; modeling</p></li><li><p><strong>18:10</strong> &#8212; dbt Fusion engine: lineage-aware checks, fewer redundant tests</p></li><li><p><strong>19:50</strong> &#8212; Feature requests and root-cause via commit history</p></li><li><p><strong>20:57</strong> &#8212; Testing philosophy: source-heavy, pass-through aware, metric-level</p></li><li><p><strong>22:49</strong> &#8212; Unit tests from samples; thresholds vs anomaly monitors</p></li><li><p><strong>25:10</strong> &#8212; BI agents: great for analysts, risky for broad 
rollout</p></li><li><p><strong>31:54</strong> &#8212; The future of observability: explain first, dashboards second</p></li><li><p><strong>36:10</strong> &#8212; Adoption curve: safe places to start</p></li><li><p><strong>40:49</strong> &#8212; Analyst superpowers return</p></li><li><p><strong>42:04</strong> &#8212; Metrics over dashboards</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Under the hood of Apache Iceberg (w/ Christian Thiel)]]></title><description><![CDATA[The cofounder of Lakekeeper walks Tristan through the state of the Iceberg ecosystem]]></description><link>https://roundup.getdbt.com/p/under-the-hood-of-apache-iceberg</link><guid isPermaLink="false">https://roundup.getdbt.com/p/under-the-hood-of-apache-iceberg</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 24 Aug 2025 13:03:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/431e88b6-61e9-4287-8806-61a6027eb357_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/171688472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 
424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you're a data practitioner, you likely understand Iceberg as a user, why it's important, and how it's changing the way that we build data systems. But you may not know a lot about what's going on beneath the surface.</p><p>There are multiple ways to interface with Iceberg catalogs and multiple versions of the Iceberg REST spec. There are several leading catalogs that implement that spec. All of this sits in an ecosystem that includes companies of all sizes, in proprietary and open-source code, and in academic and commercial contexts.</p><p>In a few years, all this ambiguity will be behind us, but right now it's very much evolving in real time. To get an update on the status of the Iceberg ecosystem and to walk through all the developments, Tristan talks with Christian Thiel.
Christian is one of the lead architects of Lakekeeper, one of the most widely used Iceberg catalogs.</p><p><strong><a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__">To learn more from some of the leaders in the Iceberg ecosystem, join us at Coalesce 2025 in Las Vegas, Oct. 13-16</a></strong>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h2>Walk us through your background</h2><p><strong>Christian Thiel:</strong> I started in natural language
processing, then moved into machine learning applications in manufacturing. Like many people, I realized that the biggest barrier wasn&#8217;t the algorithms but the data&#8212;its availability, quality, and accessibility. That led me deeper into data architecture and engineering, eventually to building Lakekeeper.</p><h2>What is Lakekeeper, and what are you building now?</h2><p>Lakekeeper is an Iceberg catalog implementation&#8212;a technical requirement for building distributed, composable analytic systems based on Apache Iceberg. But our vision goes beyond that. We see the future in data collaboration and reliable sharing of data, supported by clear contracts.</p><h2>For listeners new to Iceberg, what makes it so important?</h2><p>Iceberg allows organizations to store data once, in an open format, and then use the compute engine best suited for each workload. It&#8217;s a foundation for building modern, composable data platforms while avoiding vendor lock-in. If there&#8217;s one thing that should be open, it&#8217;s the data at the center of your platform.</p><h2>Some folks might say this sounds like Hadoop all over again&#8212;lots of open standards that are hard to integrate. Why is this time different?</h2><p>The ecosystem has matured. Even big vendors like Snowflake and Databricks are embracing Iceberg, which shows there&#8217;s a strong shift toward openness. Plus, the tooling and infrastructure are much easier to deploy today. A modern Iceberg setup is far less complex than a Hadoop environment used to be.</p><h2>Let&#8217;s talk about what&#8217;s happening under the hood. How does Iceberg work?</h2><p>Iceberg organizes data using a metadata hierarchy. At the top, there&#8217;s a JSON file that stores high-level table information: snapshots, schema, and locations. Below that are manifests and other layers that keep track of files. 
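Concretely, the top-level JSON file described here looks roughly like the following&#8212;a heavily trimmed, illustrative sketch with invented values (real v2 metadata files also carry schemas, partition specs, sort orders, and more):

```json
{
  "format-version": 2,
  "table-uuid": "11111111-2222-3333-4444-555555555555",
  "location": "s3://warehouse/db/listings",
  "current-schema-id": 0,
  "current-snapshot-id": 3051729675574597004,
  "snapshots": [
    {
      "snapshot-id": 3051729675574597004,
      "timestamp-ms": 1712345678901,
      "manifest-list": "s3://warehouse/db/listings/metadata/snap-3051729675574597004.avro"
    }
  ]
}
```

Each snapshot points at a manifest list, which in turn points at manifests tracking the actual data files&#8212;the layering that enables time travel and atomic swaps.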
This hierarchy is what makes things like time travel, atomic transactions, and schema evolution possible.</p><h2>What about ongoing maintenance?</h2><p>There are two key tasks. First, expiring old snapshots so you don&#8217;t accumulate unnecessary files. Second, compaction&#8212;combining many small files into larger ones.</p><h2>Catalogs are another critical piece. What role do they play?</h2><p>Catalogs manage the top layer of metadata and coordinate transactions. They make atomic updates possible, allow multiple writers, and handle governance&#8212;things like access control and multi-table transactions.</p><h2>How enterprise-ready is Iceberg today?</h2><p>Very ready. A year ago, there were still gaps, but today, performance and feature parity with native tables on platforms like Snowflake and BigQuery are strong. Governance and authorization models are still evolving, and different catalogs implement them differently, but the core functionality is there.</p><h2>Speaking of catalogs, how should someone pick between options like Lakekeeper, Polaris, Unity, AWS Glue, or Gravitino?</h2><p><strong>Christian Thiel:</strong> It depends on priorities. Lakekeeper focuses on performance, extensibility, and ease of use. Polaris is developer-focused but less user-friendly. Unity is tightly integrated into Databricks. Glue now supports the Iceberg REST spec, which makes it more interoperable than before. Gravitino is another option aimed at enterprise-scale environments.</p><h2>Recently, DuckDB announced DuckLake. What&#8217;s your take on that?</h2><p>It&#8217;s interesting, but there are two concerns. First, it uses a database schema directly for the catalog, which creates interoperability issues&#8212;similar to the early JDBC catalog in Iceberg that the community eventually moved away from.
Second, it was built without community involvement, and openness without adoption isn&#8217;t really openness.</p><p>That said, for heavy DuckDB users, it could offer optimizations that make queries extremely fast, and if the broader ecosystem adopts it, it could become a viable open format.</p><h2>What&#8217;s next for Lakekeeper?</h2><p>We&#8217;re continuing to invest in table optimization, enterprise features, and data collaboration tools. Our vision is what we call the &#8220;unbreakable lakehouse,&#8221; where contracts and collaboration guardrails make shared data more reliable. Long-term, we see Lakekeeper as enabling truly collaborative, open data ecosystems.</p><h2>Chapters</h2><ul><li><p><strong>00:00 &#8211; Introduction</strong></p><p>Tristan Handy introduces the episode and the focus on Apache Iceberg.</p></li><li><p><strong>01:40 &#8211; Christian Thiel&#8217;s background</strong></p><p>From natural language processing to data engineering.</p></li><li><p><strong>04:30 &#8211; Introduction to Lakekeeper</strong></p><p>What Lakekeeper is and its role in the Iceberg ecosystem.</p></li><li><p><strong>06:00 &#8211; Why Iceberg matters</strong></p><p>How open table formats enable flexibility and reduce vendor lock-in.</p></li><li><p><strong>11:40 &#8211; How Iceberg works under the hood</strong></p><p>Metadata hierarchy, catalogs, and how state is managed.</p></li><li><p><strong>21:30 &#8211; Maintenance and optimization</strong></p><p>Snapshot expiration, compaction, and keeping tables performant.</p></li><li><p><strong>24:20 &#8211; Catalogs and governance</strong></p><p>Access control, multi-table transactions, and security.</p></li><li><p><strong>31:40 &#8211; Enterprise readiness</strong></p><p>How Iceberg is evolving for production use in large organizations.</p></li><li><p><strong>42:10 &#8211; Choosing the right catalog</strong></p><p>Overview of Lakekeeper, Polaris, Unity, Glue, and Gravitino.</p></li><li><p><strong>47:20 &#8211; DuckLake
discussion</strong></p><p>Pros, cons, and ecosystem adoption challenges.</p></li><li><p><strong>52:00 &#8211; The future of Lakekeeper</strong></p><p>Data contracts, collaboration, and building the &#8220;unbreakable lakehouse.&#8221;</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The pragmatic guide to AI agents in the enterprise (w/ Sean Falconer) ]]></title><description><![CDATA[Demystifying AI agents with Confluent's senior director of AI strategy]]></description><link>https://roundup.getdbt.com/p/the-pragmatic-guide-to-ai-agents</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-pragmatic-guide-to-ai-agents</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 03 Aug 2025 13:02:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/16b80e19-0489-465c-8dba-64088edba31f_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What does it mean to be agentic? Is there a spectrum of agency? </p><p>In this episode of The Analytics Engineering Podcast, Tristan Handy talks to Sean Falconer, senior director of AI strategy at Confluent, about AI agents. They discuss what truly makes software "agentic," where agents are successfully being deployed, and how to conceptualize and build agents within enterprise infrastructure. </p><p>Sean shares practical ideas about the changing trends in AI, the role of basic models, and why agents may be better for businesses than for consumers. 
This episode will give you a clear, practical picture of how AI agents can transform businesses, rather than treating them as a vague marketing buzzword.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3><strong>Sean, can you give us the TLDR on your career and what you're working on today?</strong></h3><p><strong>Sean Falconer: </strong>I've always worked at the intersection of data, engineering, and AI. From academia studying computer science, into industry as a founder, then to Google, I worked on conversational systems and privacy/security in AI.
Currently, at Confluent, I'm leading our AI product strategy, balancing both technical and go-to-market roles.</p><h3><strong>You moved from being deeply technical into marketing and sales. What drove that transition?</strong></h3><p>I was forced into it as a founder. Initially uncomfortable, but it taught me huge respect for marketing and sales. I had to learn by making many mistakes, eventually building out entire marketing and sales functions. I realized how challenging and critical these roles are.</p><h3><strong>You were at Google before ChatGPT launched. Did you foresee the transformative nature of these technologies?</strong></h3><p>Honestly, no. Having seen earlier disappointments in conversational AI (like Microsoft's Alice), I was skeptical initially, even as ChatGPT emerged. It wasn&#8217;t obvious we'd soon experience this revolution.</p><h3><strong>You&#8217;ve written about three waves of AI. Can you describe these?</strong></h3><p>Yes. Wave one was predictive AI, traditional ML models trained for specific tasks like fraud or spam detection&#8212;effective but rigid. Wave two introduced generative AI, or foundation models, trained on vast general datasets, flexible but lacking specific business context. The third wave, agentic AI, involves AI systems that can reason, dynamically choose tasks, gather information, and perform actions as a more complete software system.</p><h3><strong>Do foundation models replace traditional ML methods?</strong></h3><p>Sometimes they can, but it doesn&#8217;t always make sense. An LLM might do sentiment analysis well enough, but a traditional model may be more efficient and cheaper. Think of using an LLM as cutting steak with a chainsaw&#8212;possible, but unnecessary.</p><h3><strong>Let's clarify "agents." What makes software truly agentic?</strong></h3><p>It&#8217;s software that can dynamically decide its own control flow: choosing tasks, workflows, and gathering context as needed. 
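That idea can be made concrete with a toy loop (the decision function below is a plain stub standing in for an LLM call, and the refund scenario and tool names are invented purely for illustration):

```python
# Toy "agent" whose control flow is decided at runtime by a model call,
# not by a fixed pipeline. stub_model_decide is a stand-in for an LLM;
# the tools and the refund scenario are hypothetical.

def stub_model_decide(state):
    """Pick the next action based on what the state is still missing."""
    if "order" not in state:
        return "lookup_order"
    if "refund_ok" not in state:
        return "check_refund_policy"
    return "finish"

def lookup_order(state):
    # Pretend to call an order service.
    state["order"] = {"order_id": state["request"]["order_id"], "amount": 42.0}

def check_refund_policy(state):
    # Pretend policy: auto-approve refunds under $100.
    state["refund_ok"] = state["order"]["amount"] < 100

TOOLS = {"lookup_order": lookup_order, "check_refund_policy": check_refund_policy}

def run_agent(request, max_steps=5):
    """Loop until the model says 'finish'; the step budget bounds agency."""
    state = {"request": request}
    for _ in range(max_steps):
        action = stub_model_decide(state)
        if action == "finish":
            return state
        TOOLS[action](state)
    raise RuntimeError("step budget exhausted")

result = run_agent({"order_id": "A-17"})
print(result["refund_ok"])  # True
```

The `max_steps` budget is the "limited agency" knob: the model chooses each step, but within hard boundaries.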
Realistically, current enterprise agents have limited agency to ensure reliability. They're mostly workflow automations rather than fully autonomous systems.</p><h3><strong>You mentioned a spectrum of agency. Is this similar to autonomy in self-driving cars?</strong></h3><p>Exactly. Highly autonomous agents are appealing but not practical yet. Most enterprise success stories involve structured workflows with clearly defined boundaries.</p><h3><strong>Why have agents taken off more in enterprises than consumer apps?</strong></h3><p>Enterprises have many well-defined, high-value tasks perfect for automation. Consumer scenarios demanding high agency&#8212;like planning complex trips&#8212;are still too unreliable. Enterprises can benefit significantly even from limited agentic capability.</p><h3><strong>Is an agent just a microservice?</strong></h3><p>In many ways, yes. An agent functions like a microservice with extra capabilities (using LLMs for decisions). Deployment considerations like state management and long-running tasks differ slightly, but fundamentally it&#8217;s similar.</p><h3><strong>What tools and frameworks help build effective agents?</strong></h3><p>Start with frontier models like GPT-4 or Claude. Frameworks include LangChain, Microsoft Autogen, and CrewAI. But for real-world deployment, treat it as rigorous software engineering with observability, scalability, and robustness in mind.</p><h3><strong>Are organizational barriers bigger than technical challenges?</strong></h3><p>Yes. AI efforts are often mistakenly tasked to data science teams rather than cross-functional software teams. Successful companies create dedicated teams blending software engineering skills and data expertise to build reliable agentic systems.</p><h3><strong>What pitfalls should teams avoid?</strong></h3><p>Avoid building monolithic agents. Break systems into smaller, well-defined units in a multi-agent architecture. 
Use event-driven frameworks to avoid rigid, hard-to-maintain dependencies.</p><h2>Chapters</h2><ul><li><p>[00:00] Introduction: What's all the hype about agents?</p></li><li><p>[01:10] Meet Sean Falconer: A journey from engineer to AI strategist</p></li><li><p>[04:10] Learning marketing as an engineer-founder</p></li><li><p>[05:50] Inside Google's AI efforts before ChatGPT</p></li><li><p>[09:00] What does it mean to run AI strategy?</p></li><li><p>[10:45] Three waves of AI: Predictive, Generative, and Agentic</p></li><li><p>[16:30] Will foundation models replace traditional ML?</p></li><li><p>[18:30] Defining agents clearly: Beyond the buzzword</p></li><li><p>[22:00] The spectrum of agency: From controlled workflows to open-ended tasks</p></li><li><p>[25:30] Why agents fit better in enterprises than consumer apps</p></li><li><p>[28:00] Agents as microservices: A practical view</p></li><li><p>[35:00] What tech stack is needed to build effective agents?</p></li><li><p>[37:50] Organizational challenges in adopting agents</p></li><li><p>[39:30] Models that are favorites for developers</p></li><li><p>[43:30] Why software engineers are best placed to build agents</p></li><li><p>[46:00] The technical stumbling blocks in building agents</p></li><li><p>[48:00] Concluding thoughts: Beyond POCs to production agents</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How Amazon S3 works (w/ Andy Warfield)]]></title><description><![CDATA[Go under the hood of Amazon S3 with AWS engineering leader Andy Warfield&#8212;from virtualization to Iceberg]]></description><link>https://roundup.getdbt.com/p/how-amazon-s3-works-w-andy-warfield</link><guid isPermaLink="false">https://roundup.getdbt.com/p/how-amazon-s3-works-w-andy-warfield</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 20 Jul 2025 12:02:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92b37acf-08ac-4dac-b59f-123b21df7011_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this season of the Analytics Engineering podcast, Tristan is digging deep into the world of developer tools and databases. If you're following us here, you've almost definitely used Amazon S3 and its Blob Storage siblings at Microsoft and Google. They form the foundation for nearly all data work in the cloud. In many ways, it was the innovations that happened inside of S3 that have unlocked all of the progress in cloud data over the last decade. </p><p>In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks.
They close out talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Operating systems, garage sales, and Xen</h3><p><strong>Tristan Handy: You&#8217;ve done a lot over the last 20 years. Before we get into specifics, can you just share a little about your journey as a software engineer?</strong></p><p><strong>Andy Warfield:</strong> I just like playing with computers.  I studied computer science in Ontario for undergrad, then moved to Vancouver for grad school, then to the UK for a PhD. 
I worked on operating systems, low-level stuff. I got to work on a hypervisor called Xen, which ended up being used by a lot of cloud providers, including Amazon.</p><p>After that, I did a couple of startups, one around Xen. Then I became a professor at UBC, teaching operating systems, networking, and security. Later, I did another startup in storage, and eventually I joined Amazon.</p><p>Now I have this highfalutin role&#8212;VP and engineer&#8212;working across S3, other storage services, and now a bunch of analytics services too. I get to cause trouble in lots of different parts of the cloud.</p><p><strong>VP slash distinguished engineer&#8212;does that mean you just get to march around telling people how to improve their stuff?</strong></p><p>People love that! I&#8217;d say about half the time I&#8217;m causing trouble&#8212;starting things and encouraging new ideas&#8212;and the other half I&#8217;m helping teams dig out from those ideas. Sometimes I take over a team if we&#8217;re doing something especially interesting or innovative, just so I can be closer to the action.</p><p><strong>That sounds like a pretty good gig if you can get it.</strong></p><p>It&#8217;s amazing. I&#8217;ve been here nearly eight years, and I still love this job.</p><div><hr></div><h3>The rise of virtualization and the origin of Xen</h3><p><strong>I want to talk about Xen. You said you were always interested in operating systems, which is kind of a niche fascination. What drew you in?</strong></p><p>When I was a kid, we didn&#8217;t have much money, so I built computers from garage sale parts in Ottawa. In high school, I found this federal government warehouse that sold off old equipment. I started a little business buying pallets of hardware for cheap, fixing them up, and reselling.</p><p>It was chaotic&#8212;but I learned a lot. I dealt with machines like IBM DisplayWriters with 8-inch floppy disks and massive dot-matrix printers. 
Getting them working meant diving into their software and systems.</p><p>Eventually I played with Linux, hacked on the kernel, and that all led me into OS research and development.</p><p><strong>Tristan: So what is a hypervisor, and why did virtualization become so important in the 2000s?</strong></p><p><strong>Andy:</strong> There were two big drivers: server utilization and isolation.</p><p>Companies had racks full of 1U servers, most of which sat idle most of the time. But they couldn&#8217;t share workloads because apps weren&#8217;t isolated well&#8212;config conflicts, shared resources, etc.</p><p>Virtualization allowed multiple operating systems to run on the same hardware, with isolation. It also let you consolidate servers, which had big cost and efficiency benefits.</p><p>There was also a technical challenge: x86 processors weren&#8217;t designed to be virtualized. That made it a really interesting research problem. We wanted to see if it could even be done&#8212;and done efficiently.</p><p><strong>Tristan: And Intel eventually started building virtualization support into the hardware?</strong></p><p><strong>Andy:</strong> Exactly. Our work on Xen and similar projects showed it was possible. That pushed Intel and AMD to add features like VT-x, which made it easier and more performant to run hypervisors.</p><p><strong>Tristan: How did AWS end up using Xen?</strong></p><p><strong>Andy:</strong> I wasn&#8217;t part of those internal conversations, but the story goes that a small startup in Cape Town, South Africa, was building a control plane for Xen. That team got picked up by AWS and became the basis for EC2.</p><div><hr></div><h3>Understanding Amazon S3</h3><p><strong>Tristan: Let&#8217;s switch to S3. I think a common mental model is that S3 is just a big pool of SSDs. But that&#8217;s clearly not the whole story. 
How do you explain what S3 actually is?</strong></p><p><strong>Andy:</strong> That&#8217;s one of my favorite questions.</p><p>Early on, S3 was like a storage locker. You&#8217;d rent space to stash things you didn&#8217;t need right away&#8212;backups, static files, CDN origins. Latency wasn&#8217;t great, but durability and availability were.</p><p>Things really changed when the Hadoop community built S3A&#8212;an adapter to let Hadoop use S3 instead of HDFS. Suddenly, we had people doing real analytics on S3. The system had enough drives to support massive parallel reads.</p><p>Today, workloads are way more demanding. Performance, consistency, and latency matter. We&#8217;ve been evolving the system constantly to meet those needs.</p><p><strong>Tristan: Are we talking about billions of hard drives?</strong></p><p><strong>Andy:</strong> I can&#8217;t share exact numbers, but yes&#8212;it's a lot of hard drives. Some of our largest customers have data spread across <em>millions</em> of drives. And most drives are shared across multiple customers.</p><p><strong>Tristan: And these aren&#8217;t SSDs?</strong></p><p><strong>Andy:</strong> Mostly spinning disks, actually. Hard drives are terrible at latency, but they&#8217;re cheap and good for bursty workloads. Spreading your data across many disks lets you take advantage of parallelism.</p><div><hr></div><h3>S3&#8217;s durability, performance, and scale</h3><p><strong>Tristan: Let&#8217;s talk about S3&#8217;s durability promise: 11 nines. How do you achieve that?</strong></p><p><strong>Andy:</strong> We use erasure coding&#8212;a form of RAID-like redundancy that lets you split data into parts and parity blocks. Then we store those shards across different availability zones.</p><p>We constantly monitor for failures. Disks die all the time, so we have fleets of processes repairing and maintaining durability. It&#8217;s not static. 
It&#8217;s a living system.</p><p><strong>Tristan: You must have incredibly precise failure models.</strong></p><p><strong>Andy:</strong> We do. We track failure rates, temperature sensitivity, vendor behavior&#8212;everything. That allows us to be proactive and surgical in how we manage risk.</p><div><hr></div><h3>From Parquet to Iceberg to S3 table buckets</h3><p><strong>Tristan: I want to talk about table formats. Parquet is everywhere now. And then we got Hive Metastore, then Iceberg. Why did S3 launch table buckets?</strong></p><p>Parquet is great, but it&#8217;s just files. Customers kept asking for more structured semantics: schema evolution, upserts, ACID transactions.</p><p>We saw Iceberg adoption grow rapidly&#8212;especially among our largest analytics customers. But they were struggling with operational complexity: too many small files, custom compactors, brittle catalogs.</p><p>So we launched S3 table buckets to bring native Iceberg support to S3. That includes:</p><ul><li><p>Automatic compaction</p></li><li><p>A REST catalog</p></li><li><p>High-performance access</p></li></ul><p>We wanted to make it easier to treat Iceberg as a storage primitive, not just an analytics backend.</p><p><strong>So this is a shift in philosophy&#8212;S3 isn&#8217;t just object storage, it&#8217;s now table-aware?</strong></p><p>Exactly. Historically, S3 was just where you stored objects. Now, we&#8217;re thinking more about what those objects <em>mean</em>.</p><p>We also launched S3 object metadata tables&#8212;a way to semantically describe and query your object store, especially useful for AI workloads using retrieval-augmented generation (RAG).</p><div><hr></div><h3>The future of open data and S3</h3><p><strong>What does the future of S3 look like? Where&#8217;s this going?</strong></p><p>We&#8217;re headed toward more structure, more semantics, and more performance.</p><p>Inference workloads are scaling fast. 
AI models are hitting S3 hundreds of thousands of times per second to do vector lookups. That&#8217;s changing how we think about indexing, metadata, and latency.</p><p>We want to make S3 the best place to do open, flexible, high-scale data work&#8212;from tables to training data to retrieval.</p><h2>Chapters</h2><p><strong>[01:42] Meet Andy Warfield</strong></p><p>Andy shares his background, including startups, professorship, and his current role as VP &amp; Senior Principal Engineer at AWS.</p><p><strong>[05:10] From garage sales to hypervisors</strong></p><p>Andy describes his early passion for hardware, OS development, and the origin story behind the Xen hypervisor.</p><p><strong>[08:50] Why virtualization took off in the 2000s</strong></p><p>Exploring why isolation, utilization, and technical curiosity fueled the rise of hypervisors.</p><p><strong>[14:30] Xen vs. VMware and the road to AWS</strong></p><p>How Xen became the default for EC2 and the technical differences between virtualization approaches.</p><p><strong>[17:35] The origin of EC2 and S3</strong></p><p>How a team from Cape Town helped launch AWS compute&#8212;and the early days of cloud services.</p><p><strong>[20:00] What is S3, really?</strong></p><p>Andy breaks down the mental model behind S3: not just object storage, but a scalable data platform.</p><p><strong>[22:49] How many drives? 
More than you think</strong></p><p>Why S3 storage spans millions of drives&#8212;and how AWS uses scale to deliver performance.</p><p><strong>[28:10] The 11 nines durability model</strong></p><p>Inside S3&#8217;s approach to reliability, failure tolerance, and background repairs using erasure coding.</p><p><strong>[32:00] Tail latency and engineering for bursty workloads</strong></p><p>Why slow requests matter, and how S3 teams optimize for streaming, AI, and analytics use cases.</p><p><strong>[35:20] Iceberg, metadata, and table buckets</strong></p><p>The emergence of Apache Iceberg as a table format&#8212;and AWS&#8217;s new structured storage approach.</p><p><strong>[38:00] Why S3 added a REST catalog and compaction</strong></p><p>How AWS is simplifying the operational burden of working with Iceberg at scale.</p><p><strong>[40:00] A new mental model for object storage</strong></p><p>S3 is no longer just about storing files&#8212;it&#8217;s about managing semantics, lineage, and trust.</p><p><strong>[44:00] Looking ahead: S3, RAG, and semantic metadata</strong></p><p>How S3 is preparing for the next wave of AI, inference, and context-aware applications.</p><p><strong>[47:20] Is Iceberg ready for enterprise?</strong></p><p>Andy shares thoughts on enterprise readiness, performance tradeoffs, and real-world adoption of table formats.</p><p><strong>[49:05] Wrap-up and reflections</strong></p><p>Tristan and Andy reflect on the conversation and where data infrastructure is headed next.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[From Docker to Dagger (w/ Solomon Hykes)]]></title><description><![CDATA[The creator of Docker on how containers changed everything]]></description><link>https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes</link><guid isPermaLink="false">https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 22 Jun 2025 13:00:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18dacb78-748c-463d-9553-ed6186da36e1_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this season of the Analytics Engineering podcast, Tristan is digging deep into the world of developer tools and databases. There are few more widely used developer tools than Docker. From its launch back in 2013, Docker has completely changed how developers ship applications. </p><p>In this episode, Tristan talks to Solomon Hykes, the founder and creator of <a href="https://www.docker.com/">Docker</a>. They trace Docker&#8217;s rise from startup obscurity to becoming foundational infrastructure in modern software development. Solomon explains the technical underpinnings of containerization, the pivotal shift from platform-as-a-service to open-source engine, and why Docker&#8217;s developer experience was so revolutionary. </p><p>The conversation also dives into his next venture <a href="https://dagger.io/">Dagger</a>, and how it aims to solve the messy, overlooked workflows of software delivery. 
Bonus: Solomon shares how AI agents are reshaping how CI/CD gets done and why the next revolution in DevOps might already be here.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><p><strong>Tristan Handy: I want to get you to give a little background on yourself, where you've been, what you've been up to for the last couple decades. I think many people will know you as the person who kicked off an avalanche that changed how we interact with compute environments by inventing Docker?</strong></p><p><strong>Solomon Hykes: </strong>Docker is the thing I'm known for. Pre-Docker, I grew up in France. 
I studied programming in a French school called Epitech. It was a brand-new, unconventional school where you learned through nonstop programming, which I loved.</p><p>Eventually, I got exposed to startups, despite being a complete outsider. I met someone who told me about them, and it stuck in my mind. Still in France at the time, I moved into my mom's house in the suburbs of Paris and worked out of the basement.</p><p>By complete luck, I got into an early version of Y Combinator in 2010. That got us on the path to what would become Docker three years later. In 2013, we pivoted to Docker from our previous company, dotCloud.</p><p><strong>Tristan Handy: The original thing was called dotCloud, right?</strong></p><p><strong>Solomon Hykes: </strong>Yep. It was about container technology and its potential, but we didn't quite know how to take it to market. DotCloud was about deploying and hosting people's apps&#8212;platform as a service&#8212;competing with Heroku and many clones.</p><p><strong>Tristan Handy: When did Heroku become a thing?</strong></p><p><strong>Solomon Hykes: </strong>I became aware of it in 2009, just as I was struggling in France with container tech. When we joined YC in 2010, we packaged that tech into dotCloud, our hosting platform. Our differentiator was using containers under the hood when others didn&#8217;t. That let us support many language stacks and even run databases in containers&#8212;which was unheard of at the time.</p><p>Platform as a service was a tough business. Most startups went out of business or got acquired early. Eventually, we pivoted from selling the car to building an ecosystem around the engine&#8212;that became Docker.</p><p><strong>Tristan Handy: Did you pivot because selling the car wasn't working? Or because people kept pointing at the engine saying, &#8220;Give me that&#8221;?</strong></p><p><strong>Solomon Hykes: </strong>Both. It was hard to market platforms.
Developers expected free hosting, and hosting costs money. Margins were tight because of AWS. It always felt like pushing a boulder uphill. Meanwhile, people wanted to run things locally. There was no good ecosystem for that. Docker provided transparency, flexibility, and portability.</p><p><strong>Tristan Handy: Can you define Docker and containerization, and how it differs from virtualization?</strong></p><p><strong>Solomon Hykes: </strong>Sure. Virtualization splits a physical machine into virtual ones using VMs&#8212;each with its own memory, compute, and storage. It gives flexibility, but with overhead.</p><p>Containerization does something similar but at the operating system level. Instead of virtualizing the machine, you split the OS itself. It&#8217;s mostly done with Linux, which can subdivide itself into isolated units. Containers are more lightweight, letting you run hundreds or thousands, unlike VMs where you might manage a handful before hitting limits.</p><p>Docker didn&#8217;t invent this, but we solved new problems with it.</p><p><strong>Tristan Handy: I remember creating my first Docker container around 2015. I expected a slow boot-up like a VM, but it was instantaneous. Where is the OS in that setup?</strong></p><p><strong>Solomon Hykes: </strong>Great question. Docker relies on Linux. When you're on a Mac, it runs Linux behind the scenes&#8212;today via virtualization. Back then, we used lots of early, rough tools and kernel patches to make Linux containers work. Docker put all the pieces together in a coherent way.</p><p><strong>Tristan Handy: So containerization wasn&#8217;t new, but Docker made it accessible?</strong></p><p><strong>Solomon Hykes:</strong><br>Exactly. The Linux kernel had features like namespaces and cgroups&#8212;building blocks for containers. But they weren&#8217;t user-friendly. We made a developer-centric abstraction on top of those tools.</p><p>And Linux provided a massive compatibility layer. 
Unlike Java, which required writing your app in Java, Docker containers could wrap apps written in any language, as long as they ran on Linux.</p><p><strong>Tristan Handy: So Docker is like infrastructure as code&#8212;a primitive that enables the whole concept?</strong></p><p><strong>Solomon Hykes: </strong>Yes! And because we wanted ubiquity, we avoided pushing too many opinions. We let developers build on top of it in many different ways. That&#8217;s what helped Docker become a de facto standard.</p><p><strong>Tristan Handy: How fragmented is the Linux world under the hood? Did you have to do much abstraction work?</strong></p><p><strong>Solomon Hykes: </strong>We were lucky. The Linux kernel is extremely stable and consistent. But everything above it&#8212;distros, package managers, tooling&#8212;was chaotic. That chaos created the opportunity for Docker to provide a consistent experience.</p><p><strong>Tristan Handy: Were there any drawbacks? Like &#8220;Docker sprawl&#8221; the way VMware saw VM sprawl?</strong></p><p><strong>Solomon Hykes: </strong>Definitely. With power comes chaos. Teams would run dozens of Docker containers, each configured differently. Docker doesn&#8217;t enforce opinions&#8212;by design.</p><p><strong>Tristan Handy: And what happened when you left Docker in 2018?</strong></p><p><strong>Solomon Hykes: </strong>I took time off, became a full-time dad. But I also realized how many unsolved problems remained. Especially around CI/CD pipelines and software delivery&#8212;what we now call the software factory.</p><p>That led me to start Dagger.</p><p><strong>Tristan Handy: So Dagger is like &#8220;containers for pipelines&#8221;?</strong></p><p><strong>Solomon Hykes: </strong>Yes. Just as Docker standardized app deployment, Dagger aims to standardize and containerize software delivery. CI/CD pipelines today are often duct-taped together with YAML and bash scripts. 
We&#8217;re bringing consistency and modularity to that space.</p><p><strong>Tristan Handy: Will there be a &#8220;Daggerfile&#8221; like there&#8217;s a Dockerfile?</strong></p><p><strong>Solomon Hykes: </strong>Sort of. But this time, we&#8217;re opinionated. Dagger is narrowly focused on CI/CD. That lets us provide APIs, SDKs, and a deeper abstraction stack. We give platform engineers a DAG-based system to define repeatable, containerized steps.</p><p><strong>Tristan Handy: And what&#8217;s the role of AI and agents in all this?</strong></p><p><strong>Solomon Hykes: </strong>Great question. We didn&#8217;t plan for it, but our community showed us the way. People started building AI agents that run in Dagger pipelines&#8212;automating things like writing tests, submitting PRs, and optimizing builds.</p><p>That blew our minds. Agents blur the line between development and delivery. They need programmable environments. Dagger is becoming an ideal platform for that.</p><h2>Chapters</h2><p><strong>01:30 &#8211; Early Days: From France to dotCloud</strong></p><p>Solomon shares how his early programming experience and startup journey led to the creation of dotCloud.</p><p><strong>04:00 &#8211; The PaaS Struggle and Birth of Docker</strong></p><p>The team pivots from platform-as-a-service to focusing on the container engine itself&#8212;what would become Docker.</p><p><strong>07:00 &#8211; What Is a Container, Really?</strong></p><p>Solomon explains containerization vs. 
virtualization in plain terms and why it changed the game for developers.</p><p><strong>11:00 &#8211; The Developer Experience That Won the World</strong></p><p>The magic of fast, lightweight Docker containers&#8212;and how that first &#8220;wow&#8221; moment felt.</p><p><strong>14:00 &#8211; Building a Ubiquitous Standard</strong></p><p>Why Docker stayed narrow by design, resisting feature bloat to maximize compatibility.</p><p><strong>18:00 &#8211; DevOps Before DevOps</strong></p><p>How Docker avoided language tribalism and achieved mass developer adoption by choosing Go and CLI-first tooling.</p><p><strong>21:00 &#8211; Complexity and Container Sprawl</strong></p><p>Docker made infrastructure easy&#8212;but created new operational challenges at scale.</p><p><strong>24:30 &#8211; Why CI/CD Pipelines Are Still Broken</strong></p><p>Solomon outlines the gap Docker never got to fix: modern software delivery remains brittle and ad hoc.</p><p><strong>27:00 &#8211; Enter Dagger: DevOps for the Modern Age</strong></p><p>How Solomon&#8217;s new company is treating pipelines as composable software, not brittle scripts.</p><p><strong>30:00 &#8211; Building an OS for the Software Factory</strong></p><p>Dagger helps platform teams manage the complexity of software delivery with reusable, testable components.</p><p><strong>33:00 &#8211; Agent-Native Workflows: A Surprise Use Case</strong></p><p>AI agents begin using Dagger to reason about pipelines, generate code, and submit pull requests autonomously.</p><p><strong>37:00 &#8211; Reimagining the Dev Loop with AI</strong></p><p>Why the boundary between development and CI/CD is collapsing&#8212;and how Dagger fits the agent-powered future.</p><p><strong>41:00 &#8211; Scaling Trust in Delivery</strong></p><p>Tristan and Solomon reflect on how developer tooling evolves and what a stable, fast delivery layer enables.</p><p><strong>45:00 &#8211; Final Thoughts: What&#8217;s Next for DevOps</strong></p><p>The conversation closes 
with predictions on intelligent automation, composability, and the future of platform engineering.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The history and future of the data ecosystem (w/ Lonne Jaffe)]]></title><description><![CDATA[Mainframes, relational databases, ETL, Hadoop, the cloud, and all of it]]></description><link>https://roundup.getdbt.com/p/the-history-and-future-of-the-data</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-history-and-future-of-the-data</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Jun 2025 13:02:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2a174d40-0d03-4fc7-a541-830573130b6e_1680x1200.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>In this decades-spanning episode, Tristan talks with Lonne Jaffe, Managing Director at Insight Partners and former CEO of Syncsort (now Precisely), to trace the history of the data ecosystem&#8212;from its mainframe origins to its AI-infused future.</p><p>Lonne reflects on the evolution of ETL, the unexpected staying power of legacy tech, and why AI may finally erode the switching costs that have long protected incumbents. In an era of AI and open standards, the future of the data ecosystem looks bright.
</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Episode chapters</h2><p><strong>00:46 &#8211; Meet Lonne Jaffe: background &amp; career journey</strong></p><p>Lonne shares his career highlights from Insight Partners, Syncsort/Precisely, and IBM, including major acquisitions and tech focus areas.</p><p><strong>04:20 &#8211; The origins of Syncsort &amp; sorting in mainframes</strong></p><p>Discussion on why sorting was a critical early problem in hierarchical databases and how early systems like IMS worked.</p><p><strong>07:00 &#8211; M&amp;A as innovation strategy</strong></p><p>How Syncsort used inorganic growth to modernize its
platform, including an early example of migrating data from IMS to DB2 without rewriting apps.</p><p><strong>09:35 &#8211; Technical vs. strategic experience</strong></p><p>Tristan probes Lonne&#8217;s technical depth despite his business titles; Lonne shares his background in programming and a fun fact about juggling.</p><p><strong>11:55 &#8211; Why this history matters</strong></p><p>Tristan sets up the key question: what lessons from 1970s-2000s ETL tooling still shape the modern data stack?</p><p><strong>13:00 &#8211; Proto-ETL: The real OGs</strong></p><p>Lonne traces the origins of ETL to 1970s CDC, JCL, and early IBM tools. Prism Solutions in 1988 gets credit as the first real ETL startup.</p><p><strong>15:40 &#8211; Rise of the ETL market (1990s)</strong></p><p>From Prism to Informatica and DataStage&#8212;early 90s vendors brought visual development to what was once COBOL-heavy backend work.</p><p><strong>18:00 &#8211; Why people offloaded Teradata to Hadoop</strong></p><p>Exploring how cost, contention, and capacity drove ETL out of the warehouse and into Hadoop in the 2000s.</p><p><strong>20:00 &#8211; Performance vs. 
price: Jevons Paradox in ETL</strong></p><p>Why lower compute and storage costs led to <em>more</em> ETL, not less&#8212;and how parallelization changed the game.</p><p><strong>22:30 &#8211; Evolution of data management suites</strong></p><p>How ETL expanded into app-to-app integration, catalogs, metadata management, and why these bundles got bloated.</p><p><strong>25:00 &#8211; Rise of data prep &amp; self-service analytics</strong></p><p>Tools like Kettle, Pentaho, and Tableau mirrored ETL for business users&#8212;spawning a whole &#8220;data prep&#8221; category.</p><p><strong>27:30 &#8211; Clickstream, logs &amp; big data chaos</strong></p><p>How clickstream and log data changed the ETL landscape, and the hope (and letdown) of zero-copy analytics.</p><p><strong>29:10 &#8211; Why is old software so sticky?</strong></p><p>Tristan and Lonne explore the economics of switching costs, the illusion of freedom, and whether GenAI could break the lock-in.</p><p><strong>33:30 &#8211; Are old tools actually&#8230; good?</strong></p><p>Defending mainframes and 30-year-old databases like Cache. Sometimes the mature option is better&#8212;just not sexy.</p><p><strong>36:00 &#8211; The new vs. the durable</strong></p><p>Modern tools must prove themselves against decades of reliability and robustness in finance, healthcare, and compliance.</p><p><strong>38:20 &#8211; GenAI in data: The early movers</strong></p><p>Lonne highlights why companies like Atlan and dbt Labs are in the best position to win&#8212;distribution, trust, and product maturity.</p><p><strong>41:00 &#8211; TAM and the Jevons Paradox, again</strong></p><p>Revisiting how price drops expand TAM. 
Some categories vanish, others explode&#8212;depending on elasticity of demand.</p><p><strong>43:15 &#8211; Unlocking new personas with LLMs</strong></p><p>Structured data access for non-technical users is finally viable, but &#8220;it has to be right&#8221;&#8212;trust and quality remain the barrier.</p><p><strong>46:00 &#8211; Real-world examples: dbt&#8217;s MCP server win</strong></p><p>Tristan shares how dbt&#8217;s Metadata API became a catalog replacement for a traditional financial institution&#8212;an unplanned AI GTM success.</p><p><strong>48:30 &#8211; Agents, not interfaces</strong></p><p>New pattern: LLMs as agents interacting directly with infrastructure via APIs. Tool use is becoming table stakes for AI integration.</p><p><strong>50:30 &#8211; Are LLMs birthright tools yet?</strong></p><p>Discussion around adoption of ChatGPT Enterprise, Claude, etc. Lonne suggests adoption is accelerating fast&#8212;and the usage model matters.</p><p><strong>52:00 &#8211; Looking ahead</strong></p><p>The conversation ends with a reflection on GenAI&#8217;s near future in data workflows, TAM expansion, and what the next episode might tackle.</p><div><hr></div><h2>Key takeaways from this episode</h2><p><strong>Tristan Handy: You've had a long career in tech. Maybe start by giving us the 30,000-foot view of what you've been up to over the last couple decades?</strong></p><p><strong>Lonne Jaffe:</strong> I&#8217;ve been at Insight Partners for about eight years now, working mostly on deep tech investments&#8212;AI infrastructure companies like Run AI and <a href="http://Deci.ai">deci.ai</a>, both acquired by Nvidia. I&#8217;ve also done work with data infrastructure companies like SingleStore. Before Insight, I was CEO of a portfolio company called Syncsort, now Precisely. It was founded in 1968.</p><p>Prior to that, I was at IBM for 13 years, working in middleware and mainframe technologies. 
Products like WebSphere, CICS, and TPF&#8212;foundational systems for enterprise computing.</p><p><strong>Tristan Handy: And Syncsort's origin was in sorting, right? Literally sorting files?</strong></p><p><strong>Lonne Jaffe:</strong> Exactly. In the early days of computing, sorting was a huge part of what you did. Much of the data was hierarchical&#8212;stored in IMS&#8212;and had to be flattened into files to process. The algorithms were optimized to run in extremely resource-constrained environments.</p><p><strong>Tristan Handy: Fascinating. And I assume as compute and storage improved, the data integration landscape evolved?</strong></p><p><strong>Lonne Jaffe:</strong> Yes. We saw a move from hierarchical to relational databases, then toward ETL tools in the 80s and 90s. The first real ETL startup was probably Prism Solutions in 1988. Informatica and DataStage showed up in the early 90s, followed by Talend and others.</p><p><strong>Tristan Handy: It seems like we got a whole bundle of tools over time&#8212;ETL, CDC, app integration, metadata, and so on.</strong></p><p><strong>Lonne Jaffe:</strong> Yes, often bundled together, even though data prep and app integration were treated separately. That persisted for longer than you'd expect. At Syncsort, we acquired a company with a "transparency" solution that allowed IMS applications to use data stored in DB2 without rewriting code&#8212;a clever way to manage switching costs.</p><p><strong>Tristan Handy: Speaking of switching costs&#8212;why are these legacy tools so sticky?</strong></p><p><strong>Lonne Jaffe:</strong> Great question. In many cases, no customer loves the product. They&#8217;d switch in a heartbeat&#8212;if it were easy. But rewriting jobs and ensuring reliability is a heavy lift. The best outcome is a new system that replicates old functionality. 
And for many organizations, that&#8217;s not worth the risk.</p><p><strong>Tristan Handy: But if generative AI could reduce those switching costs?</strong></p><p><strong>Lonne Jaffe:</strong> That&#8217;s the potential. Code generation, agents that explore and iterate&#8212;those could erode the moat that&#8217;s protected these incumbents for decades. Not tomorrow, but it&#8217;s a real possibility.</p><p><strong>Tristan Handy: It also seems like some of these systems are more robust than people give them credit for.</strong></p><p><strong>Lonne Jaffe:</strong> Absolutely. Mainframes are IO supercomputers. Products like InterSystems Cache, used by Epic, are incredibly performant. But new systems must match or exceed those capabilities in reliability and scale, which is a high bar.</p><p><strong>Tristan Handy: As you look at the evolution of the modern data stack, how do you think about its impact on the market?</strong></p><p><strong>Lonne Jaffe:</strong> In the 2010s, we saw disaggregation&#8212;tools like Fivetran, dbt, and Snowflake each tackled a slice of the old enterprise bundle. But the TAM isn&#8217;t infinite. Some categories may compress or vanish entirely if price drops aren&#8217;t offset by new demand.</p><p><strong>Tristan Handy: Do you think AI expands or compresses the data stack?</strong></p><p><strong>Lonne Jaffe:</strong> It depends. High elasticity of demand&#8212;like with dashboards or analytics&#8212;can drive massive TAM expansion. But some categories, like logo redesign or simple data movement, might get commoditized. For more complex workflows, AI agents accessing platforms like dbt or Atlan could dramatically increase value by automating common tasks and enabling new personas.</p><p><strong>Tristan Handy: We&#8217;ve seen an example already&#8212;a customer replaced their data catalog with our dbt Cloud metadata server and AI interface.</strong></p><p><strong>Lonne Jaffe:</strong> That&#8217;s telling. 
If AI interfaces can connect to tools like dbt and generate value&#8212;self-service, documentation, lineage&#8212;it changes the game. Especially for organizations already standardized on those platforms.</p><p><strong>Tristan Handy: What&#8217;s your view on how these AI interfaces get distributed?</strong></p><p><strong>Lonne Jaffe:</strong> ChatGPT Enterprise, Claude, and others are spreading fast. Eventually, you&#8217;ll want those tools to search files, access internal metadata, and interact with your data stack&#8212;not just answer questions from the open web.</p><p><strong>Tristan Handy: It makes a lot of sense. If AI is going to serve enterprise users, it needs access to the real data. Otherwise, it&#8217;s just a toy.</strong></p><p><strong>Lonne Jaffe:</strong> Exactly. A model that can&#8217;t query or verify against your actual environment won&#8217;t be reliable. And data quality and observability&#8212;something dbt Cloud is already good at&#8212;become foundational.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Everything terminals (w/ Zach Lloyd)]]></title><description><![CDATA[The universal integration layer...the command line? 
Tristan talks terminals with Zach Lloyd, the founder of Warp]]></description><link>https://roundup.getdbt.com/p/everything-terminals-w-zach-lloyd</link><guid isPermaLink="false">https://roundup.getdbt.com/p/everything-terminals-w-zach-lloyd</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 25 May 2025 13:01:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d276a840-287d-4ae3-882b-42115f46cfc5_1680x1200.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>In this episode, Tristan talks with Zach Lloyd, founder of <a href="https://www.warp.dev/">Warp</a>&#8212;a terminal built for the modern era, including for AI agents. They explore the history of terminals, differences between terminals and shells, and what the future might look like. In a world driven by generative AI, the terminal could once again be the control center of computer usage.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>Join Tristan May 28 at the <strong><a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___">2025 dbt Launch Showcase</a></strong> for the latest features landing in dbt to empower the next era of analytics.
We'll see you there.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1929918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/163234704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p><strong>01:00 &#8211; Introducing Warp and Zach Lloyd</strong></p><ul><li><p>Zach Lloyd explains Warp's origin, mission, and initial vision.</p></li></ul></li><li><p><strong>02:40 &#8211; Why redesign the terminal?</strong></p><ul><li><p>Zach describes why traditional terminal UX was ripe for reinvention.</p></li></ul></li><li><p><strong>04:43 &#8211; Enter LLMs: A new direction for Warp</strong></p><ul><li><p>Warp evolves into a natural language interface for developer workflows.</p></li></ul></li><li><p><strong>06:34 &#8211; What is a shell?</strong></p><ul><li><p>Zach defines shells, how they process 
text, and their role in the CLI ecosystem.</p></li></ul></li><li><p><strong>07:58 &#8211; Shells vs programs vs built-ins</strong></p><ul><li><p>Distinguishing between shell commands and standalone programs.</p></li></ul></li><li><p><strong>10:00 &#8211; Why do developers debate shells?</strong></p><ul><li><p>Features, syntax, and licensing behind the Bash vs Z Shell discussion.</p></li></ul></li><li><p><strong>12:17 &#8211; Why terminals still matter</strong></p><ul><li><p>The enduring power of text-based computing and scripting.</p></li></ul></li><li><p><strong>16:40 &#8211; What is a terminal, really?</strong></p><ul><li><p>Clarifying the difference between terminal hardware, emulators, and modern terminal apps.</p></li></ul></li><li><p><strong>20:13 &#8211; The Warp interface</strong></p><ul><li><p>Zach breaks down Warp&#8217;s UI: input editor, output blocks, and mouse support.</p></li></ul></li><li><p><strong>22:48 &#8211; Will Warp replace your IDE?</strong></p><ul><li><p>The vision of AI-driven development and the convergence of terminal, editor, and chat.</p></li></ul></li><li><p><strong>27:20 &#8211; Rethinking development interfaces</strong></p><ul><li><p>Finding the ideal hub for AI-native software development.</p></li></ul></li><li><p><strong>35:00 &#8211; Why the terminal has an edge</strong></p><ul><li><p>Advantages of the terminal for cross-project, full-lifecycle developer tasks.</p></li></ul></li><li><p><strong>37:10 &#8211; Bottom-up adoption strategy</strong></p><ul><li><p>How Warp approaches growth: focus on individual developers, not top-down mandates.</p></li></ul></li><li><p><strong>39:50 &#8211; Is Warp redefining the terminal?</strong></p><ul><li><p>The challenges of innovating in a legacy-dominated space and creating a new category.</p></li></ul></li><li><p><strong>42:45 &#8211; Developer control &amp; context in Warp</strong></p><ul><li><p>Customization, context-awareness, and MCP integration in Warp&#8217;s AI 
tooling.</p></li></ul></li><li><p><strong>46:32 &#8211; Closing reflections</strong></p><ul><li><p>Zach and Tristan wrap up their thoughts on the future of terminals, AI, and developer tools.</p></li></ul></li></ul><h2>Key takeaways from this episode</h2><p><strong>Tristan Handy: Can you tell us about Warp, where the idea came from, and where you&#8217;re at today?</strong></p><p><strong>Zach Lloyd:</strong> Warp reimagines the command line to make it more approachable, powerful, and useful for developers. I've been a software engineer for over 20 years and always used the terminal, but never understood why it worked the way it did. I used to learn the minimum I needed and rely on team members when I ran into issues.</p><p>After my last startup, I looked at tools I used frequently that could have a big impact if improved. The terminal stood out. I realized better UX&#8212;like being able to use a mouse to position the cursor or select output for copy-paste&#8212;could unlock a lot of productivity. That was the initial idea about five years ago.</p><p>We spent the first couple of years redesigning the interface. Today, Warp is more than a terminal&#8212;it's a natural language interface to the command line, powered by large language models (LLMs). You can use it to set up projects, write code, debug production, and more.</p><p><strong>Tristan: I want to dig into fundamentals. Can you define what a shell is?</strong></p><p><strong>Zach:</strong> A shell is a program that parses text input, runs commands, and returns text output. You can run it interactively or through scripts. Terminals, by contrast, are the graphical layer that displays text and captures keyboard input. Shells like Bash, Z Shell, and Fish offer different features, syntaxes, and configurations. 
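Zach's definition of a shell, a program that parses text and runs commands, also implies the split between commands the shell handles itself and standalone programs it launches. A minimal Python probe of that split (a sketch; `shutil.which` only finds standalone executables on a typical Unix `PATH`, and some systems do ship trivial wrapper binaries for built-ins):

```python
import shutil

# "cp" is a standalone program: the shell forks a process and executes a
# binary found on PATH. "cd" must change the shell's own working directory,
# so it is a built-in and (typically) has no separate executable.
for cmd in ["cp", "cd"]:
    path = shutil.which(cmd)
    kind = f"external program at {path}" if path else "likely a shell built-in (no binary on PATH)"
    print(f"{cmd}: {kind}")
```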
Commands like <code>cd</code> are shell built-ins, which don&#8217;t require forking new processes, while programs like <code>cp</code> run as separate executables.</p><p><strong>Tristan: Why do terminals persist in a GUI-dominated world?</strong></p><p><strong>Zach:</strong> A few reasons. First, it&#8217;s easier to write command-line apps than GUI apps. Second, the interface is infinitely flexible&#8212;you can pass endless flags and parameters. Third, command-line programs interoperate cleanly via text streams. And lastly, they&#8217;re scriptable. Developers can automate repetitive workflows easily, which is powerful.</p><p><strong>Tristan: So a terminal just runs a shell. But I never think of terminals as having features. What makes a terminal more than a simple interface?</strong></p><p><strong>Zach:</strong> Terminals emulate old hardware&#8212;keyboards and text displays. Today&#8217;s terminal apps are GUI shells that simulate this behavior. Most are "dumb terminals," just rendering characters. But they can support features like theming, control characters for advanced UI (e.g., in Vim), and even bitmap rendering.</p><p><strong>Tristan: Warp looks very different. Can you describe it?</strong></p><p><strong>Zach:</strong> Warp looks more like a chat or notebook interface. Each command's output is grouped in a logical block instead of being dumped in a scroll. The input area behaves more like a code editor, with syntax highlighting and first-class mouse support. We're aiming for modern UX.</p><p><strong>Tristan: So you're blending terminal, editor, and chat. Will people eventually write all their code in Warp?</strong></p><p><strong>Zach:</strong> My vision is that developers will increasingly describe what they want in natural language, and agents will do the work. Developers supervise the results. That interface needs to support managing many tasks at once. That&#8217;s what we&#8217;re building towards. 
It won&#8217;t even be called a terminal&#8212;it&#8217;s a new category of software.</p><p><strong>Tristan: The boundaries between these tools are blurring. And maybe the best interface for AI-assisted development isn't an IDE or chat app&#8212;it could be the terminal.</strong></p><p><strong>Zach:</strong> The terminal spans all phases of development&#8212;from setup to deployment and debugging. It also supports cross-project work, which IDEs don&#8217;t. That&#8217;s a huge strength.</p><p><strong>Tristan: But terminals are a personal choice. How do you think about adoption and your business model?</strong></p><p><strong>Zach:</strong> Like editors, terminals are developer-choice tools. We don&#8217;t go top-down. Our motion is bottoms-up: get individuals to love Warp, then expand into teams and enterprises for security, privacy, and data controls.</p><p><strong>Tristan: Are you trying to reset the baseline for what a terminal is?</strong></p><p><strong>Zach:</strong> We're not open source, though we&#8217;ve considered it. It&#8217;s risky. But our focus isn&#8217;t on redefining "the terminal." It&#8217;s on building the best tool for developers to ship software. That might require a new category name.</p><p><strong>Tristan: What&#8217;s the dev experience in Warp like? Is it customizable?</strong></p><p><strong>Zach:</strong> We support theming and shortcuts. But the most important part is AI context. Warp can use any CLI tool to gather context&#8212;GitHub CLI, GCloud, etc. We&#8217;re also implementing the Model Context Protocol (MCP) and plan to better support custom/internal tools as well.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why compilers matter (w/ Lukas Schulte)]]></title><description><![CDATA[We continue our season on developer experience by looking at compilers with the SDF Labs cofounder.]]></description><link>https://roundup.getdbt.com/p/why-compilers-matter-w-lukas-schulte</link><guid isPermaLink="false">https://roundup.getdbt.com/p/why-compilers-matter-w-lukas-schulte</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Mon, 12 May 2025 12:02:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fab3c2ea-0b19-4b35-a887-c779cff0e8d3_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Tristan Handy dives deep into the world of compilers in this episode of The Analytics Engineering Podcast with Lukas Schulte, cofounder of SDF Labs (not to be confused with <a href="https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram">last episode&#8217;s guest&#8212;Lukas&#8217; dad and fellow SDF cofounder Wolfram Schulte</a>). Tristan and Lukas discuss what compilers are, how they work, and what they mean for the data ecosystem. 
SDF, which was <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">recently acquired by dbt Labs</a>, builds a world-class SQL compiler aimed at abstracting away the complexity of warehouse-specific SQL.</p><p>The conversation covers the evolution of compiler technology, what software engineering has gotten right over the past several decades, and <a href="https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering">why the data ecosystem is poised for similar transformation</a>. Lukas and Tristan explore why SQL has lagged behind other programming ecosystems, and how new compiler infrastructure could lead to package management, interoperability, and greater innovation across data platforms. It&#8217;s a fascinating (and timely) episode: <a href="https://www.getdbt.com/blog/how-to-get-ready-for-the-new-dbt-engine">Get ready for the new dbt engine</a>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>Join Tristan May 28 at the <strong>2025 dbt Launch Showcase</strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1929918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/163234704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p>02:40 The vision behind SDF Labs</p></li><li><p>04:00 What is a compiler?</p></li><li><p>05:00 Components of a compiler: frontend, IR, backend</p></li><li><p>08:00 Syntax vs. semantics and the role of parsing</p></li><li><p>10:00 Logical vs. 
physical plans in SQL compilers</p></li><li><p>13:00 Historical context: mainframes to LLVM</p></li><li><p>16:00 Cross-architecture portability in Rust &amp; other compilers</p></li><li><p>18:00 What is LLVM and why it matters</p></li><li><p>20:00 Bootstrapping and the self-recursive nature of compilers</p></li><li><p>21:00 Compilers in Java, TypeScript, and dbt</p></li><li><p>23:00 Why compilers are foundational to software ecosystems</p></li><li><p>26:00 The SQL dialect problem in data warehouses</p></li><li><p>29:00 Can SQL get its own LLVM?</p></li><li><p>31:00 How Substrate and DataFusion aim to standardize SQL</p></li><li><p>35:00 Package management and the path toward SQL abstractions</p></li><li><p>38:00 The future of the data ecosystem with a common SQL compiler</p></li></ul><h2>Key takeaways from this episode</h2><h3>What is a compiler?</h3><p><strong>Tristan Handy:</strong> What is a compiler?</p><p><strong>Lukas Schulte:</strong> It's something that takes higher-level human-readable code and translates, compiles, rewrites it into lower-level machine code that is much harder for humans to understand and much easier for machines to understand.</p><p>Compilers typically have phases. They have a frontend that deals with the language you're working with, a middle component&#8212;usually called an IR or intermediate representation&#8212;and a backend that takes that IR and compiles it into machine code.</p><h3>Compiler phases: frontend, IR, backend</h3><p><strong>Tristan Handy:</strong> How does it all come together?</p><p><strong>Lukas Schulte:</strong> There&#8217;s a preprocessor that handles macros, removes comments, and prepares the text. Then a lexer converts it into tokens. These tokens get assembled into a tree that the compiler can understand. That&#8217;s where syntax validation and semantic analysis happen.</p><p>From there, we build a logical representation of the operations we want to perform. 
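The front-end steps Lukas describes, stripping comments and then lexing text into tokens, can be sketched in a few lines of Python (a toy illustration only, not SDF's actual implementation):

```python
import re

# Toy lexer: named groups define token kinds; anything matching SKIP is dropped.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OP",     r"[=+*,()]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def lex(src: str):
    src = re.sub(r"--[^\n]*", "", src)  # preprocessor-ish step: drop comments
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src)
            if m.lastgroup != "SKIP"]

print(lex("x = x + 1  -- increment"))
# [('IDENT', 'x'), ('OP', '='), ('IDENT', 'x'), ('OP', '+'), ('NUMBER', '1')]
```

A real compiler front end would then assemble these tokens into a syntax tree before semantic analysis, as described above.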
That transitions to a physical plan, which starts considering the hardware: how many cores, how much memory, which files we&#8217;re accessing. After that, optimizations are applied and it compiles to actual machine code using a toolchain like LLVM.</p><h3>Syntax vs. semantics</h3><p><strong>Lukas Schulte:</strong> Let&#8217;s break down syntax vs. semantics.</p><p>Imagine the code<code> x = x + 1</code>. That has valid syntax. Its meaning&#8212;its semantics&#8212;is that we&#8217;re incrementing <code>x</code> by 1.</p><p>Now, you could also write <code>x += 1</code>. Different syntax, same semantics. So syntax defines structure, and semantics define meaning. That distinction is important when you&#8217;re analyzing or transforming code.</p><h3>LLVM and portability</h3><p><strong>Tristan Handy:</strong> Have we been building abstraction layers like this for decades?</p><p><strong>Lukas Schulte:</strong> Absolutely. That&#8217;s what LLVM does. It provides a consistent intermediate representation that compilers can use to target multiple backends&#8212;Intel, ARM, different OSes. Apple invested early in LLVM to support custom chips.</p><p>With Rust, for example, LLVM is what lets us build binaries that behave the same on macOS, Windows, and Linux with relatively little effort.</p><h3>Bootstrapping compilers</h3><p><strong>Tristan Handy:</strong> So there&#8217;s this recursive loop&#8212;compilers being built with other compilers?</p><p><strong>Lukas Schulte:</strong> Exactly. Rust wasn&#8217;t always written in Rust&#8212;it started in C++. Eventually, the compiler was rewritten in Rust itself. Now, Rust compiles Rust. It&#8217;s fully self-hosted. That&#8217;s common with mature languages&#8212;it shows the compiler ecosystem is stable and powerful enough to sustain itself.</p><h3>Why compilers matter</h3><p><strong>Tristan Handy:</strong> You said once that compilers are the foundation of every software ecosystem. 
What did you mean?</p><p><strong>Lukas Schulte:</strong> There are two big drivers in software: abstractions and standards. You want one way to interface with a USB device&#8212;not ten. Same for software. You want one standard way to express a Python program, a JavaScript app, etc.</p><p>Compilers enforce those standards and make sure the same code works across platforms. That consistency powers things like package managers, shared libraries, and open ecosystems.</p><h3>SQL dialects and fragmentation</h3><p><strong>Tristan Handy:</strong> Are there ecosystems that are doing worse than others?</p><p><strong>Lukas Schulte:</strong> SQL does a particularly bad job. Anyone who's used more than one data warehouse knows you can't take the same SQL statement and expect it to work the same way. Casting, case sensitivity, functions&#8212;every engine handles these things differently.</p><h3>Toward a universal SQL compiler</h3><p><strong>Tristan Handy:</strong> Can you convince me this problem is solvable?</p><p><strong>Lukas Schulte:</strong> Yes. That's what we're working on with SDF&#8212;creating a shared intermediate representation for SQL. If we can express SQL logic in a unified form, we can compile it to any dialect&#8212;BigQuery, Snowflake, Redshift, and so on.</p><p>That allows developers to build reusable libraries, just like in other languages. It also makes governance, validation, and testing easier.</p><h3>Future of data ecosystems</h3><p><strong>Tristan Handy:</strong> What would that future look like for practitioners?</p><p><strong>Lukas Schulte:</strong> One major change would be the emergence of robust SQL libraries. Today, there&#8217;s no <code>import</code> system for SQL. 
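The shared-IR idea Lukas describes can be illustrated with a deliberately tiny sketch: one logical operation (a cast) rendered to two surface syntaxes. The function name and dialect strings here are invented for illustration and have nothing to do with SDF's actual design:

```python
# Toy "compiler backend": one IR-level cast expression, two target dialects.
# Real engines disagree on cast syntax, which is exactly the fragmentation
# a shared intermediate representation would paper over.
def render_cast(column: str, target_type: str, dialect: str) -> str:
    if dialect == "postgres":
        return f"{column}::{target_type}"          # Postgres shorthand cast
    if dialect == "bigquery":
        return f"CAST({column} AS {target_type})"  # ANSI-style cast
    raise ValueError(f"unknown dialect: {dialect}")

print(render_cast("amount", "FLOAT64", "bigquery"))  # CAST(amount AS FLOAT64)
print(render_cast("amount", "numeric", "postgres"))  # amount::numeric
```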
Everyone writes similar logic over and over.</p><p>A shared compiler abstraction would let us reuse components, collaborate across companies, and build an ecosystem of packages for transformations, metrics, and validations&#8212;similar to how we use NPM or PyPI.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The evolution of databases (w/ Wolfram Schulte)]]></title><description><![CDATA[In the first episode of our season on developer experience, the cofounder and CTO of SDF Labs, now a part of dbt Labs, discusses databases, compilers, and dev tools.]]></description><link>https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Mon, 28 Apr 2025 12:02:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e4839270-d17c-40d0-94d3-06ac3a969b0f_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Summary</h3><p>Welcome to our new season of The Analytics Engineering Podcast. This season, we&#8217;re focusing on developer experience. We&#8217;ll explore the developer experience by tracing the lineage of foundational software tools, platforms, and frameworks. From compilers to modern cloud infrastructure and data systems, we&#8217;ll unpack how each layer of the stack shapes the way developers build, collaborate, and innovate today. It&#8217;s a theme that lends itself to a lot of great conversations on where we&#8217;ve come from and where we&#8217;re headed.</p><p>In our first episode of the season, Tristan talks with Wolfram Schulte. Wolfram is a distinguished engineer at dbt Labs. 
He joined the company via the <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">acquisition of SDF Labs</a>, where he was <a href="https://www.getdbt.com/blog/building-the-next-gen-dbt-engine">co-founder and CTO</a>. He spent close to two decades at Microsoft Research and several years at Meta building their data platform.</p><p>One of the amazing things about Wolfram is his love of teaching others the things that he's passionate about. In this episode, he discusses the internal workings of data systems. He and Tristan talk about <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">SQL parsers</a>, <a href="https://roundup.getdbt.com/p/the-power-of-a-plan-how-logical-plans">compilers</a>, <a href="https://docs.getdbt.com/blog/sql-comprehension-technologies">execution engines</a>, <a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle">composability</a>, and the world of heterogeneous compute that we're all headed towards. While some of this might seem a little sci-fi, it&#8217;s likely right around the corner. And Wolfram is inventing some of the tech that's going to get us there.</p><div><hr></div><p>Join Tristan May 28 at the <strong><a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase">2025 dbt Launch Showcase</a></strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase&quot;,&quot;text&quot;:&quot;Register now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase"><span>Register now</span></a></p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p>01:35 Introduction to dbt Labs and SDF Labs collaboration </p></li><li><p>04:42 Wolfram's journey from monastery to tech 
innovator </p></li><li><p>07:55 The role of compilers in database technology </p></li><li><p>11:05 Building efficient engineering systems at Microsoft </p></li><li><p>14:13 Navigating data complexity at Facebook </p></li><li><p>18:51 Understanding database components and their importance </p></li><li><p>24:44 The shift from row-based to column-based Storage </p></li><li><p>27:40 Emergence of modular databases </p></li><li><p>28:44 The rise of multimodal databases </p></li><li><p>30:45 The role of standards in data management </p></li><li><p>35:04 Balancing optimization and interoperability </p></li><li><p>36:38 Conceptual buckets for database engines </p></li><li><p>38:46 DataFusion compared to DuckDB</p></li><li><p>40:44 ClickHouse </p></li><li><p>44:20 Bridging the gap between SQL and new technologies </p></li><li><p>50:55 The future of developer experience</p></li></ul><h2>Key takeaways from this episode</h2><h3>From monastery to Microsoft: Wolfram&#8217;s journey</h3><p><strong>Tristan Handy: Can you walk us through the Wolfram Schulte origin story?</strong></p><p><strong>Wolfram Schulte: </strong>I was born in rural Germany&#8212;Sauerland&#8212;and ended up in a monastery boarding school after my father passed away. Their goal was to train monks and priests, but that didn&#8217;t stick for me.</p><p>Later I went to Berlin&#8212;back then you had to cross East Germany to get there&#8212;and began studying physics. But I realized everyone else understood physics better than I did! One day I walked past a lecture on data structures and algorithms, and I was hooked. 
I hadn&#8217;t written a line of code at that point, but I switched to computer science immediately.</p><p>After my PhD in compiler construction, I joined a startup, then landed at Microsoft Research in 1999 thanks to a chance encounter with the logician Yuri Gurevich.</p><h3>Inside Microsoft Research and Cloud Build</h3><p>At Microsoft Research, we were like Switzerland&#8212;neutral across teams like Office, Windows, and Bing. We&#8217;d invent tools and ideas, but often the business units didn&#8217;t trust them. That changed when I was asked to build an engineering org.</p><p>We created <strong>Cloud Build</strong>, a distributed build system like Google&#8217;s Bazel. It reduced build times from hours to minutes and had a huge impact on iteration speed, productivity, and even morale. People stayed in flow. Builds were faster, cheaper, and smarter&#8212;running mostly on spare capacity.</p><h3>Janitorial work at Meta: cleaning up big data</h3><p><strong>You later joined Facebook (Meta). What was that like?</strong></p><p>A different world. No titles for engineers. Egalitarian, fast-moving. I joined to clean up the data warehouse&#8212;what they called &#8220;janitorial work.&#8221; At Meta, each type of workload had its own engine: time-series, batch, streaming, etc. This made understanding lineage and dependencies across systems extremely hard.</p><p>We responded by building UPM, a SQL pre-processor that stitched metadata across engines. It became part of Meta&#8217;s privacy infrastructure and compliance tooling, especially after the fallout from Cambridge Analytica.</p><h3>Databases as compilers</h3><p><strong>Let&#8217;s shift gears. Can you walk us through how analytical databases actually work&#8212;like a professor at a whiteboard?</strong></p><p>Sure. Think of a database like a compiler:</p><ol><li><p><strong>Parsing &amp; analysis:</strong> Is the SQL valid? 
Are the types correct?</p></li><li><p><strong>Optimization:</strong> SQL is declarative, so you can reorder joins, push down filters&#8212;based on algebraic laws like associativity.</p></li><li><p><strong>Execution:</strong> Often done in parallel, especially in modern warehouses.</p></li><li><p><strong>Storage:</strong> Columnar vs. row-based; optimized formats like Parquet or ClickHouse&#8217;s custom format.</p></li></ol><p>Historically, storage and compute were bundled. Now they&#8217;re decoupled. But when the engine understands the format deeply, performance is much better.</p><h3>The rise of modular and composable data platforms</h3><p><strong>How did we get from monolithic systems to the composable database architectures we have today?</strong></p><p>It started with the rise of big data&#8212;Hadoop, HDFS, MapReduce. That decoupled compute from storage. Columnar formats like Parquet enabled analytical workloads. Then came Iceberg, Delta Lake, and similar standards that enabled multiple engines to share data.</p><p>Modern databases are modular. For example, Postgres is transactional, but you can bolt on an OLAP engine for analytical queries. You can mix and match based on your workload. The result is a data ecosystem that&#8217;s far more flexible&#8212;but also more complex.</p><h3>Engine families: Snowflake, DuckDB, ClickHouse</h3><p><strong>Can you help us bucket the different kinds of engines out there?</strong></p><p>Totally. Here are three buckets:</p><ul><li><p><strong>Cloud-native engines:</strong> Snowflake, BigQuery. They&#8217;re optimized for massive scale, often with their own proprietary storage.</p></li><li><p><strong>Embedded/single-node engines:</strong> DuckDB, DataFusion. Great for local dev or embedded analytics. DuckDB is for users; DataFusion is for database builders.</p></li><li><p><strong>Real-time/high-throughput engines:</strong> ClickHouse, Druid. 
Tuned for streaming and extremely fast aggregations.</p></li></ul><p>Each has its trade-offs. Increasingly, projects are combining these. For example, you can plug DuckDB or DataFusion into Spark to speed up leaf-node execution. The whole engine space is getting more composable&#8212;and more interchangeable.</p><h3>The role of SDF in dbt&#8217;s future</h3><p><strong>If you think about the future where SDF is fully integrated into dbt Cloud, what does that enable?</strong></p><p>Initially, it might feel the same&#8212;but faster, smarter. Longer-term, we can give developers superpowers.</p><p>Imagine your dev environment proactively surfaces:</p><ul><li><p>&#8220;This data looks different than yesterday&#8212;want to investigate?&#8221;</p></li><li><p>&#8220;You&#8217;re missing a metric that&#8217;s often used alongside this one.&#8221;</p></li><li><p>&#8220;This join will behave differently on engine X&#8212;here&#8217;s what to change.&#8221;</p></li></ul><p>That&#8217;s the kind of intelligent, predictive developer experience we&#8217;re building. We&#8217;re catching SQL up to what IDEs have done for code. And if we can make logical plans portable across engines, dbt becomes the consistent interface across heterogeneous compute.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a data team from the beginning (w/ Daniel Avancini)]]></title><description><![CDATA[How fast-growing Indicium went from Brazilian beach town to global data consultancy]]></description><link>https://roundup.getdbt.com/p/building-a-data-team-from-the-beginning</link><guid isPermaLink="false">https://roundup.getdbt.com/p/building-a-data-team-from-the-beginning</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 26 Jan 2025 14:03:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fea2932e-5bb3-43ee-b472-eac84e1d9dc6_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Daniel Avancini is the chief data officer and co-founder of <a href="https://www.indicium.tech/">Indicium</a>&#8212;a fast-growing data consultancy started in Brazil. </p><p>There are a lot of data consultancies around the world, and a lot of them do great work. What has been so fascinating about Indicium&#8217;s journey is their HR model. Rather than primarily hiring experienced professionals, they decided to go hard on training. They built a talent pipeline with courses and an internal onboarding process that takes new employees from zero to 60 over a few months.</p><p>The result has been phenomenal and Indicium delivers great client outcomes, but most importantly, they're building skills for hundreds of brand new data professionals.</p><p>Data is a hard field to break into because fundamentally you can't do the real thing unless you have access to data. 
So any company investing in building scalable hiring and training processes for analytical talent is one to be excited about.</p><p><em>This is our last episode of the season. We&#8217;ll be back very soon. Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways from this episode</h2><h3>Can you give a little bit of an introduction to you and to Indicium?</h3><p>Yeah, sure. So I'm the co-founder and CDO of Indicium. We're a data consultancy. Now we're based in New York, but we also have a presence in Latin America and Brazil where we started. We mostly focus on the data stack and new data stack tools. 
We've been helping companies adopt modern data platforms and move to new data stack tools, including dbt, for about seven years. We are a young company, but not that young in the modern data stack world.</p><h3>Tell me about your journey from starting in Latin America to expanding internationally.</h3><p>We started in a small city in Brazil called Florianopolis. It's like a tech center, like San Francisco. There are many new companies there, but it's not a hub for business or data consulting. We really started from the beginning with smaller, mid-sized regional companies, trying to find something that made sense. So we pivoted a lot in the beginning on how we could deliver value.</p><h3>You skipped the step &#8220;we wanted to start a company.&#8221; What was the original idea?</h3><p>I was working for a startup in agricultural hardware machinery. Nothing related to data or services in general. My cofounder was managing a surfboard manufacturing factory.</p><h3>That is wild. I love that. You can come to data from so many different backgrounds, including surfboard manufacturing.</h3><p>It's a beach town, so there's a lot of surfing there. We realized at the beginning that there were a lot of technology platforms for marketing analytics, data intelligence, SaaS tools. But when we talked to anyone that was making decisions, no one was really using that data.</p><p>Our first insight was that there&#8217;s a need for someone in this market to bridge the gap and bring all this really great data to companies in a more organized, value-driven way. At first that was our goal.</p><p>We were not focused on building a data platform consultancy. But as we grew, we found out that it's harder than we thought. We needed to do a lot of foundational work, especially at smaller companies. All they had were SQL databases and a lot of Excel spreadsheets, and a lot of the complex analysis we wanted to do just wasn't possible yet. 
We helped them build platforms and foundations for these companies.</p><h3>And it took off like a rocket ship? What's a sense of your scale that you want to share?</h3><p>Yeah, we're at almost 400 employees right now. So we're pretty big for this market.</p><h3>What has allowed you to become successful at the scale that you have been?</h3><p>I think what really helped us scale is that since the beginning, we have really focused on building our own teams and our own capabilities to scale. Even before we started using dbt or any of the modern data stack, we already thought that, because we are in a smaller market, we couldn't compete for data engineering talent.</p><p>At that time in Brazil, probably the same as in the US, it was a very competitive market for data engineering in general. There was a data engineering talent pool in Brazil, but it was expensive.</p><p>But there were a lot of training programs on the internet. There are a lot of data camps, Udemy, Coursera. There's so much good stuff there. But maybe there's a lack of curation, right? People want to work in this area, but how do they start? What do they have to do? So we really focused on building that talent pool, our own talent pool, right from the beginning. We were at the university just bringing in good people, good talent, from engineering, from economics, from business. &#8220;Hey guys, do you wanna work with data?&#8221; Look at this program; it's free. Just go there and train. We would bring in like 10 or 15 people; they would come to the program, and we would hire the two or three best ones.</p><p>Maybe four years ago we were starting to grow faster and we needed more people. We needed a more stable source of talent.</p><p>First, we built our own analytics engineering course. We found dbt and realized this is the way we grow because we don't need to hire experienced data engineers. 
We can hire experienced marketing analysts and train them.</p><h3>It's such a consulting hack, right? I've been excited to hear your story because I think it is so parallel to our own story. We were doing a similar thing in that we were hiring people with no data experience.</h3><p>We can grow much faster because we can hire analysts in general. We can train anyone. It was so hard to train people on Airflow and Spark at that time. But if we use dbt, we can just teach these guys how to work with data analytics. And so I built the Analytics Engineering Formation, our first course. And what we did in this course wasn't only dbt. We taught dimensional modeling, ETL, a lot of the foundational analytics work that we weren't seeing when we were trying to hire people.</p><p>But everyone at that time wanted to be a data scientist. But that's not the work. For every data scientist, you're going to find 30 analytics engineers because there&#8217;s so much more work in analytics.</p><p>We've trained more than a thousand people with this course in the past four or five years. And a lot of those people became the talent we hired: they would do the course and then we're like, yeah, we have an open position. Do you want to work for us? And so we started really hiring from this course for the analytics engineering profession.</p><p>And it really worked. And we still use the same course today for our own team. Everyone has to do the course so they understand what we do. There's a practical exam, so they need to build their own data warehouse with dbt by themselves.</p><h3>My best guess is that there's probably a million or so humans in the world that have used dbt pretty regularly. In the grand scheme of things, that&#8217;s not a big number when you're a giant consulting organization and you have a huge hiring pipeline. 
Building a practice that puts dbt at the center of it can work really well, but you have to really build the business model around it.</h3><p>But I would argue dbt is not the only one. If you think about data science, about data engineering, all these other data professions, it's really hard because there's no undergrad. People don't graduate in Airflow engineering. Everything they use at work, they learn after they start working.</p><h3>Yeah. And so the point is you have to build a talent pipeline that teaches people how to do the stuff as opposed to expecting it to already exist.</h3><h3>One of the things that people don't fully understand, unless they've been through this journey, is that it is an unbelievable level of investment to do what you've done. Consulting businesses don't generally raise a ton of venture money. It's a real strategic investment, and it's also a real risk.</h3><h3>If for some reason this doesn't work, that's a giant problem for Indicium, I would imagine. And that means that if you're going to make this type of investment, you have to feel like you have control over the technology that you're choosing. I imagine that it would be very hard for you to make this type of investment in something that was not open source. Is that a true statement?</h3><p>Probably yes. Especially because a lot of these tools, I can't really pay for the tool when I'm educating and when I'm teaching. Maybe if I have a partnership, but yes, for a lot of the work we needed to use some kind of open-source tool.</p><h3>That makes total sense. I didn't even think about it from a seats perspective. Let's say that you were gonna use Amplitude or something like that. 
You would have to figure out how to get whatever, 100 people per semester, access to Amplitude, and that requires partnership.</h3><p>And also we had to build our own courses, because if I needed to use market courses like Udemy, I would have to pay for all these courses for all of these students and it would become too expensive. So we had to invest a lot of our time just building our own training programs and materials.</p><h3>What other tools did you incorporate into the standard training?</h3><p>So what we did after a few years is, instead of just teaching dbt and analytics engineering, we created another program we call the Lighthouse program. When we open positions for analytics engineering, we get all kinds of people just because they are engineers. Then we're like, what kind of engineering? &#8220;I'm a chemical engineer.&#8221; Okay, but do you know anything about data? &#8220;No, but I'm an engineer.&#8221;</p><p>We had so much teaching to do because it's such a new market. People don't know what the work is. A lot of the undergrads still don't understand what an analytics engineer does. So the idea of the program was to be a lighthouse. Like, I'm going to show you the best career for you.</p><p>After a person joins the program, we tell them, you're going to be a data engineer, based on their competencies.</p><h3>It's like the sorting hat in Harry Potter. And is that about skill sets or personality or interests or what?</h3><p>Yeah, I really looked into skill sets and personality, and we did some personality trait tests.</p><h3>Okay, so tell me what's the personality of a data scientist versus an analytics engineer?</h3><p>Okay, that's a good one. What I did was look at how innovative people are, how drawn they are to new things. You want to build new things. I want to build new things all the time, but I also want to build reliable things. 
The new things side is for data scientists: experimenting, building new things. On the other side of the spectrum are data engineers. So I usually put the analytics engineers kind of in the middle. Like, I want to build stuff, but I also want to have reliable pipelines. And I want to build things that are closer to the business. And I want to understand the value of what I'm building.</p><p>Do you prefer to make something new but unstable, or do you prefer to have something that works every time? Just that question would filter the personalities for these professionals really well.</p><h3>I really identify with that so much. I'd be curious to hear where you fall in this spectrum, but I am a deeply impatient person and so I can't stay on one thing too long. I love making pipelines and getting them to a certain point. But then I'm like, okay, let me try something else where I'm learning about the business. Having this bi-modality, I think, is what keeps me forever engaged in this work.</h3><p>Yeah, you should probably ask Matheus, my co-founder, because he always says the same thing, but I'm really closer to the data scientist when I build new things. My background is in economics and statistics and data science. Our CTO is an engineer. He's a data engineer. He's angry if something doesn't work.</p><h3>Okay, so you developed a course. You developed an ability to funnel people into what? Data scientist, data engineer, analytics engineer?</h3><p>Now we also have data analysts, so more on the BI side; it's like an analytics engineer with deeper BI knowledge. Analytics engineering, data science, AI engineering, data engineering, and we are also adding a data consultant career. We have all these tracks.</p><p>And this program is now a six-month program, and we are paying for them to study. So that's also very risky for us.</p><h3>I'm glad you said it's risky. 
One of the things that I think you don't recognize until you run a consulting business is that it is terrifying to face attrition. Attrition is the thing that kills your business. People will quit; life happens. This is the world.</h3><h3>But when somebody quits, it's not only revenue walking out the door, but it is also your investment in them as a human walking out the door. One reason why I think it is very rare for companies to invest in people is that they are going through this J curve where they are not making money at first. Then you're slowly working your way out of it.</h3><h3>I don't know if you've done the math, but I think you could figure out how long it takes for the training you've put in to pay itself back.</h3><p>For some of these programs, it's not only training. One of the reasons we built the program is that people need to work on something. They need practice. We put them in internal projects.</p><p>We have phases. The first phase is the foundational phase, just learning. We teach them databases, APIs, cloud computing. These things are important for anyone who works with data now, but they're not familiar to someone who just graduated from college or is changing careers.</p><p>Then we have the theory phase, or the data journey; that's where they train on dbt and data engineering techniques. And then we gradually put them into projects, into real work. They can shadow someone, they can work on their own projects, and sometimes they can actually work on real projects. We can even bill some clients for their work. Most times we are able to pay for the program with the work they do inside the program. So we break even before they graduate from the Lighthouse.</p><h3>I really feel like so much innovation is business model innovation. To me, what you're describing is a new business model that lets you invest in more people who know how to do great data work. This is very cool.</h3><p>Yeah, that's what we thought. You can always invest in technology. 
But in our work in consultancy, it's really humans, right? How can you get the better humans and how can you get them to stay at your company? How can you keep them, right?</p><p>We've been very intentional on how to create the social structures that you would find in working in an office. We have a very low attrition rate. If you compare to the market, we are like six, seven, eight times lower than a competitor in our attrition rate.</p><p>And that's compared to Brazil, which has a lower attrition rate than the U.S. If you compare to the US, it's like 20 times lower than the U.S.</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;b20c27c6-36d1-4b35-9957-49b62178654f&quot;,&quot;caption&quot;:&quot;I&#8217;ve been writing this newsletter since September of 2015. This will be the 10th year I&#8217;ve had the opportunity to reflect on a year gone by and make predictions about the year ahead.&quot;,&quot;cta&quot;:null,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;lg&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Reflections and Predictions&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:1135298,&quot;name&quot;:&quot;Tristan Handy&quot;,&quot;bio&quot;:null,&quot;photo_url&quot;:null,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;post_date&quot;:&quot;2024-12-15T12:02:14.890Z&quot;,&quot;cover_image&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa422ec3a-21f7-4afb-af45-93650c59d78d_1792x1024.webp&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://roundup.getdbt.com/p/reflections-and-predictions&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:153122431,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:49,&quot;comment_count&quot;:1,&quot;publication_id&quot;:null,&quot;publication_name&quot;:&quot;The Analytics 
Engineering Roundup&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b4e3170-43ea-4f13-8662-f4b4e18cfe12_256x256.png&quot;,&quot;belowTheFold&quot;:true,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><h3>I recently published my thoughts on 2024 and my predictions for 2025. And the thing that looms largest in my brain right now is the rise of open table formats and catalogs, and specifically Iceberg. Are you folks ready to push on this? Because we're going to need everybody implementing Iceberg everywhere.</h3><p>Yeah, 100%. Yeah, we are. It's still so hard to get access to your own data. Like, if you have data in one cloud vendor and you want to use another processing platform, you need to migrate everything to another data lake because it's a different file format. It should be easy.</p><h3>Which is like a great way to make consulting dollars, but you're not driving a tremendous amount of business value. And so that's always very uncomfortable.</h3><p>Yeah, you finalize the work and you say, like, yeah, so what is the value we got? We have to look into the numbers, and sometimes it's even more expensive.</p><p>I don't think there's a lot of value in migrations. It's not something that really drives anyone. People want to build data products.</p><p>In the next two or three years, I hope everyone will be focused on building things, products, and getting value from these data products. They will not have to change technology all the time because of a contract or a small feature that their vendor doesn't want to invest in.</p><h3>And honestly, you need architects who are able to talk to CIOs and CDOs about how to think about this architecture. Because if you've been around in the industry for 25 years, I feel like people in the industry have been burned over and over again. 
And so they often are a little bit cynical about these types of things.</h3><p>That's why we always tell clients to use dbt. You're going to do a new migration at some point; everyone does a migration. So you need to use some kind of structure that makes it easier. You need to plan for migration.</p><p>dbt is very good because it lets you isolate your logic from the technology.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Data engineering at Snowflake (w/ Rahul Jain)]]></title><description><![CDATA[A look inside at the data work happening at Snowflake]]></description><link>https://roundup.getdbt.com/p/data-engineering-at-snowflake-w-rahul</link><guid isPermaLink="false">https://roundup.getdbt.com/p/data-engineering-at-snowflake-w-rahul</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 12 Jan 2025 13:03:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/32365142-3207-4069-8b46-fd120c8a8d55_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Rahul Jain is a data engineering manager for Snowflake's internal data organization. He joins Tristan to discuss the Indian tech scene, Iceberg, streaming, AI, and how Snowflake&#8217;s data team does data work.</p><p><em>This is Season 6 of The Analytics Engineering Podcast. 
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways from this episode</h2><h3>There's this funny thing when you are the user of a software product, but you work for the company that makes that software product, then you have this dual role. You have to be a data engineering manager, but then you also have to explain and advocate for the platform that you're using. </h3><h3>How do you balance these responsibilities? Are you mostly spending your time delivering data outcomes to the business, or are you mostly spending your time on stages in front of audiences?</h3><p>That's one of the reasons I joined Snowflake. 
Before joining Snowflake, I was a Snowflake customer. My team was implementing a data platform on Snowflake. It may sound a little cliche, but when I got introduced to the Snowflake platform, it was love at first sight. The ease of use, and so many other things.</p><p>At Snowflake, my core responsibility is building data products and data-driven solutions, which help Snowflake&#8217;s internal businesses across different verticals. But additionally, one of the roles I play here is talking about use cases on the platform. I work closely with the marketing team, sales engineering team, and sales team. I give many keynotes and breakout sessions at global events, and I stay close to the developer community.</p><h3>That's explicitly a part of your role?</h3><p>That's not part of my role; my role is evolving into that. On the books, with the title I hold, it's not part of my role. But I love doing it. And the leadership here is very, very appreciative. If you are a proud user of something, whether it's a tech product or any day-to-day utility product, then you automatically try to market it. The use cases I build here at Snowflake, I just go and talk about them to the world, to the data community.</p><h3>Let's do it. Tell me, how does Snowflake do data engineering?</h3><p>First of all, before I jump into it, I just want to mention that I'm here in my personal capacity. This is not sponsored by Snowflake. Since we&#8217;re a cloud data platform, we take data very, very seriously. This is my sixth organization in the last 14 years, and it is truly a data-driven organization.</p><p>We practice data. We live and breathe data. Not only the data engineering team, but all the functions, be it sales, sales engineering, marketing, finance, workplace, every function tries to have this data-driven mindset. My team is a horizontal team within Snowflake.
And my team supports different verticals: GTM, finance, legal, and others.</p><p>We have a centralized repository of data where all the data which belongs to Snowflake comes to a single-tenant, single platform. And then based on the domain, and when I say domain, you can call it the verticals, we cater to them. Most of the time we create data models.</p><p>My team spends 80% of the time in analytics engineering creating data models, common data models, some aggregations, and then applying data quality, observability, and data governance. We then share this with these business units so that they can create their own analytics if they have their own analytics team. Or sometimes we are engaged in enriching their source system data. We reverse-ETL, or push back, this golden data to their source systems: Workday, Salesforce, ServiceNow, Jira, these kinds of sources.</p><h3>Snowflake has been using dbt for a long time. I'd be interested to hear if you feel like your use of dbt is different or novel based on your unique role in the ecosystem. I'd be curious to hear if there's other tooling that you use in your stack that's worth talking about.</h3><p>Yeah, so the stack is very big, but we&#8217;ve used dbt since the beginning, especially for data modeling and analytics engineering purposes. We are very, very satisfied users of dbt. And my team especially, they spend almost 70% to 80% of their time writing models in dbt and deploying them.</p><h3>What do you think about the talent market for dbt in India? It's still a new-ish product. There are probably deep benches of talent in India using different ways to get similar jobs done. Do you have a hard time sourcing dbt talent in India or do you think there's a lot of it there?</h3><p>As you said, India is still relying heavily on Spark, Informatica, and ETL. dbt is picking up, especially with niche companies or new-age tech companies.
But I would say I still find some difficulty sourcing talent.</p><h3>Do you assume you're going to have to train people when you bring them into your team?</h3><p>I do, but the learning curve is very, very gentle because there&#8217;s a lot of documentation available. When somebody new joins the team, there is a four-hour dbt workshop. I ask them to go through it on the first day.</p><h3>One of the interesting things about India is that often, and this is not true everywhere, people are worried about their budgets. This makes open-source tools like Spark and dbt more popular. But what's interesting is that I think it's fair to say Informatica is pretty freaking expensive. And yet there's a huge base of Informatica expertise in the country. Who is using Informatica, and how do you square these two things?</h3><p>When you talk about India's cost sensitivity, that&#8217;s true. But you need to understand that the developer community in India, most of the time, is working for global companies with head offices in the U.S., Europe, or Australia. India is not the revenue-generating entity. Thus, decisions about whether to use Informatica or dbt or Snowflake are still made at the headquarters, where the company originated. Informatica is expensive, but it is paid for by headquarters.</p><h3>Changing gears, I think that you are on record as talking about Iceberg in public settings. Iceberg is an open-table format that has kind of taken off in popularity over the past two years. </h3><h3>What do you think is driving customer interest in Iceberg?</h3><p>One is the interoperability, which feeds into the no-vendor-lock-in mindset. This is a fast-evolving ecosystem, right? If customers want to be agile, then they are looking for some middle ground where they can think about switching the platform or the processing engine they are using currently.</p><p>Open-table formats like Iceberg give you that kind of flexibility.
You can store data in the open-table format and use a processing engine like Snowflake or Databricks to process it. You save money on storage, but it may cost more overall because you need the knowledge to use it and keep it up to date.</p><h3>I think most people agree that larger companies are mostly driving the market. The main benefit they hear is that it's more flexible and they aren't stuck with just one platform.</h3><h3>There's a misconception that if you store data in the Iceberg table format, you've done it. That's what everyone's talking about. But in fact, storing data in the Iceberg format is only a part of the game. The next phase is like, well, where's your catalog?</h3><h3>This is where things get complicated. Snowflake made an announcement about Polaris at Snowflake Summit. There are internally managed catalogs, like Snowflake's managed catalog, and then there are externally managed catalogs. And I'm just curious if you could help us figure out the differences between these things and the advantages and the limitations.</h3><p>You said it right. Storing data in Iceberg format is one thing, but unless you have a catalog, you will not be able to query the latest data or keep track of the latest snapshot and the ACID properties if you want to leverage it, right? </p><p>When this Iceberg table format started getting traction, each platform, like Snowflake and Databricks, started creating its own catalog. And you need to understand what a catalog is not. A catalog does not store the actual data. It is just a pointer to the data, which is stored somewhere in the cloud in Iceberg format. A catalog is just keeping the pointer to the latest data or the latest files.</p><p>You can think of it as metadata. It keeps track of metadata. Now, where do you keep this metadata? One way of doing this is you keep this metadata with Snowflake in the Snowflake Managed Catalog.
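The "pointer" idea can be made concrete with a small sketch. This is an illustrative Python toy, not Snowflake's or Iceberg's actual implementation; the class name and paths are hypothetical:

```python
# Minimal sketch of a catalog: it stores no table data, only a pointer
# per table to the latest metadata file sitting in object storage.
# (Illustrative only; a real Iceberg catalog also handles atomic swaps,
# snapshots, and schema history.)

class TinyCatalog:
    def __init__(self):
        self._pointers = {}  # table name -> current metadata file path

    def commit(self, table, metadata_path):
        # A real catalog performs this swap atomically so readers
        # always see a consistent latest snapshot.
        self._pointers[table] = metadata_path

    def current_metadata(self, table):
        return self._pointers[table]

catalog = TinyCatalog()
catalog.commit("orders", "s3://lake/orders/metadata/v1.metadata.json")
catalog.commit("orders", "s3://lake/orders/metadata/v2.metadata.json")
print(catalog.current_metadata("orders"))  # the latest pointer wins
```

In a real Iceberg catalog that commit is an atomic compare-and-swap, which is what lets multiple engines read and write the same table safely.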
You need not worry about the UI or the console or how you and your team will view the catalog.</p><h3>And if you're using the Snowflake Managed Catalog, could you point Athena to the Snowflake Managed Catalog also, or is it just to be used by Snowflake?</h3><p>It is just to be used within Snowflake. Only the Snowflake processing engine can query the Snowflake Managed Catalog. That's why Snowflake came up with another concept of open-sourcing the catalog, or the externally managed catalog. It is currently in the incubation stage. It's called Polaris.</p><p>If you don't want to have your catalog managed by Snowflake, you can manage your own catalog. You can take that code base and create and manage your own catalog in your own infrastructure using the Polaris capabilities. But in this case, you need to take care of the wrapper, the front-end UI you want to put in front of Polaris.</p><h3>I really did appreciate the clarity that came from both Databricks and Snowflake in 2024 standing up on stage and both saying open-table formats are a big deal and we care about Iceberg. I think it's really meaningful that both CEOs got up on stage and said that. I think it's a pretty reliable indication of where the industry is going.</h3><h3>Do you have any expectations on a performance difference when people use Snowflake native storage versus Snowflake-managed Polaris?</h3><p>I think it's very obvious, right? If the data is stored inside Snowflake, native storage performance will be faster for obvious reasons. It will always be faster than data registered in an externally managed catalog and stored in the Iceberg format.</p><h3>I think of Snowflake's history with AI and LLMs as having two distinct phases. There's the pre-Sridhar phase and the post-Sridhar phase. And the post-Sridhar phase is more like Cortex. Do you think that's an appropriate way of thinking about this?</h3><p>Yeah, definitely.
Sridhar comes with a lot of experience in artificial intelligence, especially in semantic search. He's a technologist, known from his past work at Google and from his own startup.</p><p>The moment Sridhar joined Snowflake, all of a sudden Cortex came into the picture.</p><p>The core philosophy of Snowflake is simplicity, the way the platform was built. Cortex functions, whether machine-learning-powered functions or LLM functions, are so simple to use. And there is so much excitement within Snowflake about these functions. That was the shift which happened post-Sridhar, where everybody is empowered to use these large language models, not directly but in the form of SQL functions. And there is a lot of talk about how to expand that and create more.</p><h3>Since Cortex, have you seen adoption of AI in the Snowflake platform accelerate?</h3><p>100%. Sometimes I think we are doing too much inside the company. Everyone, not just the data team, but also the project management and all the other non-technical teams, can write SQL. We still have to figure out internally a lot of use cases which will impact at scale. But we are still using these Cortex functions heavily internally.</p><h3>Do you have any dbt pipelines that are just end-to-end Snowflake dynamic tables? I have not personally gone all in on dynamic tables, but I'm curious if you've pushed it hard.</h3><p>We are right now in that phase where we are moving some of the pipelines that were managed through Airflow DAGs to dynamic tables using dbt. I would say we are not completely there with end-to-end pipelines using dynamic tables inside dbt, but one of our current initiatives is migrating from those Airflow DAGs to dynamic tables in dbt itself.
Those are more from the master data management side.</p><h3>It's so interesting, with many of the things that we think about in the industry, there's an underlying consensus that we're fundamentally talking about batch. I didn't say Airflow, but you added it to the conversation. This makes sense because if you put everything in dynamic tables, then suddenly there's no orchestration.</h3><h3>Have you spent much time thinking about how you provide observability? What happens if it fails?</h3><p>Yeah, so you're right. That's why I get very practical and I tell my team the same thing. If someone is coming to you because this is new, real-time or streaming, this sounds great, right? But do you really need it? What's the end use of it? Who is consuming that data? Do they really need real-time stuff? If there is no business impact, keep things in batch mode, because you can observe them very well. And these are more stable. I will be very transparent: we have not built anything concrete to observe data processing using dynamic tables.</p><h3>I don't think you're alone there. We as an industry are still early.</h3><h3>Let me ask you the question to close out the podcast. What is something that you hope is true of the data industry over the next five years?</h3><p>Data literacy is increasing, especially with the democratization of LLMs. People are taking data seriously. And a lot of tools are evolving very fast. If you can interact with the data using copilots, that puts a lot of focus on data-driven products and the data industry. That's why I'm very, very hopeful for the next five years.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs.
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The intersection of UI, exploratory data analysis, and SQL (w/ Hamilton Ulmer)]]></title><description><![CDATA[The technologies driving data visualization today]]></description><link>https://roundup.getdbt.com/p/the-intersection-of-ui-exploratory</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-intersection-of-ui-exploratory</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 22 Dec 2024 13:03:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/826c13db-1d0c-4935-a151-634151f3a0a8_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hamilton Ulmer is working at the intersection of UI, exploratory data analysis, and SQL at MotherDuck, and he's built a long career in EDA. Hamilton and Tristan dive deep into the history of exploratory data analysis. </p><p>Even if you spend most of your time below the frontend layer of the analytics stack, it&#8217;s important to understand trends in both the practice of data visualization and the technologies that underlie that practice.</p><p>All of it deeply shapes the space that we operate in.</p><p><em>This is Season 6 of The Analytics Engineering Podcast. 
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways from this episode</h2><h3>If you're like other people who started their data science careers in the 2010s, you probably ended up doing all different parts of the analysis pipeline.  Over time, it seems like you and your career have become more focused on the data visualization part. Is that fair to say?  </h3><p>When I joined Mozilla, we had what was then considered pretty big data. It was also very complicated, messy stuff that the browser was generating. We were using a lot of that telemetry to basically calculate the numbers for the business as well. 
</p><p>Data visualization for me has always been a means to an end, and that means understanding the data that powers the business and the product. And so I think those interests are intrinsically connected. </p><p>Many data visualization and exploratory data analysis projects focus on the end of the analysis process, the presentation layer. This includes things that you might put in a slide deck. But we were having to work with extremely messy, highly nested data generated by the web browser, which possibly runs in a semi-degraded state. Not all the data that it sends is good data.</p><p>That first mile is way more interesting to me. That first mile problem, not the last mile problem, is maybe the genesis of my interest in data tools in general. This is a place where exploratory data analysis is especially valuable. But I think the way that people think about EDA is more in the middle or toward the end. That first part is really critical.</p><h3>Exploratory data analysis is used to answer the question, can I trust this data? What do I need to do to it to get it to a state where I can trust it so I know what I can expect from it?</h3><p>Absolutely. I think this is the fundamental trade-off in a lot of analytics tools. Those tools, and the libraries that we use, are often made by technical people, oftentimes people with a research background who have to do data cleaning. But the financial value comes from the end-user experience of dashboards.</p><p>And so there's always been this trade-off with EDA tools between these two things. That&#8217;s the case with Polaris, a research project out of Stanford in the early 2000s. These researchers wanted to figure out how to make EDA interactive and exploratory.</p><p>This was during a time when computers were just starting to get better at this kind of work and data was being generated.
The Polaris paper was groundbreaking for analytics, and those researchers went on to found Tableau.</p><p>And the killer use case wasn't the first mile. It was the last, because the economic buyer of the tool cared a lot about understanding what was going on with their business. So a lot of the focus for EDA tools has been on BI rather than the thing that really vexes data practitioners: cleaning the data. Everyone likes the joke that 80 percent of the job is cleaning up the data.</p><p>So if you look at a model of data work, it's largely about trying to correct problems with data collection as early as possible to figure out what you can possibly say about the business down the line. That's a high-value thing, but it's hard to sell to people. And that's why I think BI tools focus on the presentation layer.</p><h3>Can you talk a little bit about how over the last 20 or so years the data visualization industry has evolved? Are we operating at a higher level of abstraction than we used to be?</h3><p>Imagine it's the 1970s and you're a statistician doing research. You find that you can put your tables of data into the computer to combine and show them quickly. Before, you had to do it by hand. If you've ever read anything by Edward Tufte about historical data viz, people drew with pencil and paper and had to figure out where to put the points to show the aggregation, right? Really time-intensive.</p><p>John Tukey was this really famous statistician. He's the person behind exploratory data analysis. He said something that isn't controversial, which is that you should look at the data before doing a statistical analysis.
This wasn't easy to do without computers.</p><p>It was part of a movement to bring computation to statistics that eventually became the whole point of the field.</p><p>And then in the early 90s, spreadsheet software, Excel, became the most important data tool ever created. They began adding charts to their spreadsheets, and that was a great early form of data visualization, probably the most popular form, right?</p><p>You have this other cross-current here where the browser became the medium for interactive data visualization, not some desktop app that a bunch of people have to write C code for. That was probably the biggest expansion of the labor market in data visualization.</p><p>And that's where you see D3 becoming one of the most important entry points for those people to become essentially front-end engineers.</p><p>D3&#8217;s premise was that building high-level primitives for data visualization wasn't possible without a mid-level connection to the browser APIs. D3 is still widely used. It's not used in the same way as it was in the past, but it's still very widely used. I use it every day for all of the data visualization tooling I build, just because it has so many helpful things that I don't want to build myself at this point.</p><p>So the web became the important medium for data visualization, and really for analytics tools as well. Most of the BI tools moved to being web-based in some way. And so then you had this other cross-current in the last 20 years, which is the tech boom, internet companies, things like that.</p><p>You had a huge influx of technical PhDs in the industry. You had all of these people bringing their analysis tools that they used in their research. The scientific Python computing stack was something that you might've toyed with in grad school and used for your research. And now you're bringing it to your job because it's a tool you know. R is another example of this.
This is sort of where I enter the story: as a statistician with R.</p><h2>My guess is that probably everybody listening here is familiar with the name DuckDB, but you should probably do a little bit of an overview.</h2><p>DuckDB is an in-process database. The closest analogy would be something like SQLite, which I don't think is a fair comparison because DuckDB does so much more. SQLite is this tiny transactional database that's the most important piece of software ever made. I don't think it's a stretch to say that our lives are powered by thousands of SQLite databases on all of our devices.</p><p>Our browsers all have individual SQLite databases powering them. It's an incredible thing when you can just have your database as a file somewhere and then whatever process can just query that directly rather than having a database run on its own independent server. And so DuckDB is like that, but for analytical queries, not transactional ones, the kinds of queries that your audience is quite familiar with.</p><p>The project started in the late 2010s with Hannes Muehleisen and Mark Raasveldt, two researchers at a research institution in Amsterdam called CWI, which previously was probably best known for being the place where Python was invented.</p><p>The influence for DuckDB was the workloads that PhD data scientists were bringing to industry. Crunching down CSV files and Parquet files has always been a bit of a challenge. Mark and Hannes realized they could build a database to make data analysis easier.
And so that was sort of the genesis of DuckDB.</p><p>But as they began working on it and applying some of the most cutting-edge ideas in analytical databases to the project, they began to realize it could do more and more.</p><p>And that's why I joined MotherDuck: because I'm part of this movement of people that care about data visualization, have discovered DuckDB, and really can't look back.</p><p>That divide we were talking about, the front end being all JavaScript and the back end being who knows what. If you can make that back end DuckDB, you can do incredible things with it and actually determine the future.</p><p>The queries you need to run on the front end effortlessly update your UIs. And so it reduces the latency of interactions for really complex things.</p><h2>We talked about the difference between the BI world and scientific computing world before. One of the interesting differences there is that scientific computing does not typically speak SQL. Is DuckDB capable of doing some of these scientific computing functions or does it not need to? What's changed?</h2><p>In 2015 there were people that would scoff at the idea of writing SQL; maybe they adopted BigQuery and discovered writing SQL actually wasn't a big deal.</p><p>There was a period, especially in the 2010s, where people weren't sure if SQL was going to survive. The environment was different then, but one thing that ended up happening was more stuff moved to SQL rather than less.</p><p>I was at Mozilla when we adopted BigQuery. It was like a breath of fresh air: being able to write SQL, a query that I can analytically verify myself, and just have it do the thing was really special.</p><p>Industry has moved more towards SQL. That said, DataFrames are ergonomically quite amazing. The R ecosystem with dplyr for data transformation, especially, is a really elegant, nice way to work with data.
You can use dplyr with databases like DuckDB, and it will write the DuckDB query for you, which is really nice.</p><h2>I really desperately love dplyr. It is maybe my favorite of all R packages. I feel like sometimes there's this religious war between SQL people and not-SQL people.</h2><p>What's interesting about this too is a lot of these analytical SQL dialects are starting to also address some of the same types of problems. I mentioned BigQuery before: writing BigQuery SQL and working with arrays in BigQuery made it easier to do some of those hard data manipulation things in SQL.</p><p>And looking at DuckDB, something that Mark and Hannes care a lot about is the ergonomics of writing SQL. So they've made their own extensions to SQL to make it easier to do the sorts of things that you would see in dplyr. SQL is complicated because it's kind of an ancient programming language that has stood the test of time. It's not Latin. It's something else.</p><p>The spec for SQL is like thousands of pages, and there's not a single database on the planet that actually implements all of the spec. It's one of these bizarre situations we're in where the dialects differ in some critical ways.</p><p>If you know one dialect of German, you can kind of speak with people who use another. It's similar with SQL. This lack of control over the language itself does two things. One, it frustrates everyone, because it's much easier to go to R or Python where everything is well defined and the grammar is small. There aren't thousands and thousands of keywords to implement in these languages.</p><p>But also, it's an area of innovation and opportunity for other database engines, and I think the DuckDB creators have seen that. And so things like list comprehensions, which are so useful in Python, you can actually do in DuckDB SQL.
Function chaining you can also do, in DuckDB SQL and in the DuckDB Python API.</p><p>It's an interesting area of innovation, and I think it also upsets a number of people as well.</p><p>There are people who think SQL needs to be kind of this boring thing that everyone knows, and to stop innovating. I'm much more of an experimentalist. Given that there is no actual SQL standard beyond the bare minimum that everyone implements, why not innovate?</p><h2>Can you talk a little bit about how it is possible that DuckDB runs locally? Why can't I run Snowflake locally?</h2><p>To understand this, you have to understand a bit of the historical trends in the tech industry around the concept of big data. So 15 years ago, tech companies began generating large amounts of data as they had users use their applications.</p><p>Facebook's a great example. The amount of data that needed to be processed in order for you to understand it was much larger than what one computer on its own could handle. You could go buy a desktop tower, put it on your desk, get all the data on it and attempt to do something with it, but it was going to be extremely slow, right?</p><p>The disk space wasn't big enough. You didn't have enough memory to do interesting things. Computers weren't good enough. Computers have recently become good enough, I think. And that's the TLDR.</p><p>But because they weren't good enough at that time, we began to see great research out of Google around MapReduce, about splitting up computation across a bunch of rented computers in the cloud, a bunch of cheap, low-powered machines. Fanning out the computation to a bunch of machines that weren't on your computer was the way that you could actually do anything meaningful.</p><p>And that idea took hold in industry. As more companies became data-driven, this was the way to do it. And this is actually part of the core thesis of MotherDuck. It's 2024 now, and I'm using a modern MacBook Pro.
When I switched from SQLite to DuckDB, querying a really large, 10-gigabyte dataset was instantaneous.</p><p>The first time that I ran a query in DuckDB, I thought, is it broken? I don't think this is possible. MotherDuck&#8217;s core thesis is that our computers have actually gotten fast enough to handle those workloads that back in 2011 we could not have run on a desktop tower. And that's really magical.</p><p>Computers have gotten good enough to do what was previously considered big data. The definition of big data has shifted over time to be increasingly larger. And so the things that were big data when Snowflake and BigQuery were created might not necessarily be big data today.</p><p>They might be something that you could process on one really beefy computer that you rent from AWS. And so I think that's why DuckDB is becoming really popular, by the way: because computers have caught up.</p><h2>There is an architecture, the MPP (Massively Parallel Processing) architecture, that initially became popular in the 2000s. You could reasonably say that Snowflake and BigQuery are also versions of an MPP architecture, certainly evolved from earlier versions, but there's probably some overhead involved. But all of a sudden that can be faster, and it can also be distributed to your local machine. Am I getting that right?</h2><p>I think that's the most concise story for DuckDB and why it's so successful. The fact that it can run in a process anywhere means it's running everywhere. Companies are adopting DuckDB as point solutions all over the place.</p><p>And you don't hear about it all the time, but it is absolutely happening. So it may not be the total solution for their data today, but it's oftentimes the best solution for individual pieces. I think I saw somebody joke that DuckDB may single-handedly prevent global warming caused by the JVM.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs.
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p>]]></content:encoded></item><item><title><![CDATA[Making data movement as reliable as electricity (w/ Taylor Brown)]]></title><description><![CDATA[Iceberg, unstructured data, and the data infrastructure needed for AI, with Fivetran's cofounder and COO Taylor Brown]]></description><link>https://roundup.getdbt.com/p/making-data-movement-as-reliable</link><guid isPermaLink="false">https://roundup.getdbt.com/p/making-data-movement-as-reliable</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Dec 2024 13:03:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ea9d80a5-eea4-468f-8402-f3ae98ab3802_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fivetran recently passed $300 million ARR and has over 7,000 customers globally. Taylor Brown, the cofounder and COO of Fivetran, joins the show to talk about Fivetran&#8217;s moat, the impact of AI on the data ingestion space, and open table formats and catalogs. </p><p><em>This is Season 6 of The Analytics Engineering Podcast. Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>We need you&#8212;yes, YOU&#8212;to take <strong>this year&#8217;s <a href="https://docs.google.com/forms/d/e/1FAIpQLSe8hp6evq_Qr78b2gQZsZAxXTTI-y5sPhwtgyj4594oWESFbQ/viewform">State of Analytics Engineering Survey</a></strong>. 
The findings here guide product development and help us all understand where data teams are going.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://docs.google.com/forms/d/e/1FAIpQLSe8hp6evq_Qr78b2gQZsZAxXTTI-y5sPhwtgyj4594oWESFbQ/viewform" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3egZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 424w, https://substackcdn.com/image/fetch/$s_!3egZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 848w, https://substackcdn.com/image/fetch/$s_!3egZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!3egZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3egZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png" width="1456" height="764" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:764,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://docs.google.com/forms/d/e/1FAIpQLSe8hp6evq_Qr78b2gQZsZAxXTTI-y5sPhwtgyj4594oWESFbQ/viewform&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3egZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 424w, https://substackcdn.com/image/fetch/$s_!3egZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 848w, https://substackcdn.com/image/fetch/$s_!3egZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 1272w, https://substackcdn.com/image/fetch/$s_!3egZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda42368b-6bb0-4f7d-bbc9-c25f77033e66_2400x1260.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Results from last year&#8217;s State of Analytics Engineering survey</figcaption></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a 
href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways from this episode</h2><h3>The Fivetran mission is to make data movement as reliable as electricity. Is that right?</h3><p>Taylor Brown: The thinking behind that mission statement is that when we think about Thomas Edison and what he did, he spent all this energy bringing electricity into the house to power light bulbs.</p><p>And then what happened after they had electricity in the house was an explosion of additional innovations&#8212;hairdryers and washing machines and all the electronics. One of the biggest challenges to innovation is just access to data. And BI, as we thought about it, was really the light bulb of the modern data stack.</p><p>I think especially with AI, the innovation set to come is still to be defined. We're at that light bulb moment still.</p><h3>You've got to have the use case that drives the original infrastructure, but then who the hell knows what the infrastructure is going to be used for next.</h3><p>Exactly, exactly. In the last 10 years, it's been a lot of light-bulb kind of BI stuff. And the last two years have been this new, fun, more exciting innovation around AI, which makes my life more fun. And, you know, I think what we're doing is more interesting.</p><h3><strong>You guys are big time. Where's the business? You guys have hit some milestones recently.</strong></h3><p>We recently passed 300 million in ARR. 
We have over 7,000 customers now globally. And we're growing at a great clip right now. The last two years were maybe not the best years for Fivetran. It was just a challenging time in the market. And I think there were a lot of folks who pulled back on any sort of innovation.</p><p>We've seen a resurgence of growth for ourselves this year, which has been great. A lot of folks are really starting to feel more confident in the market, which ultimately ends up in more spend on innovation, which means we all see more investment in data.</p><h3>One of the interesting things about this space that you're in is that everybody thinks that they can build data pipelines. And in an environment where somebody up the org chart is looking to save money, there's probably somebody lower down in the org chart that says, &#8220;Screw it. I'll roll that myself.&#8221; </h3><h3>Is that a conversation that you've had over the years?</h3><p>It's a conversation we've had and a conversation we continue to have. Especially when you think about the modern data stack, where you do more ELT: extracting the data and loading it, with a small amount of transformation, directly into your cloud data warehouse. I think a lot of folks that are at senior levels at organizations say, &#8220;Hey, you're not even doing the hard part. You're not doing the transformation part. Why would I ever use a tool for that?&#8221; </p><p>When you get into the details, there's a lot of complexity, as you pointed out earlier, to moving data effectively, doing it accurately, doing it at scale, making sure that you don't miss any data, doing it incrementally instead of in batch. 
There's all this complexity that we put into making it so that we can have this highly reliable replication and copy of your data within the warehouse.</p><p>That's a challenge that we have to face with buyers on a constant basis to help them understand why this is cheaper, better, faster than having their own team build it, where the quality can be all over the place. A lot of times engineers don't really want to do this. They see this as a shitty job for them to do.</p><h3>Moving data from point A to point B isn&#8217;t how you build a career.</h3><p>It's a demotion, really. We&#8217;ve got these pipelines and need you to go do this instead of doing the really important mission-critical stuff over here.</p><h3>Do you have types of metrics that you show?</h3><p>We have a lot of metrics that we show. We have uptime metrics. We're working on a bunch of latency metrics right now. I&#8217;d say for your average company, we can probably do it better and faster than you can.</p><h3>But sometimes that one data engineer doesn't want to hear that.</h3><p>One hundred percent. There are two aspects of this. For the really big companies like Facebook, data really is their business. Building the infrastructure around it is their business as well. If you're outside that, in the build-versus-buy scenario, you're going to end up with something that's better, faster, more reliable, because you have this crowdsource effect. We have 7,000-plus customers who are using the same infrastructure. We're able to really battle-test it over a large number of customers and an even larger number of connectors. And we're going to catch all those edge cases, right?</p><p>The flip side of that is that you have some engineers who think they can still do it better or faster, or just want to have control over the overall pipeline. There's always going to be preference in any stack. 
And so we certainly see that, but we try to point to more of the objective outcomes that folks see when they set up Fivetran versus building it themselves. What ends up winning ultimately is that customers just try out Fivetran and see how great it is compared to building it.</p><h3>There are two different motions by which dbt is brought into an organization. </h3><h3>One of them is that somebody in the central IT org gets religion and then they push it out to the business units. </h3><h3>And the other way is that the central IT org is on some different version of the world and they don't get religion, and yet one of the business units does. Then they adopt dbt and, like shadow IT, are constantly trying to convince the central team to support this. </h3><h3>Do you see this?</h3><p>We definitely see that same pattern. Most of these large organizations have a central data warehousing approach. They try to have a centralized approach towards data integration. When we see that, we typically have to go through central IT. There are times where you have a central IT team that has built something, but then as you said, you have a separate team&#8212;often the marketing team&#8212;with their own warehouse doing their own thing because they are moving so quickly. We get success there and we move our way into helping the central IT teams.</p><p>Every company has a different pattern for adoption, and we try to fit into as many different ways as we can. But since we touch a lot of the critical infrastructure for them, it's much harder to do shadow IT for that. There's just so much oversight on making sure that data is secure, protected, following governance, all that kind of stuff.</p><h3>You have 500 connectors now. Do you make money from connectors number 21 through 500, or is it important to say that you have a thousand connectors in a year?</h3><p>We do make money on the last 200 connectors we've added. Now the amount of money we make per connection is certainly lower. 
A lot of the systems of record for these older companies are on-premises systems. For them, a lot of the core information that they need to put into their cloud data warehouse comes from those particular systems.</p><p>More and more of these companies are starting to adopt additional cloud systems around that. Maybe they'll add Workday or they have Salesforce. They'll add Jira and they start to add some other cloud systems and those also have value to them. They may not be quite as valuable as the bedrock of data that they have in these older systems. I think the newer companies don't have that on-premises system-of-record problem, because everything they do is in some sort of cloud system.</p><p>At Fivetran, for example, almost everything is in a cloud system that we decided to buy and that we run everything on top of. For us, and for a lot of our cloud-native customers, it is spread across multiple different sources. We need to have every one of those different connections. And so that's where it's valuable for a customer to have a single platform through which they're getting all of their data.</p><h3>Even if they theoretically could buy three different tools and combine the list of connectors together, folks really want to buy one data ingestion tool, right?</h3><p>If you think about it, having three different tools means having three different support systems, having three different account managers, and you need to train the team three different times on each of those things. And so if you can pick a single tool and be a standard across the organization, it just makes it a whole lot easier.</p><p>We have a new SDK coming out that&#8217;s in private preview right now for building custom connectors. Because one of the challenges we've faced is even though we have 600 connectors and we're building 100 or 200 connectors a year, there's just endlessly more. There's 6,000-plus SaaS connections. 
There's something close to 30,000 APIs available across different business-to-business applications. We're never going to get to 30,000, but if our customers have a platform that they can build on top of that has 70% of the replication built in and the core functionality is there, that's where I think we start to really help our customers use us as the single platform for all their data movement.</p><h3>A lot of smart people over the years have said ingestion is a commodity. But you folks are empirically proving that there is a really good, defensible business to be built in this category. How do you think about the Fivetran moat?</h3><p>It's a question we've talked about for years, and certainly a lot of our early investor conversations asked this same question. Like, is this defensible? Right? Why doesn't one of the hyperscalers build all of the same stuff?</p><p>Our hypothesis, which I think has turned out to be true, was that while yes, building these connections in theory is easy, in practice there are 10,000 edge cases that happen, and you only really get to a hardened state over many years, with multiple different customers using the same code base, where you run into all of these edge cases.</p><p>And so the defensibility is really time and bug fixes over a long period of time against the same code base.</p><p>The type of problems that we run into versus the type of problems that say an Amazon might be focused on is that an Amazon engineer is focused on the kingdom that they're building within, which is their own kingdom. Ours is completely focused on everything outside of our kingdom. We don't control anything. We're just dealing with APIs and databases and all the other things that we don't own.</p><p>And so it's a very different problem, and it takes a very different skillset and a very different group of engineers. And that's what we've optimized heavily on over the last 12 years. 
It's a combination of who's on the team and then also just never-ending bug fixing. When you set up a Fivetran connector, it's been hardened by thousands of customers and it's going to work.</p><h3>George wrote a blog post called, &#8220;<a href="https://www.fivetran.com/blog/how-do-people-use-snowflake-and-redshift">How Do People Use Snowflake and Redshift?</a>&#8221; </h3><h3>It posited things like maybe we don't need to use massively parallel processing (MPP) engines for everything. And maybe vendors will supply their own compute for the workloads that they're responsible for. Have you guys gotten any blowback from this?</h3><p>So far, we haven&#8217;t gotten that much blowback from it. George is obviously extremely bright and he has an insatiable appetite for reading and thinking about technology. I think the combination of both of those ends up leading to him being quite a visionary thinker. He thinks a lot about the data space, all the way down to the database level.</p><p>There are a lot of people who are like, &#8220;Hey, we need to use a cloud data warehouse because we have so much data.&#8221; But when you look at the actual data and the amount of data on average being run in Redshift, for example, it's not that big.</p><p>Our laptops have improved significantly over the last five to ten years. They can probably run a lot of this compute at the same or faster speed at no cost, right? And so these are controversial observations because they go the opposite direction of what we've been saying for a long time.</p><p>Ultimately, we care about what is right for customers and where the industry is going. And if something's right for our customers, even if we don't want it to happen, it's going to happen. And so it's better to just face reality and figure out how to live within this new world. Data lakes are a great example of what he was talking about in that blog post. 
He said you can use your own laptop instead of using a cloud-based warehouse.</p><h3>I wanted to use the blog post to talk about Fivetran Data Lake Service. Can you tell listeners what that is and how it&#8217;s different from the way that Fivetran worked in the past?</h3><p>When we first started, it was integration with Redshift. We&#8217;d just take your Salesforce data and put it into Redshift in a very automated way. This includes the first sync of data, creating the tables, putting it all into your warehouse, and updating all that data. We effectively own that first layer of data within, say, Redshift.</p><p>And then Snowflake came along. The big innovation there was the separation of compute and storage, where you have this elastic ability to grow both compute and storage within the cloud. And that was really the advent of the modern cloud data warehouse. I think that&#8217;s 100 times better than the previous version, which was the on-premises data warehouse.</p><p>Many customers want to be able to use their own S3. They don't want to have to take all the data in S3 and put it into Snowflake's S3, then create on top of it. A lot of customers and people have been thinking about this for a fair amount of time.</p><p>There was a previous version of just loading it into S3, which I&#8217;d call data lake version one. This version was more like a data swamp, where you just put a lot of data in and then you spend all your time trying to understand what data is in there and changing it to make it logical. The next version of this came through open table formats like Iceberg and Delta. These formats take the organizational style that you get in a data warehouse and use it in a data lake. And you have DDL statements, updates, inserts; it's organized in a logical way. So you get the best of both worlds: an organized data warehouse within your data lake.</p><p>And then you can put different query engines on top of that. 
A few things had to happen for this evolution to occur. There were a lot of large customers who had a ton of data within data lakes who wanted to access it within downstream warehouses but didn&#8217;t want to move the data. They were already paying for the storage once; they didn't want to pay for it again.</p><p>That customer-first approach really pushed data warehouses to start to support this concept. And I think that also drove the innovation from the open-source Iceberg community to build these capabilities and for folks to start to adopt them.</p><p>All of these things have come together in the last year. Now customers can load data directly into Iceberg in S3, and Fivetran Data Lake Service effectively does that for them. So instead of loading into Redshift or Snowflake or Databricks directly, we can load to a customer's Iceberg instance.</p><h3>And this all relies on an open catalog, right? Are you folks using a particular catalog to support this?</h3><p>Yes, that&#8217;s a big part of it. Once the data is in the lake, the question is, well, how do you query it within Databricks, Starburst, Athena, or Redshift? You need to understand the actual metadata there. And so that's where these open-source catalogs have come out. Polaris is one of them. We&#8217;ll also support Unity from Databricks.</p><p>I believe this will become the postmodern data stack, or the modern data lake stack, or something that everyone moves to over the next few years. But I think there's still a lot to be figured out around how to make this more of a turnkey offering for the ecosystem.</p><h3>Is it your experience that there are more data leaders who are Iceberg and Delta curious than those who are using it in production today?</h3><p>Yeah, we're at the early-adopter phase right now, with the folks who drove the initial innovation. 
We are seeing a fair amount of folks using this service within Fivetran, but it's not everyone yet. I think part of that is because many people are not ready to use new technology right away. They will wait a while and then use it.</p><p>In the conversations I've had this year with data leaders, they're all thinking about it. One reason is that they want to be able to use data for many different things after they move it to a certain place.</p><p>There are also some costs to this. It might be cheaper for them to just load it into their own S3 bucket using cheap compute, rather than loading it directly into a warehouse.</p><h3>I really agree with what you're saying on the turnkey part of this. If you are a data engineer and you try to roll out your own Iceberg support today, it&#8217;s really non-trivial. </h3><h3>We were able to ship some dbt functionality at Coalesce 2024 where you just flip a flag and all of a sudden your model outputs to Iceberg. I think that&#8217;s the type of stuff that's gonna have to happen across the ecosystem to make this widely adopted, which I'm very excited about.</h3><p>Yeah, totally. It&#8217;s very hard to roll it yourself. I mean, it's very hard on the ingestion side and then it's very hard on the bronze, silver, gold side. There's still a lot of pieces. I think what we've done helps the first part of it. What you've done helps the second part of it. There's still more around the catalogs and all of that. I think it&#8217;ll come together and it&#8217;ll be exciting, but it's still somewhat early days.</p><h3>Snowflake popularized the notion of separation of storage and compute. I think about this as the separation of compute and compute. Multi-modality was never really a thing. You had to pick an engine and go all in on it because otherwise you were moving data around all over the place. And that's just not the case anymore.</h3><p>Yeah, totally. 
In one sense, it's interesting because you'd say, well, this is probably worse for warehouses like Snowflake because they're getting less lock-in, right? At the same time, I think it's better in a way, because customers don't necessarily want to.</p><h3>Yeah, make that case to me. I can't see it.</h3><p>The customer wants to have all the data within their own data lake. It&#8217;ll force companies like Snowflake to innovate a lot and continue to drive customer value in the things that customers really care about.</p><p>From what I can tell, Snowflake is doing all the right things, focusing a lot on the AI layer with Cortex and building out the key functionality that customers want. If they do this right, they'll get more jobs over time. Many customers already have their own data lake strategy and have asked Snowflake to help them query a lot of data.</p><p>And so you forgo the old-world lock-in for a new world where you compete on the things that customers really care about. And that's what makes a business much more lasting.</p><h3>Let&#8217;s pivot to AI. If Fivetran is now landing data in a data lake, do you have any visibility into what people do with it? Are you able to observe folks using this in AI workloads?</h3><p>Only through talking with them. Using the Edison analogy, we don't know what they're plugging into their outlets. We just know they're using energy; they're using the data that we're moving through it. The AI industry for B2B is still pretty early.</p><p>Early on, we were building an internal chat bot. Let's make it super easy. Let's pull the data from all the different sources that we have, like Slack, our internal wiki, our docs, our email, and a bunch of other places. And let's just pull those in together and then make those available. We started talking to a couple of different vendors, and the vendors asked us to send all our data in a CSV. 
And we're like, &#8220;What do you mean, send you our data in a CSV?&#8221;</p><p>We were just so surprised; it sounded very similar to the early days of BI. We've found that many people thought AI was its own industry and the infrastructure for it was its own industry. The way we think about it is that your BI stack and your overall data platform are the foundation that you build your AI on top of.</p><p>Now we have a lot of customers who have been successful in building out various AI platforms on top of the data that Fivetran delivers. And that is where I think things really start to get interesting. That&#8217;s when companies really think about it as a singular platform, like what we did for our internal chatbot.</p><p>I think where a lot of people are going sideways is that they're not thinking about reliable access to their own data. The difference between what companies can do within OpenAI and what they can build with their own data is that their own data gives them a competitive advantage. That's the thing that only they can access. A lot of folks aren&#8217;t thinking about it at that level yet.</p><h3>We have not yet unlocked enough downstream use cases to make the infrastructure that both of us are powering have the level of attention on it that it needs to get to the 11 nines of reliability that S3 promises. </h3><h3>One of the things that's exciting to me about AI is that it is going to drive a lot more attention onto the quality of the infrastructure that Fivetran and dbt are providing.</h3><p>I completely agree. Again, I think we&#8217;re still in early days where folks are still tinkering with it. Folks are investing a lot but haven't had real gains from it. And I think once it starts to get more traction over the next year or so with the actual applications that companies are building on top of data with AI, that's when the pressure starts to build around the infrastructure underneath it. 
And that's where it really starts to harden. I'm just not sure we're there yet.</p><p>And I think that's where we are seeing folks who are building on top of Fivetran infrastructure being successful with this. I can't talk a whole lot about it, but OpenAI is building on top of Fivetran. That's a pretty good AI use case. And now there's a lot of other companies as well.</p><p>It comes back to, as you said, it has to be reliable, it has to work, it has to scale out.</p><h3>We&#8217;ll often get asked about unstructured data when we're in conversations with folks on the topic of dbt and AI. My answer is generally no. People aren&#8217;t transforming data from customer call WAV files or reading PDFs. Are you playing in the unstructured data world?</h3><p>We're starting to. We just recently added support for PDFs. A lot of folks had a massive SharePoint with tons of emails. And that's the first step into it.</p><p>AI allows you to make unstructured data more structured. You can take all of this data from your email, for example, that's quite valuable to you, and apply some of the same concepts we've applied successfully in BI.</p><p>When you apply the right embedding and model on top of this unstructured data, then you can do a whole lot more with it. We're transcribing a lot of our sales calls into text to see what we can learn. And then those are fed into our internal chat bot, which then helps us train and helps our internal team ask questions like, &#8220;How does &#8221;</p><h3>I&#8217;d bucket the things that we've talked about so far as Fivetran for AI, but there's this whole other bucket of AI for Fivetran. How&#8217;s Fivetran&#8217;s product going to change as a result of AI?</h3><p>Yeah, it's funny, because when AI really started to take off, we sat down with our CTO Meel Velliste, who's very smart, with a PhD in machine learning. 
And we said, &#8220;If AI is going to put us out of business, let's be the first to do it.&#8221;</p><p>We built an AI app that we can point at APIs. It will read the documentation and make a full application or a full connector for us. And then we have a human, mostly an analyst rather than an engineer, who reviews and tweaks it. That's how we're building so many of these long-tail connectors.</p><h3>So I thought this was a cool new idea for your product roadmap, but you did this a year and a half ago.</h3><p>Another one was looking at the logs for errors. You can imagine that across 600 different connectors, you get tons of different types of error messages for all different kinds of things. And so it was really hard to surface those errors appropriately to customers within our UI when something went wrong. A lot of them were very unhelpful. A lot of them, our customers couldn't do anything about. And so we needed to surface those errors to Fivetran internally versus externally. And this has been a hard challenge for many years.</p><p>We built an AI app on top of all of our logs that goes through them, breaking them down into 51 different types, where we had 350 before, a lot of them duplicates. It&#8217;s been hugely helpful for us to debug what's going on and make sure our support team is jumping on the right things.</p><p>Humans are really good at fixing things if they know what the problem is. And machines are really good at scanning through tons of data and understanding the patterns and what's happening.</p><p>I think a lot more of that will continue to happen, especially as we add more and more data sources. And with more complexity, we're really focused on making latency as short as possible.</p><h3>You folks have been on the record over the years as being a little contrarian on streaming. Streaming often gets a lot of attention. 
There&#8217;s a lot of hype that faster is always better, but there's been some scrutiny around that too. What are you folks seeing now that's making you pay more attention to latency?</h3><p>In general, 5-10% of organizations need streaming data that is truly real-time. There's some workload or on-the-floor dashboard that folks are looking at in their manufacturing plant or whatever.</p><p>But a lot of times, executives across organizations will say they need real time. But what is the actual outcome of this? The problem with real-time streaming is that there's a really high cost. It's a ton more data. There's a lot more tooling you have to build. It's a lot more complicated. Generally, what we found is that for 90% or more of cases, micro-batches work quite well, down to 15 minutes or even one minute.</p><p>Now we're realizing that if we can get down to five-second latencies, that may remove the need for streaming in many cases. Streaming may become 1% of your overall use cases. Customers generally want things to be faster. We're doing the hard work to get us there.</p><h3>What&#8217;s something that you hope is true of the data ecosystem over the coming five years?</h3><p>I hope that the data lake ecosystem turns into the core ecosystem that people are building on top of. I think it&#8217;d be better for our customers ultimately. It provides a lot of optionality, and obviously the tooling all has to build around it as well. In five years, that's what I&#8217;d hope for.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Data as an assembly line (w/ Cedric Chin)]]></title><description><![CDATA[Cedric Chin runs Commoncog and Xmrit, a free tool to create and share XmR charts.]]></description><link>https://roundup.getdbt.com/p/data-as-an-assembly-line-w-cedric</link><guid isPermaLink="false">https://roundup.getdbt.com/p/data-as-an-assembly-line-w-cedric</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 17 Nov 2024 13:03:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e0531aeb-9797-4dd0-ae58-c522cfb9a920_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Cedric Chin runs Commoncog&#8212;a publication about accelerating business expertise. He joins Tristan to talk about the analytics development lifecycle, how organizations value (or misvalue) data, and why &#8220;data teams are not some IT helpdesk to be ignored.&#8221;  </p><p><em>This is Season 6 of The Analytics Engineering Podcast. 
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways from this episode</h2><h3>So the thing that made me want to reach out to you and have this conversation is that I <a href="https://roundup.getdbt.com/p/the-analytics-development-lifecycle">published a precursor</a> to the full <a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle">analytics development lifecycle (ADLC) post</a> that I put out. And in it, I talk about how the role of data analysts is to do insight generation. 
And I made the statement that analytics isn't fundamentally an assembly line.</h3><h3><strong>And it was around, really, how we should perceive the value that's created by data analysts and by data practitioners more generally, and how we should think about the value that data generates. I was representing what's probably almost conventional wisdom.</strong></h3><h3><strong>But your pushback on this topic was: what if we do conceptualize data more as an assembly line? Why don't you try to frame what you see as the value that data can provide to organizations when it's living up to its fullest potential.</strong></h3><p><strong>Cedric Chin:</strong> Maybe I should start with the conventional view of data in the data world. Data professionals I talk to say they answer questions and generate insights: the business person asks a question, and then the data professional gives them some answer, sometimes after very hard work of figuring out correlations, relationships, whatever.</p><p>And then the business person (and this is a common experience that I've heard from every data professional) goes, &#8220;Okay, cool,&#8221; and does nothing with it. Not always; sometimes it really does lead to impactful stuff. But a lot of times it turns out that the business person carelessly asked the data person or the data team a question that doesn't actually matter to the business. They're just curious. 
And they don't know that you, as the data professional, have to go through hell and high water to get the answer to that question.</p><p>A data professional told me, after reading a whole bunch of stuff that I covered (after I'd gone through <a href="https://commoncog.com/the-amazon-weekly-business-review/">not just what Amazon did, but also other companies like Amazon who use these ideas and how they actually use data</a>), that it's almost as if the problem with most businesses and most business people is that there are too many questions they could be asking. There's no way to narrow down to the set of questions that really matter in the business. And because you can't differentiate between what's a good question and what's a bad question, the data team is like this service desk that is overwhelmed with questions.</p><p>It turns out that really good data-driven companies of the type that I dug into use what I'm calling process control, or statistical process control: the set of ideas that was used to ramp up production for World War II, then to transform industrial Japan post-World War II, and that led to the creation of the Toyota production system. The basic idea is just: how does your business work? Your business is a system. It's a process. It has some inputs. It has some outputs. And you can figure out what those inputs and outputs are.</p><p>Amazon has a framework that is much simpler to understand, which is controllable input metrics and output metrics. Obviously, the ultimate output metrics you care about in the company are financials: profit, revenue, free cash flow. But there's a complex set of inputs and intermediate inputs as well that lead out the other end. And there's some lag from the inputs to the outputs. 
Your executives running the company, running this complex machine, need to have an idea of how the inputs flow through to the outputs.</p><p>Now, given that frame, how do you actually piece together the causal model of your business? The core thing you have to grapple with is that it's not that easy to say, &#8220;OK, we've driven these inputs, and then we get some outputs out the other end.&#8221;</p><p>The problem that we have to deal with as business people is variation. Most people can't deal with variation. Variation just means that something wiggles. We know that when we step on a weighing scale, our weight wiggles. It doesn't just stay the same. But it's even worse in business. Metrics wiggle; sales metrics can wiggle upwards of 40 or 50%. And that's perfectly normal. Nothing else has happened.</p><p>So if you have a way of differentiating between a real wiggle, which you should worry about and investigate, and routine variation, which is a normal wiggle you can ignore, you have a way of separating signal from noise. You have a way of finding the things that actually move the needle on the metrics or outcomes that you care about.</p><p>If you're able to differentiate between routine variation and exceptional variation, you unlock the most common of human learning loops, which is trial and error. And that results in a way of using data and asking questions of a data team that is very different from a company that doesn't have this weekly process of figuring out how the business actually works.</p><h3><strong>I follow along with the fundamental insight here, which is that what we should be doing is building a causal model for our organization. 
And that requires defining key metrics and understanding the relationships between the metrics.</strong></h3><h3><strong>And then having a reasonable statistical view of what's noise and what's actual variation, and then making decisions based on that. But obviously the devil is in the details. Does this model of the world always work, or did Amazon happen to have a perfect business model where they could just observe the inputs and outputs and draw causal relationships really effectively?</strong></h3><p>Yes, it works. But there are certain things that are harder to measure using this methodology than others. Marketing attribution is not solved by this method. It's really, really hard. If you look at early Amazon, they defaulted to things where they could actually tell there was an impact, even if the impact was vague.</p><p>They didn't do TV ads. They did them for a while, and then they realized: we can't really connect controllable inputs to outputs here. We can control the input, the amount of money we spend on TV ads, but we can't tell the output. So let's stop. So what did they do? They defaulted to affiliate marketing.</p><p>This was affiliate marketing in the 1990s. You can imagine how bad that was. And they could sort of model that behavior. The other thing they could do was model word-of-mouth growth. They could say that within a certain number of months, a certain percentage of the people who made their first purchase in a given month would become steady customers. They didn't even have the term &#8220;cohort&#8221;; they called them &#8220;vintages&#8221;.</p><p>So by the fourth vintage, it would be something like 70% that becomes a steady-state number of customers. 
It is possible to figure this out.</p><p>One of the nice things about running Commoncog is that it's a dinky little business, and it doesn't really matter if I tell you about how my business works, so I can tell you my metrics.</p><p>We wanted to figure out how the newsletter grows. The first step is just to measure. And then every week, take a look at your WBR (weekly business review). We do a full Amazon-style WBR. You go take a look and figure out what routine variation looks like, so you develop a fingertip feel for what normal looks like.</p><p>And what are some of the controllable inputs that you can think about? Well, LinkedIn posts or Twitter posts. And it turns out that there's a linear relationship: if I ramp up my Twitter posting, there's an increase in visits to the site from Twitter. But there's no change in new starter signups. Similarly for LinkedIn. Maybe some of them do sign up, but it's not exceptional variation. It's not special variation. One day in October last year, we saw exceptional variation in in-depth readers, which is people who read more than one page, and in unique new starter signups. We were like, holy shit, what happened?</p><p>And it turned out that a 20,000-subscriber investing Substack had linked to Commoncog. Now, that's interesting. And it turned out, by the way, that the author of that Substack had tweeted just one week before, with no exceptional variation on any of our metrics. This person had 70,000 followers on Twitter, but didn't budge our metrics at all. So what's the obvious thing? The obvious thing is: OK, we need to go run an experiment.</p><p>What if, in terms of the bang for the buck for our effort, it makes more sense, instead of posting on social media or hiring someone to post on social media, to go and hunt down Substacks and try to get them to link to us? It could be by buying ads, and we have to experiment with that, or we could try to be friends with them so that they link to us naturally. 
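</p><p><em>[A small illustration, not from the conversation: the routine-vs-exceptional test described here is what an XmR chart formalizes. Below is a minimal Python sketch using the standard XmR natural process limits (the mean plus or minus 2.66 times the average moving range); the visit numbers are made up.]</em></p>

```python
def xmr_limits(values):
    # Natural process limits for an XmR (individuals) chart:
    # mean +/- 2.66 * average moving range between consecutive points.
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    avg_mr = sum(moving_ranges) / len(moving_ranges)
    return mean - 2.66 * avg_mr, mean + 2.66 * avg_mr

def exceptional(values):
    # Points outside the limits are "exceptional variation" worth investigating;
    # everything inside the limits is routine wiggle you can ignore.
    lo, hi = xmr_limits(values)
    return [v for v in values if v < lo or v > hi]

# Hypothetical weekly site visits: routine wiggle around 1,000, then one real spike.
visits = [980, 1020, 1005, 990, 1010, 995, 1600]
print(exceptional(visits))  # [1600] -- the spike week; the rest is noise
```

<p>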
There are a range of experiments that we could do, because we now know that this is a lever that will result in a change in the output metric that we care about.</p><h3><strong>Nothing you're saying should be hard. And it's certainly not technically hard if you're good enough. For some reason BI tools don't generally draw XmR charts, right? You have a tool that does that, right?</strong></h3><p>All the data tools are good enough. Yeah, we have an open source tool that we made because BI tools don't generate XmR charts, and we want people to steal it. We want BI tool vendors to steal the code.</p><h3><strong>You want to drive change in the industry so that people can do this inside their BI tools.</strong></h3><p>The funny thing is that most of the people using the tool are business people who want to get results. And they don't have data teams, so they can't be bothered to deal with the data team and wait for them: just give me the CSV, and I'll paste it into this tool so that I can run the experiments. Which is sad. Commoncog is for people who want to get good at business, because I want to get good at business. So I bring them along for the journey.</p><p>The open source thing is more that I feel for data people; I have an affinity for them and I empathize with them. Plus my wife is a data analyst, so I feel her pain. And I'll give you an answer to the question you didn't quite articulate: why is this not more widespread?</p><h3><strong>What I was poking at originally was that maybe it's not relevant in all cases, and maybe it's relevant to a greater or lesser degree in different cases, but I think you would probably take the position that it's generally a useful tool in most contexts. And my guess is that it somehow has to do with organizational dynamics. Is that right?</strong></h3><p>Yes, it has to do with power. It has to do with discipline. It has to do with will, not skill. You have to bear in mind that Amazon did all of this in 1997 with Excel 97, which sucks. 
And they could forecast the growth of their business at the time to within a 3% error rate.</p><p>Amazon got to a billion dollars on Excel. So if they got to a billion dollars running the WBR on Excel, what excuse do other organizations have? You have the modern data stack. You have everything that we didn't have back in the day. And literally, the way it worked was that every Sunday night, all the departments would drop Excel files into a shared folder.</p><p>So if the tools are not the problem, what's the problem? The problem is the sociotechnical dynamics. The WBR is not just a metrics review meeting. It's also a political tool.</p><p>If an executive asks one of the metrics owners in your department a question that you cannot answer, you'll be shamed in front of the entire organization. So what happens in that kind of political context? You really care about data, and you don't ask stupid questions of your data analysts, because what matters in this dynamic is that you have a set of output metrics you have to hit by the end of the quarter.</p><p>You have to figure out what the input metrics are. And the rule is: we don't talk about the output metrics; we only talk about the controllable input metrics. So you need to go figure out what those controllable input metrics are. Now a fire is lit under your ass to work with your data team, and the data team is embedded inside your organization to figure out what those controllable input metrics are. The way you figure it out is that you do trial and error quickly. You drive this and see if it pushes the output metric after some lag. No, doesn't work. Let's try another one.</p><p>And you have to instrument, and that's OK. If you say that you&#8217;re still instrumenting, they say, &#8220;OK, it's fine; you can't push the controllable input metric. We'll give you some time to instrument.&#8221; Everybody understands that in the org. 
But once you do, you'd better figure out what the controllable input metrics are. And then you start setting targets. And then it becomes a very tight process where there's a fire under everyone, because every Wednesday morning you have to present to executives.</p><h3><strong>It's a system that, once it's working, I can see being incredibly effective and self-perpetuating. It is also a system that I can really imagine being hard to create in the first place, because most executives at a company do not actually want to subject themselves to that type of scrutiny in front of all of their peers.</strong></h3><p>One beautiful thing when you have a business review that measures the company end to end is that when there's a problem in one part of the company, you can say: hey, you in the other department, can you help this guy out? We're part of a team. Let's work together. Things that change in one department affect others, especially once you've figured out the controllable input metrics and output metrics; one person's output metric is sometimes another person's input metric. And so everybody has a bird's-eye view of a very complex business.</p><p>Amazon's WBR is designed so that you can do 500 metrics in exactly 60 minutes. You don't go over time, except during the holiday season. But the point is that it's the practice that matters.</p><p>There was a trend over the last 12 months or so in the data community, at least among the data people I'm connected to on LinkedIn, of talking about metrics trees. That's just as good. The point is that you need some kind of practice like this. And unfortunately, it requires being pushed down by the CEO. All the successful examples I've seen have involved somebody in a position of power on the executive team enforcing this.</p><p>And when it works, it is a wonderful environment to work in as a data person, because everybody is motivated by the same business problems. Everybody has the same causal model of the business. 
And the data team is not some IT help desk to be ignored. They are critical to achieving your goals.</p><h3><strong>One of the reasons data people don't talk about this more on LinkedIn is that it's not something that's controllable for them. It requires the type of executive sponsorship that then gets books written about it.</strong></h3><h3><strong>Probably many data people have never worked in a context like this. I think it was probably more common in an era when there were more physical, manufacturing-type processes that we were trying to measure. So the experience set of people who have operated in this kind of environment is probably just not that high. Would you agree with that?</strong></h3><p>Somebody in my community observed that this kind of operational excellence only emerges when you have a very low return on invested capital. Not very low; single digit. Because if you think about it, if your margins are super, super high, your return on invested capital is super high, and, like Google, you can be sloppy. It doesn't really matter. Who cares? You have a network effect. You're just generating gushes of cash.</p><p>If your return on invested capital is negative, then it doesn't matter either. Your operational excellence is just staving off the inevitable. You're just going to die. You're in a textile mill in America, and you're fighting against it. But if you have a 3% return on invested capital, operational excellence can double that, effectively for free.</p><p>These kinds of methods tend to spread in organizations or industries where you have a single-digit return on invested capital, because if you don't have it, then you die. 
It just so happens that Amazon is a low-margin business that happened to hire somebody from manufacturing to deal with the scale of volume in their fulfillment centers.</p><p>And then the ideas spread, and they used them to crush their competitors, because the competitors just did not understand what was going on in their own businesses. I often tell people that maybe you can get away without this in high-margin businesses. But if you are in a low-margin business and you're up against Amazon, and Amazon has process control while you have something naive like the North Star metric framework, you're going to get crushed, because you are working towards just one metric.</p><p>Amazon is working towards something like 16, and they can spin off initiatives: let's undercut you in this particular way, and we'll process-control that, and process-control the cost we're spending to undercut you. And we can do pincer movements. So it's an organizational capability that is super powerful.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The data jobs to be done (w/ Erik Bernhardsson)]]></title><description><![CDATA[Erik Bernhardsson, the CEO and co-founder of Modal Labs, on his serverless platform for AI, data and ML teams, and his take on the future of data engineering]]></description><link>https://roundup.getdbt.com/p/the-data-jobs-to-be-done-w-erik-bernhardsson</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-data-jobs-to-be-done-w-erik-bernhardsson</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 03 Nov 2024 13:01:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/057bd4de-1447-45df-b9ae-fee7f5827131_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is Season 6 of The Analytics Engineering Podcast. Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p>Erik Bernhardsson, the CEO and co-founder of Modal Labs, joins Tristan to talk about Gen AI, the lack of GPUs, the future of cloud computing, and egress fees. They also discuss whether the job title of data engineer is something we should want more or less of in the future. Erik is not afraid of a spicy take, so this is a fun one. 
</p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://podcasts.google.com/feed/aHR0cHM6Ly9hbmFseXRpY3NlbmdpbmVlcmluZ3JvdW5kdXAubGlic3luLmNvbS9yc3M">Google Podcasts</a></p></li><li><p><a href="https://www.stitcher.com/show/the-analytics-engineering-podcast">Stitcher</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways from this episode</h2><h3><strong>You might be only the second person who's a repeat guest. You were on during Season One, and back then you were still in stealth mode with the project that has now become Modal. What have you been up to in the last few years?</strong></h3><p><strong>Erik Bernhardsson:</strong> When we talked, it was the depths of COVID. I had just quit my job and was hacking on something that's now turned into Modal. And back then, my idea was that I wanted to build a general-purpose platform for running compute in the cloud. 
And in particular, to focus a lot on data, AI, and machine learning use cases. Three years later, I'm still working on that.</p><p>What we discovered along the way was that Gen AI is a great application, because when we started building this, I didn't have a clear use case in mind. I just had an idea that if I built this platform, people would use it, because it seemed like a gap in the market. It turned out that Gen AI is a killer app for what we built.</p><p>We've seen a lot of interest in large-scale applications for audio, video, and image diffusion models, biotech models, and video processing. So we run a big compute cloud, a big pool of GPUs and CPUs in the cloud. The other side of that is that we offer an easy-to-use Python SDK that makes it very easy to take code and deploy it into the cloud, in a way where you don't have to think about scaling, provisioning, containers, and all that stuff you typically have to deal with if you build your own stack.</p><p><strong>You are building a compute platform that's built on top of the cloud providers. Why do people need Modal? How is it different from just operating directly with the services that organizations like AWS provide?</strong></p><p>There&#8217;s more room in the cloud space. AWS, Oracle, Azure, and all of these have done a fantastic job building a foundational layer of storage and compute.</p><p>I've used AWS for a good 15 years, and I love AWS for what it enables me to build. But it's still not easy to use. And it still gets in the way of iterating quickly and shipping things. AWS and others have a very solid place in that stack; they're amazing compute and storage providers. But there&#8217;s plenty of room to innovate in the layer above, which I think of as the layer you were talking about, the Snowflake layer, and that's also where I would put Modal. 
We repackage a lot of the cloud primitives in a way that suits what data, AI, and machine learning teams want to accomplish. And by focusing on one particular use case, we can offer a much better user experience.</p><p>Cloud providers are hard to use because they try to build for every user at the same time, which means no single user is going to have a good user experience. At the end of the day, they're massively successful businesses generating tons of money from storage and compute. That's where I think they belong. That's the core part of the stack.</p><p>Anything above that is going to try to drive usage of storage and compute. They just want to drive demand to the underlying services. And you've seen this with the success of Snowflake. Snowflake builds on top of AWS and the other clouds and in a way competes with them, but not really, because at the end of the day, AWS and the other clouds get the money either way.</p><p><strong>So there's room for a layer on top of the hyperscalers, and Databricks and Snowflake are both sitting in that place today, and you're a more nascent entrant there. How do I frame what you're helping users with versus Databricks&#8212;especially Databricks, because they&#8217;re very focused on AI use cases and you're doing a lot of work there too?</strong></p><p>I hesitate to position ourselves against Databricks because in many ways they're a fantastic company. But aspirationally, we share the same vision of what we want to accomplish. They're obviously massively ahead of us by 10, 15 years, but their vision is the same; they want to build an end-to-end platform to serve data, AI, and machine learning needs.</p><p>We come at it with a very different architectural approach. We basically said, you can't run this yourself. 
We're going to be not just the software layer; we're also going to be a hosted infrastructure-as-a-service platform&#8212;building for containerization, building for cloud, building for this multi-tenant, super-elastic compute pool. That meant that we could make very different architectural decisions. Obviously I'm biased, but I think we have the right tailwinds because we're thinking about architecture in a very different way.</p><p><strong>There are a couple of companies recently doing some version of an abstraction layer across the different hyperscalers and across the different availability zones to do resource pooling. Why is that happening today?</strong></p><p>I honestly think it just comes down to the fact that GPUs are expensive. In order to make the economics work, you have to run them at very high utilization. And because you run them at very high utilization, you're going to have poor availability, which means that any single availability zone may actually be close to capacity most of the time. Which means that in order to do this well, you need to go to different availability zones, different regions, different clouds.</p><p>It's a big part of what we've been spending time doing&#8212;integrating with a bunch of different cloud vendors, using all the different regions and zones, and then just getting capacity to the customer wherever we can find it. That's actually a fun, interesting problem in itself. We monitor prices, which change dynamically 24/7. We solve a mixed integer programming problem to figure out the optimal placement, given the resource constraints: how do we allocate the pool of machines in the cheapest possible way?</p><p><strong>There's a fixed number of GPUs in the world. There are fewer GPUs than the number of workloads in the world, I think. 
Is that a true statement?</strong></p><p>The fundamental economics of GPUs is that most of the cost goes to Nvidia. To recoup that cost, you need to run them at very high utilization. Most of the CPU cost is power, which makes it a more variable cost, which means that you don't really care about utilization of CPUs as much. So AWS could just over-provision and run things at much lower utilization. But with GPUs, to make the economics work, you need to run them at very high utilization. Hopefully GPU prices will come down; that's what I'm hoping for. But right now, it&#8217;s a supply and demand problem.</p><p><strong>If somehow GPU prices come down over time and they become more like x86 processors in the way that the market works, do we still care about all this hard work that you're doing to combine resource pools of GPUs across multiple availability zones?</strong></p><p>Maybe it becomes less relevant, but on the other hand, the value of the platform then becomes more important. I think a lot about this for a lot of AI startups: are you long or short on GPU prices? If GPU prices go up, what happens to the value of your company? Does it go up? The truth is, if GPU prices were to crash, it would be hard for us in the short term, because we have a bunch of long-term contracts and revenue would go down quickly.</p><p>But I actually think in the long run, it would be good for us, because having an abundance of GPUs is very good for customers. It's good for the world. And like a lot of infrastructure providers, we focus on the software, not the hardware. If the underlying cost of the hardware goes down, the relative value of the software goes up.</p><p><strong>Let's talk about egress fees. One of the driving forces of being cross-cloud, cross-availability-zone is GPUs. In the past, one of the reasons not to do that had been that it's just really expensive to move your data around. 
I think in certain cases that's still true, but it's starting to change. Can you say more about what's happening with egress fees?</strong></p><p>AWS still has very high egress fees, but they're coming down. I think there's a lot of pressure from Cloudflare's R2, for instance. The interesting thing about Gen AI is that egress fees don&#8217;t really matter that much, and that's driven a weird re-architecture of a lot of compute. Part of why we've been able to build a multi-region, multi-cloud architecture is that Gen AI doesn't need a lot of bandwidth. It's very compute-hungry; the amount of data moved is actually very little compared to the compute.</p><p>I have a feeling that over time, egress fees will come down and the distinction between regions will matter less and less, except for latency-sensitive applications. But it turns out a lot of stuff is actually not that latency-sensitive.</p><p><strong>One of the biggest conversations in the data industry today is what's going on in the file format and catalog wars&#8212;Unity, Iceberg, Delta&#8212;and there's a lot of focus on making sure that different systems can talk to each other. But one of the things that we're not talking about yet is that if you have a global company, you're probably not using a single cloud provider in a single availability zone. So you also have to solve a fabric issue of where the data is physically located. It's not just a format thing.</strong></p><p>Yeah, that seems hard.</p><p><strong>Okay, let's go from the future of the cloud to the future of data engineering. You spent a long time as a data engineer and building tooling for data engineers.</strong></p><p>I was at Spotify for seven years, where I built a lot of music recommendation systems. And then I was at a company called Better for many years as the CTO. I've been focused on data, AI, and machine learning. The precursor to all of that, or the prerequisite, is that you&#8217;ve got to get to the data. 
And so that necessitates doing a lot of data engineering, data cleaning, and building data pipelines. I ended up building my own workflow engine called Luigi.</p><p><strong>No one uses it today, but you're skipping past a time period in which a lot of people used it.</strong></p><p>A lot of people used it 10 years ago. I was deep in that swamp. Data engineering is funny because in a way, I kind of don't want it to exist. My prediction has always been that it's going to go away at some point.</p><p><strong>I want to explore this with you because in many ways, I agree with you. At the same time, there are a ton of humans in the world who call themselves data engineers and use dbt. And so the last thing in the world I want to do is say that data engineers suck, because I don't believe that. The funny thing, though, is that technology progresses over time and the jobs to be done that humans need to do change.</strong></p><p>I've seen this so many times: a team of data scientists benefiting tremendously from injecting a lot of data engineering skills. Suddenly they can get the data.</p><p>I don't really like the idea of it being someone's job to shuffle data around. I want everyone to think about what the business needs and to build business applications. I would say the same thing about any internal platform team. All these internal platform teams tend to be somewhat ephemeral and transient.</p><p>All these titles, too, right? There are data engineers, data scientists, analytics engineers. To me, it doesn't matter. There are always going to be people who need to work with data, AI, and machine learning, and that slice is going to grow and grow. But the actual composition of those teams is going to change a lot. And so I don't pay that much attention to titles.</p><p><strong>I just wrote a white paper on the analytics development life cycle, and there are three different jobs to be done in it. 
There's a developer&#8212;people who create reusable assets for other people. There's an analyst&#8212;people who interact with the data to try to draw conclusions about the real world. And then there's a decision maker&#8212;people who get the recommendations from the analyst and make decisions. If you start slicing it up more than that, I think you inject friction into the process as opposed to adding clarity.</strong></p><p>I have to compliment you on creating the label "analytics engineer." Because at that time there were so many people out there unsure how they fit into their organization. You gave them an identity. That was eye-opening for so many people and created a sense of belonging in a community.</p><p><strong>My favorite thing was when people told me that they got an analytics engineer title and their pay went up 50%. I was like, great, you're doing valuable work. You should be paid for it.</strong></p><p>You gave a lot of people recognition, and I think you should get a lot of credit for that.</p><p><strong>We've talked about how data engineering is a valuable thing to be done. There's a trajectory here where probably fewer humans need to turn knobs and dials. What about machine learning and AI? There are ML engineers, and more and more people describe themselves as AI engineers. Do we need specific titles for these things, or are we all just software engineers?</strong></p><p>The difference between AI engineers and machine learning engineers is that AI engineers use TypeScript and machine learning engineers use Python, for the most part. A lot of the recent stuff around LLMs&#8212;not to trivialize it&#8212;is a low-code machine learning tool for people for whom machine learning felt unapproachable. Suddenly they were given a new tool and they could stitch together a bunch of prompt engineering and get it to work.</p><p><strong>Let's wrap up with the question that I love to ask everyone. 
What do you hope will be true of the data and AI software industry over the coming five years?</strong></p><p>I hope that people will never have to think about containers, infrastructure, provisioning, or resource management. I really hope that all of that will be abstracted away in the next five to 10 years, so that people can focus on application code and business logic and building cool AI shit, and rely on infrastructure to take care of all the other stuff.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 30,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>