<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Analytics Engineering Roundup]]></title><description><![CDATA[The internet's most useful articles on analytics engineering and its adjacent ecosystem. Curated with ❤️ by Tristan Handy.]]></description><link>https://roundup.getdbt.com</link><image><url>https://substackcdn.com/image/fetch/$s_!9uGH!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8b4e3170-43ea-4f13-8662-f4b4e18cfe12_256x256.png</url><title>The Analytics Engineering Roundup</title><link>https://roundup.getdbt.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 09 Mar 2026 17:13:21 GMT</lastBuildDate><atom:link href="https://roundup.getdbt.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[dbt Labs Inc.]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[analyticsengineeringroundup@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[analyticsengineeringroundup@substack.com]]></itunes:email><itunes:name><![CDATA[Tristan Handy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Tristan Handy]]></itunes:author><googleplay:owner><![CDATA[analyticsengineeringroundup@substack.com]]></googleplay:owner><googleplay:email><![CDATA[analyticsengineeringroundup@substack.com]]></googleplay:email><googleplay:author><![CDATA[Tristan Handy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Iceberg ecosystem today (Anders Swanson)]]></title><description><![CDATA[What can data teams realistically expect when attempting to run on top of Iceberg in 
production?]]></description><link>https://roundup.getdbt.com/p/the-iceberg-ecosystem-today-anders</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-iceberg-ecosystem-today-anders</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Mar 2026 13:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/K7PvwU5ulrA" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The data industry is moving towards open standards. That migration is happening rapidly across the data ecosystem, even as the rapid progress of AI and agents sucks much of the oxygen out of the room. </p><p>The dbt Labs data team is moving to an all&#8209;Iceberg lake with a mix of compute engines to power transformation, analytics, and agentic experiences. The team has been able to move quickly towards this architecture because the entire ecosystem has been laying the groundwork for years, and all of it is now coming together to make this new open world a reality. </p><p>On this episode, Tristan discusses the reality on the ground for data practitioners. Where&#8217;s the Iceberg ecosystem today? What can practitioners realistically expect when attempting to run on top of Iceberg in production?</p><p>Tristan is joined by Anders Swanson, a developer experience advocate at dbt Labs. Anders has spent a lot of time over the years navigating open-source data ecosystems and tracking their progress. </p><p>They unpack the open standards shift, define the core building blocks (query engines, object stores, catalogs), and dig into why external catalogs have become a fourth namespace tier across platforms. 
Anders outlines a pragmatic, phased adoption model for Iceberg integrations, explains why metadata performance and resiliency are hard requirements, and clarifies why vended credentials exist and what they solve.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>The call for papers is open for dbt Summit 2026.</strong> We invite data practitioners, platform leaders, and executives to share real stories of how data gets done at the world&#8217;s largest gathering of dbt community members. If you ship fast, reduce costs, improve trust, or bring governed AI to life, the dbt community wants to hear from you.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__&quot;,&quot;text&quot;:&quot;Submit a talk&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__"><span>Submit a talk</span></a></p><p>Coalesce is now dbt Summit. Join the world&#8217;s largest gathering of dbt users, where data leaders and practitioners come together to shape the future of data analytics and AI. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!shpb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!shpb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg" width="1456" height="819" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:906976,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/dbt-summit/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q1-2027_dbt-summit-2026_aw&amp;utm_content=dbt-summit____&amp;utm_term=all_na__&quot;,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/190147989?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!shpb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 424w, https://substackcdn.com/image/fetch/$s_!shpb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 848w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!shpb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd97fe3b0-9606-4a9d-8a5f-6e7970f032c1_3840x2160.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" 
allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-K7PvwU5ulrA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;K7PvwU5ulrA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/K7PvwU5ulrA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: I wanted to have you on because of work you&#8217;ve been doing internally to summarize the state of the Iceberg ecosystem. We&#8217;ve talked about Iceberg a bunch lately with folks deep in specific parts. Your work is more of an overview: where we&#8217;re at with platform integrations, what&#8217;s easier now than a year ago, and what&#8217;s still hard. Before we dive in, I want to define a few terms. When you say &#8220;query engine,&#8221; what do you mean?</h3><p><strong>Anders Swanson:</strong> It&#8217;s the thing that does your work. 
When you issue a CREATE TABLE or a SELECT statement, it&#8217;s what returns data or stores it somewhere for later.</p><h3>Object store.</h3><p>It&#8217;s the cloud service where you can store an object. An object is anything: a blob.</p><h3>Catalog.</h3><p>In this context, a catalog knows what tables and views exist and where they are, and how you can fetch or write to them.</p><h3>Let&#8217;s talk internal versus external catalogs.</h3><p>An internal catalog is what you get by default in a system like Snowflake or SQL Server. An external catalog is more like another directory, often managed by a different system. As you connect more disparate platforms, you can&#8217;t assume one system controls everything.</p><h3>The complexity comes from duplication. How do you make namespaces unique? Can you plug in many external catalogs?</h3><p>Abstraction matters. A common pattern emerging is one&#8209;to&#8209;one mapping of an external catalog into a database. That pushes a move to a four&#8209;part namespace: catalog, database, schema, identifier. Spark moved toward this; Databricks Unity Catalog and Snowflake&#8209;style catalog link approaches are in this family.</p><h3>So the downside?</h3><p>The devil is in the details, especially metadata performance and resiliency. For example, information schema listing. Users expect listing tables to be fast and reliable. In a federated world, if listing tables takes five seconds, users blame the vendor they&#8217;re using&#8212;even if the external system is slow. DuckDB draws a line by not mixing external catalog tables into information schema listing today. Snowflake&#8217;s catalog link databases appear to cache or mirror metadata so it feels as performant as native tables.</p><h3>With catalog link databases, Snowflake is doing mirroring.</h3><p>Yes. Mirroring exists in different flavors across platforms. 
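</p><p><em>To make the namespace discussion above concrete: in the four&#8209;part pattern, a table in an external catalog is addressed by catalog, then database, then schema, then identifier. A sketch in generic SQL (all names here are hypothetical, and exact syntax varies by engine):</em></p><pre><code>-- Four-part namespace: catalog.database.schema.identifier
-- "sales_catalog" stands in for an external Iceberg catalog
-- plugged into the platform; the rest are ordinary namespaces.
SELECT order_id, amount
FROM sales_catalog.analytics.finance.orders;</code></pre><p>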
Delta is sometimes seen as &#8220;simpler&#8221; because metadata can live in object store, but as soon as you want multiple engines writing, you still need a real catalog.</p><h3>Sharing across multiple platforms adds another layer. What&#8217;s the state of platforms reading and writing to the same Iceberg catalog?</h3><p>There are phases of integration.</p><p>Phase one is the naive approach: you have Parquet and JSON in object storage, and an engine reads it. Reading is easier than writing. You can get a toy example working.</p><p>Then you run into versioning and &#8220;what&#8217;s latest.&#8221; The next phase is connecting to an Iceberg REST catalog so engines can ask for the latest table version without users thinking about paths.</p><p>Phase three is schema&#8209;scale: it&#8217;s never just one table. You need discovery of new tables, keeping schemas up to date, and eventually things like multi&#8209;table transactions.</p><h3>This maps to dbt Mesh and cross&#8209;platform mesh. Producer vs consumer.</h3><p>A consumer&#8209;led model requires the downstream team to create pointers (DDL) to external tables. It&#8217;s operationally messy. Producer&#8209;led is cleaner: the producer writes to the catalog and it&#8217;s just there, immediately queryable downstream.</p><h3>Are platforms there yet?</h3><p>Some support writing directly to external catalogs. When it works, it&#8217;s great, but there are still kinks. We&#8217;re retrofitting race cars designed for isolation to be interoperable without losing performance.</p><h3>Identity is one of the hairiest issues. Vended credentials.</h3><p>Vended credentials solve the &#8220;two keys&#8221; problem. You authenticate to the catalog, the catalog tells you where data lives, but then you need separate object store credentials to read files. 
Vended credentials means the catalog vends short&#8209;lived credentials so you can access the object store location without managing separate keys.</p><h3>That doesn&#8217;t solve user identity and grants.</h3><p>Correct. Vended credentials isn&#8217;t global authorization. Identity and access across platforms is still hard. Ideally you grant access once and it works everywhere, but enterprises have different identity providers and platforms have different permission models. Today, admins often have to configure grants separately in each platform.</p><h3>Is this mission creep?</h3><p>The goal is to reduce how many people have to think about storage details. Big tech had whole data platform teams solving reliability problems in Hive&#8209;era lakes. Iceberg reduces that toil dramatically, but the long tail is still auth, mirroring, and cross&#8209;platform governance.</p><h3>How does this reshape data teams?</h3><p>Analytics engineering abstracted a lot of work. Data engineering has also been simplified by replication/orchestration vendors. What remains is the open ecosystem complexity: identity, object store policies, and cross&#8209;platform connections. Many enterprises already have teams with these skills (infra as code, Terraform, Snowflake management), but others will need to grow into them.</p><h3>Are vendors embracing Iceberg in good faith?</h3><p>The goodwill and collaboration in the past 18 months feels unprecedented. We&#8217;re getting &#8220;more problems&#8221; because we solved prior ones. The industry aligning on standards feels like F1 teams standardizing components so they can innovate elsewhere.</p><h3>In your internal writeup about Iceberg, you quoted Wolf Hall: &#8220;The making of a treaty is the treaty. It doesn&#8217;t matter what the terms are, just that there are terms, it&#8217;s the goodwill that matters. When that runs out, the treaty is broken, whatever the terms say.&#8221; Explain the relevance here. 
</h3><p>When I joined dbt, it was taboo to mention one partner to another. Now vendors openly acknowledge mutual customers and invest in interoperability. On the Iceberg repo you see competitors collaborating on proposals. The goodwill is the standard.</p><h3>Wrap us up with three things you&#8217;re excited for next year.</h3><p>Push&#8209;based catalog updates so platforms can subscribe to changes rather than repeatedly listing and polling. Progress on the small files problem so Iceberg works better for smaller data too. And more platforms supporting writing directly to external catalogs, unlocking producer&#8209;led sharing and cross&#8209;platform mesh.</p><h2>Chapters</h2><p>00:00:00 &#8212; Intro: why open standards are accelerating</p><p>00:01:20 &#8212; What practitioners can expect from Iceberg in production</p><p>00:05:00 &#8212; Lightning round: query engine, object store, catalog</p><p>00:06:20 &#8212; Internal vs external catalogs</p><p>00:09:30 &#8212; The &#8220;four-part namespace&#8221; and catalog-link style abstractions</p><p>00:11:30 &#8212; The downside: metadata performance, resiliency, and caching</p><p>00:17:10 &#8212; Sharing across multiple platforms: reality and tradeoffs</p><p>00:19:10 &#8212; Iceberg integration phases (1: naive table, 2: REST catalog, 3: schema-scale)</p><p>00:24:10 &#8212; Producer vs consumer model and cross-platform mesh</p><p>00:29:10 &#8212; Identity and &#8220;vended credentials&#8221;: what it is and what it isn&#8217;t</p><p>00:33:30 &#8212; The hard unsolved part: grants and global identity across platforms</p><p>00:37:00 &#8212; Is this mission creep? What Iceberg is optimizing for</p><p>00:39:50 &#8212; How roles on data teams evolve in an open ecosystem</p><p>00:43:40 &#8212; Are vendors genuinely aligned? 
Why Anders is optimistic</p><p>00:46:50 &#8212; &#8220;The making of a treaty is the treaty&#8221;: goodwill as the standard</p><p>00:51:50 &#8212; Three things Anders is excited for next year</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 80,000 data teams use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Apache Iceberg and the catalog layer (w/ Russell Spitzer)]]></title><description><![CDATA[Everything you ever wanted to know about open table formats with a member of Apache Iceberg and Apache Polaris]]></description><link>https://roundup.getdbt.com/p/apache-iceberg-and-the-catalog-layer</link><guid isPermaLink="false">https://roundup.getdbt.com/p/apache-iceberg-and-the-catalog-layer</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 25 Jan 2026 13:59:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/wLH-vADSwaw" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this episode of The Analytics Engineering Podcast, Tristan talks with Russell Spitzer, a PMC member of Apache Iceberg and Apache Polaris and principal engineer at Snowflake. They discuss the evolution of open table formats and the catalog layer. They dig into how the Apache Software Foundation operates. And they explore where Iceberg and Polaris are headed. If you want to go deep on the tech behind open table formats, this is the conversation for you.</p><div><hr></div><p>A lot has changed in how data teams work over the past year. We&#8217;re collecting input for the <a href="https://forms.gle/KBU9smukSfiK1g4W7">2026 State of Analytics Engineering Report</a> to better understand what&#8217;s working, what&#8217;s hard, and what&#8217;s changing. 
If you&#8217;re in the middle of this work, your perspective would be valuable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://forms.gle/Jc54NuP96qekHU9j7&quot;,&quot;text&quot;:&quot;Take the survey&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://forms.gle/Jc54NuP96qekHU9j7"><span>Take the survey</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://forms.gle/DPtgXva549hevZeH7" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png" width="728" height="728" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://forms.gle/DPtgXva549hevZeH7&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" style="height:20px;width:20px" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a 
href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-wLH-vADSwaw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;wLH-vADSwaw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/wLH-vADSwaw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: You spend a lot of your time thinking about Iceberg and Polaris. Give the audience background on how you found yourself in this niche of high&#8209;volume analytic data file formats.</h3><p><strong>Russell Spitzer:</strong> It&#8217;s a bit random. I started at DataStax on Apache Cassandra as a test engineer and quickly got drawn into analytics. I saw big compute clusters and wanted to be involved. A coworker, Piotr, noticed Spark 0.9 and began a Spark&#8211;Cassandra connector. That got me into Spark. Over six to seven years I focused on moving data between Cassandra and Spark and into other systems. The interoperability problem across distributed compute frameworks was compelling.</p><p>This was pre&#8209;Apache Arrow and pre&#8209;table formats. We were just putting Parquet files everywhere and no one quite knew what they were doing. 
Pre&#8209;Spark, people explored DSLs like Apache Pig. Eventually the industry converged on SQL for end&#8209;user interfaces.</p><p>I later applied to Apple for the Spark team.</p><h3>Helping build Apple&#8217;s Spark infra, or working directly on Spark?</h3><p>Apple has an open-source Spark team and a Spark&#8209;as&#8209;infra team. I was trying to join the open source team, pushing Apple&#8217;s priorities into the project and supporting Spark as a service. During interviews, Anton&#8212;another Iceberg PMC&#8212;convinced the hiring manager I should join the data tables team, essentially Apple&#8217;s Apache Iceberg team.</p><p>They ambitiously planned to replace lots of internal systems with Iceberg. Iceberg existed but was early (Netflix started it around 2018/2019; I joined Apple in 2020). At Apple it was Iceberg all the time; convincing teams to move off older stacks, adopting open&#8209;source&#8209;as&#8209;a&#8209;service to save money, and getting onto ACID&#8209;capable foundations. We were successful.</p><h3>Migrations are hard. How did you make it accessible?</h3><p>We replaced complicated bespoke reliability fixes with Iceberg. In Hive/HDFS, small&#8209;file problems lead teams to write custom compaction and locking. Removing that toil is a big win. For big orgs, migration is a long&#8209;term investment with ongoing engineering cost. For smaller companies, the key is offloading runtime responsibilities&#8212;ideally to SaaS&#8212;so engineers aren&#8217;t in the loop. Open source limits lock&#8209;in so you can move between systems. Most companies are paid to deliver business value, not to build data infra. dbt is a great example of avoiding hand&#8209;rolled pipeline code. Same logic applies to table/file formats.</p><h3>Let&#8217;s talk Apache governance. What&#8217;s a PMC? How do projects run?</h3><p>Apache projects aren&#8217;t owned by one company. Influence is earned by contributing to the community. 
The PMC governs merges, releases, membership. People move companies; the project stays with them. The goal is to make the project broadly useful. There&#8217;s no CEO dictating roadmap and no company can change the license.</p><p>Most big projects&#8212;Spark, Kafka, Iceberg, Flink&#8212;are maintained by employees of companies with vested interests, but governance is consensus&#8209;driven. Vetoes are for technical issues (security, future&#8209;limiting design), not ideology.</p><h3>Is Iceberg for the top 20 tech companies or for everyone?</h3><p>Not everyone needs Iceberg. OLTP belongs elsewhere. But for analytics, we should move past raw Parquet partition trees with folder&#8209;name partitioning. In the Hadoop era, lakes were dumping grounds; schema evolution was painful. Many are still moving from CSV to Parquet. Over time, better encodings and table formats become default.</p><p>Decoupling compute and storage changes everything versus co&#8209;located HDFS. Defaults tuned for HDFS (like 128MB Parquet files) don&#8217;t always hold for S3. We want elastic storage and compute; no one wants to pay for compute because storage grew.</p><h3>Walk us through Iceberg versions.</h3><p>v1: transactional analytics&#8212;ACID commits instead of fragile Hive/HDFS patterns. v2: row&#8209;level operations&#8212;logical deletes via delete files so you don&#8217;t rewrite 10M&#8209;row data files to remove one row; later compaction physically purges (key for GDPR). v3: expanded types&#8212;geospatial and variant for semi&#8209;structured data; Variant was standardized across vendors and Parquet so everyone can write/read consistently.</p><p>v4: two thrusts&#8212;streaming and AI. Reduce commit latency, make retries faster under contention. Historically writes took 10&#8211;20 minutes, so commit latency didn&#8217;t matter. For streaming (writes every minute/five), it does. 
We&#8217;re evolving commit and REST catalog protocols so clients can specify intent (add these files, ensure these exist, then delete those) and let the catalog resolve conflicts server&#8209;side.</p><p>On AI: Iceberg doesn&#8217;t yet serve some vector/image&#8209;heavy patterns well. We&#8217;re exploring changes in Iceberg, Parquet, or both, without breaking existing tables.</p><h3>Talk about Polaris and the catalog layer.</h3><p>Polaris is an Apache incubator project (PPMC). Incubation proves we operate like an Apache project (community&#8209;driven, trademarks donated). Iceberg defines the REST catalog spec/client; Polaris implements a catalog that speaks that spec. Many of us work across projects (Parquet, Iceberg, Polaris), which helps align boundaries.</p><h3>Horizon, Polaris, external catalogs&#8212;what&#8217;s the story?</h3><p>We&#8217;re simplifying: Snowflake can act as an Iceberg REST catalog, or you can use an external REST catalog. External can be Polaris (managed by Snowflake or self&#8209;hosted) or another REST implementation. Interoperability means everything talks the same REST.</p><h3>What is Polaris trying to be best at?</h3><p>A broad, interoperable lakehouse catalog. It can act as a generic Spark catalog (HMS replacement) and aims to support multiple table/file formats. Architectural choices differ (KV vs. relational storage, where transactions live, policy enforcement vs. recording, identity integration). Polaris aims for base implementations that are pluggable&#8212;e.g., AWS/GCP/Microsoft identity.</p><h3>Identity and scope&#8212;where does the catalog stop?</h3><p>There&#8217;s a &#8220;business catalog&#8221; for discovery/listing versus a &#8220;system catalog&#8221; that must know table layout to govern access. Polaris can vend short&#8209;lived credentials for the exact directory of a table&#8217;s files for a load operation; that requires understanding layout. 
Purely relational metadata often needs to delegate that decision.</p><h3>Will identity/grants slow broad adoption?</h3><p>Possibly. But many once&#8209;complex things become default&#8212;compressed files, columnar formats, soon encryption. With collaboration (like Variant), we&#8217;ll land broadly accepted patterns.</p><h2>Chapters</h2><p>00:01:30 &#8212; Guest welcome and interview start</p><p>00:02:00 &#8212; Russell&#8217;s path: DataStax Cassandra, Spark connector, interoperability</p><p>00:05:20 &#8212; Joining Apple&#8217;s Iceberg team and early Iceberg momentum</p><p>00:06:20 &#8212; Why migrations resonated: replacing bespoke Hive/HDFS compaction/locking</p><p>00:09:10 &#8212; Apache governance 101: PMCs, consensus, and corporate influence</p><p>00:15:40 &#8212; How decisions land without votes; when vetoes apply</p><p>00:18:30 &#8212; Who needs Iceberg and where it fits</p><p>00:22:20 &#8212; Lake &#8594; lakehouse and warehouse &#8594; lakehouse in the cloud era</p><p>00:25:20 &#8212; Iceberg versions: v1 transactions, v2 row&#8209;level ops (GDPR), v3 types</p><p>00:28:10 &#8212; Standardizing Variant across vendors and Parquet</p><p>00:31:10 &#8212; Iceberg v4 goals: streaming commit/retry improvements and AI use cases</p><p>00:33:40 &#8212; Commit latency and server&#8209;side conflict resolution</p><p>00:37:20 &#8212; Polaris as an Apache incubating project (PPMC)</p><p>00:39:30 &#8212; Iceberg REST catalog spec and Polaris implementation</p><p>00:42:30 &#8212; Clarifying Snowflake Horizon, Polaris, and external REST catalogs</p><p>00:45:10 &#8212; What Polaris aims to be best at; pluggable identity providers</p><p>00:48:00 &#8212; Identity scope: business vs. system catalogs and credential vending</p><p>00:51:00 &#8212; Will identity/grants slow mass adoption?</p><p>00:52:50 &#8212; Wrap&#8209;up</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[AI agents and the data lake (w/ Lauren Anderson)]]></title><description><![CDATA[The head of Okta's enterprise data platform on why central governance and the semantic layer are so essential]]></description><link>https://roundup.getdbt.com/p/ai-agents-and-the-data-lake-w-lauren</link><guid isPermaLink="false">https://roundup.getdbt.com/p/ai-agents-and-the-data-lake-w-lauren</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 11 Jan 2026 14:03:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/sa-BJkM75TQ" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One of the interesting commonalities of AI and the data lake is that they both require new thinking around how we manage identity. For AI, the big question is how do agents interact with underlying data? For the data lake, the big question is how do we make open data stored outside the purview of any given data platform act like you&#8217;d expect?</p><p>In this episode of The Analytics Engineering Podcast, Tristan talks with Lauren Anderson, who leads the enterprise data platform at identity company Okta. Lauren discusses how identity sits at the center of two seismic shifts in data&#8212;AI agents and the open data lake&#8212;and why central governance and a shared semantic layer are critical. She lays out how analytics engineers and data engineers should divide responsibilities as agents begin to write a growing share of analytical queries. </p><div><hr></div><p>A lot has changed in how data teams work over the past year. 
We&#8217;re collecting input for the <a href="https://forms.gle/KBU9smukSfiK1g4W7">2026 State of Analytics Engineering Report</a> to better understand what&#8217;s working, what&#8217;s hard, and what&#8217;s changing. If you&#8217;re in the middle of this work, your perspective would be valuable.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://forms.gle/Jc54NuP96qekHU9j7&quot;,&quot;text&quot;:&quot;Take the survey&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://forms.gle/Jc54NuP96qekHU9j7"><span>Take the survey</span></a></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://forms.gle/DPtgXva549hevZeH7" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png" width="728" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1080,&quot;width&quot;:1080,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:&quot;https://forms.gle/DPtgXva549hevZeH7&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3Xzm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 424w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 848w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!3Xzm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c13cbe4-3bd5-4cc8-97ad-51fe6497ede0_1080x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" style="height:20px;width:20px" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" 
frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-sa-BJkM75TQ" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;sa-BJkM75TQ&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/sa-BJkM75TQ?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: Before we dive into the current day, can you share a little bit about your background and how you came to the role that you&#8217;re in today?</h3><p><strong>Lauren Anderson:</strong> I&#8217;ve had a 20&#8209;something year career at this point. I have basically spent my entire career in analytics in some way, but my first data job was at a big bank. I won&#8217;t name it. There&#8217;s only a few big banks, so you could probably guess. I worked for the finance org and I did compensation planning and administration, with a side of sales tracking and analytics. 
I was part database analyst, part customer support for people that made a lot more money than I did.</p><p>I was there for seven, seven and a half, eight years. Towards the end of it, I became the owner and creator and almost business architect for our brand&#8209;new sales tracking data warehouse. At a very young age, I got to think about how relational databases should come together for the outcome of both analytics and reporting&#8212;dashboards and whatnot&#8212;but also operations, which was paying compensation every month. It got me super excited about this world of data and being able to architect pipelines and the end&#8209;to&#8209;end flow for real&#8209;world outcomes.</p><h3>What do you think allowed you to be successful in that era? I often think the things that enabled success then aren&#8217;t the same as what make data folks successful today.</h3><p>When I took it over, we ran compensation out of an Access database. I was new, the person who designed it had left, and there wasn&#8217;t much documentation. It worked the first month, then broke the second&#8212;right before a payroll deadline. I rebuilt it as a long series of SQL queries with inline comments and step&#8209;by&#8209;step checks that produced a clean file. That willingness to throw away the brittle thing and rebuild with clarity and documentation gave me early success. The meta&#8209;skills&#8212;the ability to learn, take chances, and figure out the best path&#8212;still apply, but the technology is completely different now.</p><h3>You&#8217;ve split time at Okta into two stints. How would you characterize the work?</h3><p>Okta was my first truly B2B company. I realized quickly that B2B data is my sweet spot. I love thinking about customers as businesses and how business users interact with our products and features. Okta data is complex&#8212;many products, features, and highly configurable use cases&#8212;especially with large customers. That variety is exciting. 
In simpler retail flows you see a lot of the same patterns; in B2B, the variety is the appeal.</p><h3>What&#8217;s your current role?</h3><p>I lead our enterprise data platform, engineering, and architecture function. For enterprise data used to make business decisions, we own ingestion into the warehouse, transformations, and delivery&#8212;dashboards, reverse ETL to third&#8209;party applications, other data stores, and internal apps.</p><h3>How big is the central function and how do you engage with the business?</h3><p>We&#8217;re about 50 people across data engineering and analytics/data science in a company south of 7,000 employees. We support every business unit. Engagement spans a maturity curve. One end is platform self&#8209;service: teams land data via approved connectors, build transformations in dbt on our implementation, and build dashboards in Tableau we administer. Governance and roles are defined centrally, and teams assign people to those roles. The other end is a white&#8209;glove model where we partner through the full lifecycle&#8212;question, discover existing assets, requirements, data work, build, interpretation, validation, and end&#8209;of&#8209;life of the data product. Our sweet spot is the middle: we own enterprise &#8220;gold&#8221; pipelines for company&#8209;level metrics&#8212;monitored and governed&#8212;while domains build and later graduate via a path&#8209;to&#8209;production under stronger governance.</p><h3>Okta is known for identity and security. How does security&#8209;first actually work in practice?</h3><p>Reinventing controls every time slows you down. We invest in repeatable frameworks. Any new source goes through third&#8209;party risk review, classification, and decisions on masking or exclusions. We help teams through that; after a couple times, they can engage directly with risk while we stay in the loop and monitor. As our classifications and expectations got clearer, review cycles shrank from weeks to days. 
It&#8217;s not all roses&#8212;it takes time&#8212;but we all operate as security practitioners. That shared mindset builds trust and reduces corner&#8209;cutting.</p><h3>How much do users need to know?</h3><p>We don&#8217;t expect everyone to know everything. We provide dbt frameworks and minimum testing standards, plus SMEs to guide teams. The culture is to ask when unsure.</p><h3>Will agents write more analytical queries than humans in the next 12&#8211;24 months?</h3><p>Macro, yes. For us, more like 24&#8211;36 months because we&#8217;re careful. The key is safe, ethical AI consistent with being a security company.</p><h3>How are you thinking about agent access?</h3><p>Central governance. Ideally, agents query centralized, agent&#8209;ready stores. Run governance once: policies, roles for users and for data, tracking and logging on a central plane. The semantic layer is essential. Creating semantic views must get easier and more automated, and semantics should inform policy application.</p><h3>Why are agents different from humans in access patterns?</h3><p>Row&#8209;level security to the extreme. Conversational intelligence data should be limited to what the requesting user can access. Aggregations could be broadly accessible with anonymization, but detailed content should remain constrained. You might also limit allowed functions on large unstructured objects. Identity for agents matters&#8212;Okta Secures AI looks at distinct identity patterns to secure agents across applications.</p><h3>Where are you with MCP and agent building?</h3><p>Early, building support and insight use cases. Progress is fast, but nothing broad in production yet.</p><h3>How should analytics engineers and data engineers participate?</h3><p>Analytics engineers should own semantics&#8212;tooling, vendor choices, onboarding use cases, and the shared business language. 
Data engineers should optimize for consistency and scale, notice overlap across agents, and provide a platform others can build on with confidence in governance and security.</p><h3>Will you standardize an agent development platform?</h3><p>Yes, in partnership with engineering and shared services. Our current pull skews to the business, so we&#8217;re leaning toward accessible, governed platforms that serve both business and engineering with central governance.</p><h3>Any assumptions you&#8217;re rethinking?</h3><p>Treating everything like a relational model. Many initial agent questions are intentionally simple, where speed and reasonable accuracy trump perfect sophistication. The important thing is to start, observe, and mature.</p><h2>Chapters</h2><p>00:02:28 &#8212; From bank analytics to owning a sales DW</p><p>00:05:00 &#8212; Rebuilding brittle Access &#8594; SQL with documented checks</p><p>00:08:30 &#8212; Ops accountability then vs. optimization today</p><p>00:11:00 &#8212; TripIt, marketing analytics, and moving into tech</p><p>00:13:14 &#8212; Why B2B data became Lauren&#8217;s sweet spot</p><p>00:16:00 &#8212; Current role: ingestion &#8594; transform &#8594; delivery at Okta</p><p>00:18:10 &#8212; Operating models across business units and the path to production</p><p>00:22:20 &#8212; Security-first in practice: repeatable frameworks over friction</p><p>00:24:23 &#8212; Third&#8209;party risk, classification, and shrinking review cycles</p><p>00:28:00 &#8212; Policies, masking, and the need for a central governance plane</p><p>00:30:20 &#8212; Frameworks for dbt, testing, and SME guidance</p><p>00:32:11 &#8212; Will agents outwrite humans? 
Macro yes; Okta timeline nuance</p><p>00:33:48 &#8212; Central governance and agent access patterns</p><p>00:37:19 &#8212; Semantic layer as bridge and policy carrier</p><p>00:41:00 &#8212; Function limits on unstructured data and Okta Secures AI</p><p>00:42:35 &#8212; Early MCP experimentation and support use cases</p><p>00:43:03 &#8212; Roles: analytics engineers (semantics) and data engineers (scale)</p><p>00:46:10 &#8212; Enabling an org-wide agent platform with shared governance</p><p>00:47:43 &#8212; Solve governance once, serve business and engineering</p><p>00:49:30 &#8212; Simpler questions first; rethinking relational assumptions</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Demo on-demand&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/dbt-cloud-demos-with-experts?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Demo on-demand</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Inside Snowflake’s AI roadmap (w/ Chris Child)]]></title><description><![CDATA[Snowflake's VP of Product Management on the vision for open table formats, governed agents, and the future of the data engineer]]></description><link>https://roundup.getdbt.com/p/inside-snowflakes-ai-roadmap-w-chris</link><guid isPermaLink="false">https://roundup.getdbt.com/p/inside-snowflakes-ai-roadmap-w-chris</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 14 Dec 2025 14:06:35 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/5Yo0chBWt2c" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This season of The Analytics Engineering Podcast is focused on how the current data landscape is impacting the developer experience. Snowflake plays a major role in what that developer experience looks like. </p><p>In this episode, Snowflake VP of Product Management Chris Child joins Tristan to unpack Snowflake&#8217;s AI roadmap and what it means for data teams. 
They discuss the evolution from Snowpark to <a href="https://docs.getdbt.com/blog/semantic-layer-cortex">Cortex</a> and <a href="https://www.getdbt.com/blog/what-is-snowflake-intelligence-anyway">Snowflake Intelligence</a>, how to <a href="https://www.getdbt.com/blog/bring-structured-context-to-agentic-data-development-with-dbt">govern agents</a> with row- and column-level controls, and why Snowflake is investing in <a href="https://www.getdbt.com/blog/iceberg-give-it-a-rest">Apache Iceberg</a> and the <a href="https://www.snowflake.com/en/blog/open-semantic-interchange-ai-standard/">Open Semantic Interchange initiative</a>. dbt Labs recently open sourced <a href="https://www.getdbt.com/blog/open-source-metricflow-governed-metrics">MetricFlow</a>, the technology that powers the dbt Semantic Layer, to align with the goals of OSI. </p><p>Chris also shares a vision for the next five years of data engineering: fewer bespoke pipelines, more standardization and semantics, and a bigger focus on business context and data products.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://docs.getdbt.com/docs/install-dbt-extension&quot;,&quot;text&quot;:&quot;Check out the dbt VS Code extension&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://docs.getdbt.com/docs/install-dbt-extension"><span>Check out the dbt VS Code extension</span></a></p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, 
Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-5Yo0chBWt2c" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;5Yo0chBWt2c&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/5Yo0chBWt2c?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: Where have you spent your time professionally?</h3><p><strong>Chris Child:</strong> I didn&#8217;t end up in data on purpose. I found myself here through a series of hops. I was working at Redpoint Ventures and got excited by a company we invested in, RelateIQ. I left to join RelateIQ, building an intelligent CRM. 
We captured emails and meetings and built profiles of everyone you interacted with. We were acquired by Salesforce. Looking at what sales teams needed, I realized they also needed product usage data, marketing data, and campaign data, with a platform to pull it all together. That led me to Segment. I joined when it was about 50 people. Segment was mostly analytics.js then, loading different JavaScript on your webpage for tracking. We had just built the first warehouse connector to Redshift and got huge usage sending click and user data to Redshift.</p><h3>The original Redshift connector was a nightmare to work with.</h3><p>Like many startup things, one engineer built it in a week. Suddenly a ton of people used it, and enterprise customers depended on it. We had to rebuild it several times. You could see the future there. Folks I worked with went on to start companies like Census and Hightouch, thinking the CDP should be built on top of the warehouse, which Segment evolved toward. We also built a Snowflake connector because customers demanded it in addition to Redshift.</p><h3>It&#8217;s funny to think back a decade to how small Snowflake was.</h3><p>A couple customers demanded it; we built it, and we were sending a ton of data. That led to the realization that a customer data platform is one instance of a data warehouse, and there are others you need. Seeing how fast Snowflake was growing, I wanted to build the next layer of infrastructure. </p><p>I joined Snowflake seven and a half years ago. I&#8217;ve had three key roles. First, I built areas of the product: the UI, billing, product-led growth engines and free trial infrastructure, and application capabilities for connecting into and building on Snowflake. After Sridhar became CEO, he asked me to reconnect product and sales by leading solutions engineering, reporting to the CRO. Leading a global technical seller org was very different for a product person, but it helped align teams at scale. 
</p><p>About eight months ago, I returned to lead data engineering: how people bring data into Snowflake, how they transform it&#8212;spending a lot of time with dbt&#8212;and the work around Iceberg and interoperability for worlds where not all data sits in Snowflake.</p><h3>I didn&#8217;t realize the path started in investing. Are you a finance person way back?</h3><p>My undergrad is in computer science. I started programming in fifth grade on an Apple IIe, learned C before high school, and followed that thread. In college I noticed business folks often made the decisions. I wanted to learn that side. After college I joined a consulting firm, then private equity, then an MBA. I realized I didn&#8217;t want to be a finance person. I moved to venture as a bridge to building products, but I wanted to build, so I jumped into operating roles.</p><h3>Tell the story of Snowflake and AI. In the 2010s there was huge demand for easier, scalable, cloud-oriented data solutions. Then 2022 happened, ChatGPT launched, and the world changed. How did Snowflake respond, and where are you today?</h3><p>Even pre&#8209;2022 we saw customers putting their most important business data into Snowflake, then pulling data out for things they couldn&#8217;t do inside: training ML models and other analyses that SQL wasn&#8217;t a great fit for. Customers told us they didn&#8217;t like losing governance and lineage when data left. We invested in ways to bring more of that work to Snowflake. </p><p>Snowpark was the first big step: a runtime for non&#8209;SQL code (Python, Java, Scala) with APIs inspired by Spark, plus capabilities like forecasting. It&#8217;s great for some workloads, but most customers don&#8217;t train most ML models inside Snowflake yet. We also acquired Applica for document extraction using early LLM techniques, and Neeva for web search based on LLM approaches. </p><p>When ChatGPT arrived, we saw two major influences. 
First, people wanted to chat with data they&#8217;d brought into Snowflake and transformed with dbt. That&#8217;s hard because LLMs are great with unstructured data and less great at turning business questions into correct SQL. Second, LLMs are very good at writing code, including Python and even dbt code. They&#8217;re not perfect for data engineering code yet, but they help. </p><p>Our goal is to help customers activate important enterprise data safely in AI models, deploy agents at scale under existing governance, and keep up with exploding data volumes without 10x headcount.</p><h3>What are the key product pieces&#8212;Cortex, Snowflake Intelligence, etc.&#8212;in the Snowflake AI stack?</h3><p>First, you need a great data foundation. That isn&#8217;t new: get the data in one place, apply good governance and permissions, know your data, tag PII, and raise the standard of care. </p><p>AI raises the bar because agents can expose sensitive data faster than dashboards. OSI (Open Semantic Interchange) work is part of this; LLMs need explicit semantics and cataloging they can consume, not tacit knowledge hidden in downstream tools. </p><p>Companies with strong hygiene move faster with AI. Roles matter; if a product manager role has access to certain rows and columns, an agent acting within that role can safely answer questions. Agents can run inside or outside Snowflake, but should assume appropriate roles when querying.</p><p>On the AI stack, after the data foundation, Cortex provides higher&#8209;level APIs for unstructured processing, RAG, and structured processing. You can choose models (OpenAI, Anthropic, Mistral, Gemini, Llama, etc.), but most folks don&#8217;t want to manage prompts and GPUs. Cortex AI SQL lets you express intent like sentiment filters or fuzzy joins. It&#8217;s powerful for exploration but non&#8209;deterministic, so you need care in production. 
Costs map to tokens at higher abstractions, with budgets and guardrails similar to variable compute in the cloud.</p><p>At the top, Snowflake Intelligence is a UI and agent framework. You define agents with access to specific datasets and semantic models, plus gold queries and usage guidance. It looks like a chat interface over your governed data. Inside Snowflake, we&#8217;ve deployed a GTM assistant that blends product usage, Salesforce, notes, docs, and content&#8212;structured and unstructured&#8212;respecting row&#8209;level security for every seller while giving leaders broader access.</p><h3>Let&#8217;s talk open formats and Iceberg. Why lean in when it opens up the data?</h3><p>Our aim isn&#8217;t to lock up data, it&#8217;s to help customers get value. Snowflake began as a reaction to Hadoop&#8212;betting on SQL at cloud scale with our own formats and catalog because they didn&#8217;t exist then. Those proprietary pieces let us evolve quickly. Iceberg is now almost as good, and we&#8217;re contributing to make it better. </p><p>Openness is a win for customers and expands the universe of data Snowflake can query, run Cortex on, and power Intelligence with. The tradeoff is standards move slower. Variant type support is a good example&#8212;we contributed our approach and shepherded it into the v3 spec. </p><p>Next up, the community is wrestling with fine&#8209;grained access control beyond table&#8209;level policies. It&#8217;s hard and will take time, but the outcome should be better for everyone.</p><h3>Give us your view on the future of data engineering.</h3><p>Data volume is exploding, including unstructured data that&#8217;s now usable. You can&#8217;t hand&#8209;build every pipeline. Demand is also exploding as agents query more things in more ways. Teams must operate at a higher level: automate, standardize, and reduce bespoke pipelines. </p><p>Expect more shared semantic models across consumers and packaged semantics coming from systems like SAP. 
You&#8217;ll also build data&#8209;engineering agents to do work and monitor pipelines. The role looks more like architect and manager, allocating budgets, deduplicating work, and&#8212;most importantly&#8212;deeply understanding the business. The best data engineers shift from code output to data products, with clear semantics and context.</p><h3>Talk more about context.</h3><p>The day&#8209;to&#8209;day activity shifts, but the output is still data products. Great data products come with instructions, definitions, lineage, quality expectations, and how to get correct answers to common questions. </p><p>We need that context captured where work happens&#8212;models, visualization, quality systems&#8212;and made available everywhere: catalogs, agents, and UIs. As you build, you should also document, and those semantics should flow consistently into tools like Snowflake Intelligence so agents can reason correctly. </p><p>A big part of the challenge is selecting just&#8209;enough context per question.</p><h2>Chapters</h2><ul><li><p>00:01:50 &#8212; Chris&#8217;s path: RelateIQ, Segment, Snowflake</p></li><li><p>00:05:40 &#8212; Roles at Snowflake: product, solutions engineering, data engineering</p></li><li><p>00:09:00 &#8212; Snowflake and AI: foundations before ChatGPT</p></li><li><p>00:11:40 &#8212; Why keep ML and non-SQL work closer to governed data</p></li><li><p>00:13:40 &#8212; Applica and Neeva acquisitions, enterprise search context</p></li><li><p>00:14:50 &#8212; Two big AI influences: chat with data and code generation</p></li><li><p>00:16:50 &#8212; Scaling agents while preserving governance and cost controls</p></li><li><p>00:18:40 &#8212; Why governance must live at the data layer (roles, rows, columns)</p></li><li><p>00:22:00 &#8212; Inside vs. 
outside Snowflake: how agents assume roles</p></li><li><p>00:23:02 &#8212; Cortex: higher-level APIs over many LLMs</p></li><li><p>00:24:06 &#8212; AI SQL: joins/where by intent and the non-determinism tradeoff</p></li><li><p>00:27:40 &#8212; Cost models, tokens, and guardrails</p></li><li><p>00:29:10 &#8212; Snowflake Intelligence: agents over a governed foundation</p></li><li><p>00:32:10 &#8212; Open formats and Iceberg: Why Snowflake leaned in</p></li><li><p>00:36:00 &#8212; Standards tradeoffs: variant type and community progress</p></li><li><p>00:38:40 &#8212; Fine-grained access control for Iceberg: thorny but necessary</p></li><li><p>00:40:40 &#8212; The future of data engineering: scale, unstructured data, agents</p></li><li><p>00:43:20 &#8212; No more bespoke pipelines; standardized models, and semantics</p></li><li><p>00:44:50 &#8212; Data engineers as architects and business partners</p></li><li><p>00:50:00 &#8212; Code vs. context: data products and shared semantics</p></li><li><p>00:53:10 &#8212; Capturing context where work happens (models, viz, quality)</p></li><li><p>00:55:00 &#8212; Selecting just enough context for agent reasoning</p></li><li><p>00:56:30 &#8212; Closing</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Building a multimodal lakehouse for AI (w/ Chang She)]]></title><description><![CDATA[The CEO of LanceDB and Tristan go deep into the bridge between analytics and AI engineering]]></description><link>https://roundup.getdbt.com/p/building-a-multimodal-lakehouse-for</link><guid isPermaLink="false">https://roundup.getdbt.com/p/building-a-multimodal-lakehouse-for</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 23 Nov 2025 14:03:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/R5RW3LZIAO8" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome back to The Analytics Engineering Podcast! Last season, we explored a host of topics on the developer experience (<a href="https://www.youtube.com/watch?v=WidQLYon2_I&amp;t=5s">something the dbt Labs crew has been pretty vocal on recently</a>). This season, we&#8217;re expanding that theme to look at how the current data landscape is impacting the developer experience. 
<a href="https://www.getdbt.com/blog/what-is-open-data-infrastructure">Open data infrastructure</a> is on the rise; AI is pushing teams to rethink how data is modeled, governed, and scaled; and the developer experience is evolving.</p><p>In this episode, Tristan Handy sits down with Chang She&#8212;a co-creator of Pandas and now CEO of LanceDB&#8212;to explore the convergence of analytics and AI engineering.</p><p>The team at LanceDB is rebuilding the data lake from the ground up with AI as a first principle, starting with a new AI-native file format called Lance and building upward from there.</p><p>Tristan traces Chang&#8217;s journey from one of the original contributors to the pandas library to building a new infrastructure layer for AI-native data. Learn why vector databases alone aren&#8217;t enough, why agents require new architecture, and how LanceDB is building an AI lakehouse for the future.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://docs.getdbt.com/docs/install-dbt-extension&quot;,&quot;text&quot;:&quot;Check out the dbt VS Code extension&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://docs.getdbt.com/docs/install-dbt-extension"><span>Check out the dbt VS Code extension</span></a></p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" 
src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://www.youtube.com/playlist?list=PL0QYlrC86xQm83Q9deiy4euEnbw8ceu3I">Youtube</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><div id="youtube2-R5RW3LZIAO8" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;R5RW3LZIAO8&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/R5RW3LZIAO8?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Key takeaways</h2><h3>Tristan Handy: You&#8217;re the founder and creator of the Lance file format and LanceDB. Before diving into vector search and vector databases, tell us about your background. </h3><p><strong>Chang She:</strong> I love talking to analytics engineers because that&#8217;s my background. I started about 20 years ago in quantitative finance. As a junior analyst, you do a lot of data engineering and analytics, which got me into open-source Python. 
I became one of the co-authors of the pandas library&#8212;initially to solve my own problem of not wanting to do analytics engineering in Java or VBScript.</p><h3>You worked for a hedge fund?</h3><p>Yes, AQR.</p><h3>Did they know you were contributing to pandas? Hedge funds aren&#8217;t known for open source.</h3><p>My roommate and colleague at the time was Wes McKinney. He showed me a proprietary Python library he was working on. It was life-changing. I started using it and contributing. He spent about six months convincing the fund to open-source it. This was around 2010, and they were ahead of the industry in that respect.</p><h3>I didn&#8217;t know pandas started at AQR. That&#8217;s fascinating. So much of your circa-2010 analytics work was done in early pandas?</h3><p>Exactly. We went through several iterations, even debated the name. Because it was a hedge fund, there was a lot of econometrics and &#8220;panel data,&#8221; so Wes named it &#8220;pandas&#8221; for panel data analysis.</p><h3>That origin story isn&#8217;t widely known. You then founded two companies, sold one to Cloudera, and were there during an interesting time.</h3><p>Wes and I created DataPad&#8212;cloud BI before cloud BI really took off&#8212;and sold it to Cloudera. I spent about four and a half years in the Hadoop &#8220;big data&#8221; world, where I met my co-founder. He worked on HDFS at Cloudera, and several ex-Cloudera folks are at LanceDB today. After that I moved into machine learning at Tubi TV, working on recommender systems, ML serving, and experimentation/AB testing. That exposed me to embeddings. We dealt with videos, poster art images, and synopses&#8212;data that doesn&#8217;t fit neatly into pandas or even Spark data frames. That inspired me to build better infrastructure for these data types&#8212;what we now call &#8220;classical&#8221; machine learning&#8212;which led to LanceDB.</p><h3>So that&#8217;s our bridge to vectors. 
You experienced these problems at Tubi, then founded the company. And Tubi used dbt?</h3><p>Heavily. Thank you for creating it&#8212;it was critical to our stack.</p><h3>Give us a non-technical intro: what are vectors used for?</h3><p>Many people focus on the latest models and techniques. My perspective: everyone has access to similar models&#8212;your differentiation comes from your data and how effectively you connect data to AI. Vectors are a way to represent any kind of data in a form models understand: high-dimensional arrays of floating-point numbers&#8212;1,500, 3,000 dimensions, etc. Early statistical models might have a few interpretable dimensions; now you can have thousands where individual dimensions aren&#8217;t necessarily interpretable, but the space captures semantics.</p><p>Beyond RAG, vectors power internal model representations, recommender systems, and personalization&#8212;the original mainstream use case.</p><h3>Search is also a good use case. How is vector search different from full-text search or Command-F?</h3><p>Full-text search (e.g., Elasticsearch) returns documents containing the exact terms you searched. If you search for &#8220;customer,&#8221; it finds &#8220;customer/customers,&#8221; but might miss &#8220;user,&#8221; &#8220;adopter,&#8221; &#8220;organization,&#8221; etc. Vector search uses dense representations where semantically similar words and documents live near each other in high-dimensional space. Search for &#8220;customer,&#8221; and you get results that include semantically related terms.</p><h3>Would you combine vector and full-text search?</h3><p>Yes&#8212;hybrid search. Early RAG demos often used pure vector search for speed. Now enterprises need production-grade relevance. Many combine keyword and vector search with a re-ranking step to reach higher precision/recall.</p><h3>Early RAG pipelines often chunk text, embed, and call it done. 
But more thoughtful pipelines do something closer to feature engineering, right?</h3><p>Absolutely. Thought goes into what you feed the embedding model. For example: add a document- or section-level summary alongside each chunk before embedding; include multimodal features&#8212;artistic descriptions, literal captions, tags; create multiple embedding columns (e.g., different prompts/modalities) and search across them with re-ranking. High-quality retrieval requires feature-engineering-like decisions before embedding.</p><h3>Let&#8217;s talk vector file formats (Lance) and vector databases (LanceDB). My crude belief: a vector database is a standard database with additional indexes. True?</h3><p>Not wrong, but my hot take: with Lance and LanceDB, we&#8217;re building a lakehouse for multimodal data that includes vectors. Many &#8220;vector databases&#8221; are optimized only for vectors and struggle with other data types and workloads. The category needs to evolve&#8212;either toward new-generation search engines or new-generation lakehouses. We set out from day one to build the broader lakehouse, not just a vector index.</p><h3>Outline your AI-enabled data lake vision. I&#8217;m familiar with Snowflake and Databricks&#8217; lakehouse. How do you see the world differently?</h3><p>We assumed everyone would use Parquet and tried for months to support AI workloads&#8212;search, training, preprocessing&#8212;on it. We couldn&#8217;t make it work well. Talking to computer-vision and ML practitioners, no one had something effective. That gave us confidence to build a new format.</p><p>In AI you manage vectors, long documents, images, and videos. The first problem is storage. With Parquet, mixing wide blob columns with narrow metadata columns leads to out-of-memory issues due to row-group design. If you shrink row groups to fit blobs, read performance tanks.</p><p>Even once data is in Parquet, AI needs random access and secondary indexes. 
Parquet doesn&#8217;t support efficient random row access: retrieving scattered rows forces reading entire row groups. With media, that&#8217;s prohibitively expensive&#8212;both for search and for training (e.g., global shuffle). Data evolution is also hard: with table formats like Iceberg, backfills often mean copying entire datasets. Copying petabytes of media is a non-starter. These issues motivated Lance.</p><h3>I have a good mental model of Parquet with structured data. With images or video, do you put them in blob columns?</h3><p>Yes. We use Apache Arrow types. Images/audio/video are large binary columns. Vectors are fixed-width list columns (e.g., 1,536-dimensional). But Parquet&#8217;s row-group mechanics and lack of random access make these workloads painful.</p><h3>So Lance was the first thing you built. It has solid traction on GitHub. Who uses a file format&#8212;users or vendors?</h3><p>Both. Frontier labs use Lance to store training data&#8212;e.g., for image/video generation&#8212;replacing stacks like TFRecords, WebDataset, Parquet, and BigQuery. Large tech companies and vendors also build on Lance: Databricks, Tencent, Alibaba, Netflix, NVIDIA, Uber, among others.</p><h3>Databricks uses Lance?</h3><p>For parts of their AI-specific offerings.</p><h3>You&#8217;ve raised several rounds&#8212;the format is Apache-2 licensed. How do you commercialize?</h3><p>Our commercial offering is a data platform for large-scale AI production: vector search, data preprocessing, training/serving cache, and an analytics engine for curation and exploration. It supports ML training workflows and AI application development, solving the hard distributed-systems problems along the path. We partner closely with big vendors; we&#8217;re generally not competitive because goals and customer bases differ. 
Cloud providers seek platform consumption; we focus on an AI-optimized data platform for specific workloads and users.</p><h3>The commercial product is called LanceDB, but you prefer to position it not just as a database.</h3><p>Right&#8212;we&#8217;re an AI-native data platform/lakehouse for multimodal data, with Lance as the common format.</p><h3>How does this space play out over the next two to three years?</h3><p>Two big predictions. First, multimodal will be 100&#215; bigger&#8212;more usage and more data. Audio is exploding; video generation is resurging; robotics is next. Second, our data infrastructure isn&#8217;t ready for agents driving search and retrieval.</p><h3>Let&#8217;s unpack both. On multimodal: unlike structured analytics, where every company needs it, multimodal workloads seem concentrated. Do all enterprises really need this?</h3><p>I think every enterprise becomes multimodal. Take insurance: tons of documents to digitize, extract, search, and analyze; drones capturing images/video to assess risk and improvements over time. Existing businesses become more efficient; AI-native entrants gain structural advantages. Multimodal data underpins both.</p><h3>It&#8217;s a heavy lift. Will every Fortune 500 insurer build these capabilities in-house, or will vendors package them?</h3><p>Likely both&#8212;just like analytics engineering emerged as a role, with adjacent talent re-skilling. We see the same with AI engineering.</p><h3>What titles are hands-on with your product?</h3><p>AI researchers and AI engineers. Many app developers building AI features now carry the &#8220;AI engineer&#8221; title.</p><h3>On agents: how do their access patterns change platform requirements?</h3><p>RAG was one-shot: ask, retrieve, answer. Agents iterate: they decompose problems into sub-questions, refine queries and results, and run many steps in parallel. Load skyrockets&#8212;humans type slowly; agents can issue hundreds of queries simultaneously. 
Queries are more varied and selective, and agents are creative in combining modalities and sources: schemas, SQL over structured data, prior analyses and charts, document stores, image/video metadata, etc.</p><p>Traditional vector databases aren&#8217;t designed for this breadth and scale. If you bolt together multiple specialized systems, your &#8220;agent stack&#8221; balloons into a maintenance nightmare. Our approach: put all data in one place with a single system that supports vector search, keyword search, filters, key-value lookups, re-ranking, analytics, and efficient random access&#8212;on top of an AI-native file format (Lance).</p><h3>For listeners whose curiosity is piqued, any resources you recommend?</h3><p><strong>Chang She:</strong> Yes&#8212;our blog series by Weston Pace, the tech lead for Lance format. It dives into encodings, I/O, and has great reads for analytics engineers: <a href="http://lancedb.com/blog">lancedb.com/blog</a> .</p><h2>Chapters</h2><ul><li><p>00:00 &#8211; Intro: Analytics meets AI</p></li><li><p>03:20 &#8211; Chang&#8217;s background and how Pandas began</p></li><li><p>06:40 &#8211; Lessons from Cloudera and metadata</p></li><li><p>08:30 &#8211; Multimodal data and LanceDB&#8217;s origin story</p></li><li><p>10:00 &#8211; Why vector search matters (beyond RAG)</p></li><li><p>12:00 &#8211; What are vectors and why do we use them?</p></li><li><p>15:00 &#8211; Full-text vs vector search</p></li><li><p>18:00 &#8211; Feature engineering in AI use cases</p></li><li><p>21:15 &#8211; Lance format</p></li><li><p>28:00 &#8211; Storage, scale, and the problem with Parquet</p></li><li><p>35:30 &#8211; Building a business on open source</p></li><li><p>41:00 &#8211; Two big bets: multimodal data and agents</p></li><li><p>46:00 &#8211; Every company will become multimodal</p></li><li><p>50:00 &#8211; Agent access patterns will redefine data</p></li><li><p>54:00 &#8211; Why dbt-style workflows matter now more than 
ever</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Agentic coding in analytics engineering (w/ Mikkel Dengsøe)]]></title><description><![CDATA[The cofounder of SYNQ discusses his tests (and tips) with agentic coding tools]]></description><link>https://roundup.getdbt.com/p/agentic-coding-in-analytics-engineering</link><guid isPermaLink="false">https://roundup.getdbt.com/p/agentic-coding-in-analytics-engineering</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 07 Sep 2025 12:01:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/555761fd-daa8-47e7-a907-79541a9e3860_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/171688472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 
424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>What does agentic coding look like in analytics engineering? Mikkel Dengs&#248;e, co-founder at SYNQ, recently <a href="https://medium.com/@mikldd/using-ai-for-data-modeling-in-dbt-975838054cb1">wrote</a> a <a href="https://medium.com/@mikldd/using-ai-to-build-a-robust-testing-framework-4e034dfd014f">series</a> of <a href="https://medium.com/@mikldd/using-omnis-ai-assistant-on-the-semantic-layer-0572f997451d">posts</a> on his experiences as an analytics engineer with agentic coding tools. In this episode of The Analytics Engineering Podcast, he walks through a hands-on project using Cursor, the <a href="https://www.getdbt.com/product/fusion">dbt Fusion engine</a>, the <a href="https://www.getdbt.com/blog/mcp">dbt MCP server</a>, Omni&#8217;s AI assistant, and Snowflake.</p><p>Tristan and Mikkel cover where agents shine (staging, unit tests, lineage-aware checks), where they&#8217;re risky (BI chat for non-experts), and how observability is shifting from dashboards to root-cause explanations delivered to the right person at the right time. 
Along the way: practical prompts, why &#8220;one model at a time&#8221; keeps you in control, and a testing philosophy that avoids alert fatigue while catching what matters.</p><p><strong><a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__">To see real-world use cases of agentic coding and to learn directly from data and AI leaders, join us at Coalesce 2025 in Las Vegas, Oct. 13-16</a></strong>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Can 
you talk a little bit about your background?</h3><p><strong>Mikkel Dengs&#248;e:</strong> Yeah, so I can start from the beginning. I've been in data for, I think it's coming up to 15 years now, and started my career in data at a Danish shipping company, which was very much zero to one. When I came in, there was no data warehouse, and the only way we could know how many containers were shipped was by an IT guy pulling that out of the system every six months. I then spent two years there building up their data warehouse on SQL Server, which was super fun. After that, I spent five years at Google, which was a very different gear.</p><h3>That's a natural transition. Just global shipping company straight to Google.</h3><p>Exactly. And that was very much the already-at-a-hundred end of the spectrum, where, in my case, I worked with the ads data and you get a perfectly curated data table that you can work with and everything kind of works. Then after that I joined a company called Monzo. For those who are not familiar, it's a scaling fintech out of the UK and that was very much the one to a hundred. When I joined we were 30 data people, and we scaled to a hundred over two years. We had 10,000 dbt models and we built every internal tool under the sun for dbt. Super interesting. And then three and a half years ago I went on to found SYNQ alongside Peter and Steve, which is a data observability platform.</p><h3>Tell us a little bit more about SYNQ.</h3><p>We are a data observability platform that primarily works with companies that already use tools like dbt but struggle to go from important data to business-critical data. That might be customer-facing dashboards, machine learning models, or something else. They want better monitoring&#8212;we often deploy anomaly monitors&#8212;and they also want workflows such as incident management for when things go wrong. We were founded in 2022; we started out working with scale-ups and startups, and we're now also onboarding enterprises and larger companies. 
It's been a fun journey.</p><h3>In your series of blog posts, you went through the modern data stack and said, &#8220;What's the most current version of this tool and how effectively can I AI-ify that?&#8221; Whether that's using Cursor to build dbt models or using the agent experience inside of Omni&#8212;what made you decide to get into this and write about it?</h3><p>The first part of it is just: it's super fun to tinker with these tools and try them out. It's magic. And we were also building an MCP server at SYNQ, so I had a lot of interest in seeing how it works with others and what we can learn. It was also driven by wanting better conversations with our customers: when they ask about it, I can speak from the point of view of having actually tried this and seen what works and what doesn't.</p><h3>The early days of using Redshift were such a visceral experience relative to what came before. If I hadn't interacted with it directly, I wouldn't have understood how big a step change cloud data was. This feels like another one of those moments: if you don't have hands-on experience, you're not going to really get it. Fair?</h3><p>Spot on. And I think pretty much every data team should be doing this unless they have a very good reason not to. The risk and the stakes can be pretty low if you use it for internal workflows like data modeling and writing tests. You're still in control. I recommend everybody do it.</p><h3>What tasks did you try to accomplish?</h3><p>It's three different blog posts: the data modeling part, the testing part, and then exposing it in Omni's AI agent where people can ask questions about the data. There's a fourth post: once the data is live, how can you use the SYNQ MCP to do things like root-cause analysis and planning changes. I started with data modeling. 
I had raw data from different JSON sources, some XMLs, some profiles&#8212;extracted and put into Snowflake&#8212;and then did the data model.</p><h3>So the data was already loaded into Snowflake?</h3><p>Yeah, exactly. For the data modeling, I started from the sources and then worked through staging, marts, and finally metrics using the semantic layer. Each step looks a little different when you use AI tools because the tools behave differently at each layer. In terms of tooling, I used Cursor with the dbt-MCP plugged in. If you're not familiar, dbt-MCP lets you, via prompt, interact with dbt tools&#8212;execute <code>dbt build</code>, get models, or get everything upstream of a given model&#8212;so you can chain steps together without running each one yourself.</p><h3>Cursor + dbt-MCP. What model did you use?</h3><p>I just used the default in Cursor, which I believe is Claude. There's an important distinction: Cursor is really good at writing code, but it can't execute queries on your behalf. If you want to extract raw data and query Snowflake to get rows out, you have to do that in Claude Desktop. That became key. Early on, as I built models, the first thing I did was get a snapshot of sample data from Snowflake&#8212;10,000 rows of a source. I fed that into Cursor and said, &#8220;These are examples of what this data looks like.&#8221; Using that data, Cursor could model in a clever way. For example, given a column called <code>quarter</code> with values like &#8220;2025 Q1&#8221;, Cursor understood it should translate that into a datetime and do the transformations.</p><h3>I've used the dbt MCP server a decent amount&#8212;less in Cursor, more in Claude Desktop. Your stack was Cursor + Claude models + Claude Desktop. And Cursor cannot directly execute queries in Snowflake, but Claude Desktop can. Is that because there&#8217;s tool use Claude has that Cursor doesn't?</h3><p>I believe so. In Claude Desktop, if you write queries against dbt-MCP, Claude can visualize a graph, show outputs of a SQL statement, etc. 
Cursor, as far as I know, couldn't. My middle ground was to take sample data out of Snowflake, put it into a CSV, and feed that back into Cursor so it could look at raw data.</p><h3>As part of its own context window?</h3><p>Exactly. That was key for my workflow. Then when I wanted to write unit tests, I could use real data examples from the sample. Or when automatically documenting the data, I asked Cursor to specify examples in the docs based on the most common occurrences within a column. Letting Cursor peek at raw data was a core pillar.</p><h3>It's a little hacky, right? Cursor should really be able to interact directly with Snowflake or Databricks to investigate the shape of the data. Agents should be empowered to do that.</h3><p>I would say so. There might be a way I didn&#8217;t know about, but I patched the gaps by uploading samples into the context window.</p><h3>So that's the state of the art today.</h3><p>Seems so. To be clear, I think the limitation is IDE differences&#8212;Cursor vs. Claude Desktop&#8212;rather than dbt-MCP itself.</p><h3>Once you had sample data in context, did you have to suggest conversions, or did it naturally do them?</h3><p>It got the defaults pretty right, but I guided it on what I wanted from the source data. I wanted control over everything, so I asked it to do one model at a time rather than auto-generate a whole stack. That way I could review each step and stay in control.</p><h3>Your prompt workflow was &#8220;Build me a model with this name that stages the data from this table,&#8221; basically?</h3><p>Yeah. When it proposed code I didn't like, the fixes upstream were usually simple (regex to parse dates, etc.). Downstream, in marts and metrics, I started describing my ideal data product: user jobs-to-be-done and the final output. 
That&#8217;s when Cursor got creative and invented metrics I hadn&#8217;t anticipated&#8212;like &#8220;apartment price relative to time on market.&#8221; I pruned ones I didn&#8217;t want, but some were good surprises.</p><h3>Which layer did it help most?</h3><p>Testing. Modeling was good&#8212;especially staging&#8212;but testing accelerated significantly. SQL is a bit like English; for simple datasets you can express intent easily. Testing can be much harder and more verbose.</p><h3>Roughly how much more effective did you feel?</h3><p>Modeling: multiples faster. It nailed the tedious parts&#8212;regex, casting, pass-throughs&#8212;so staging/intermediate layers flew. In marts/semantic metrics, the benefit was brainstorming. It helped me think of metrics I wouldn't have.</p><h3>Did the dbt Fusion engine help?</h3><p>Yes. Fusion shows lineage and whether a column is pass-through. For example, if a column is pass-through with no transforms, don't add another <code>not_null</code> or <code>unique</code> if there's one upstream. I checked this in the IDE and codified it as a testing strategy. That's already top-10% testing hygiene.</p><h3>Did any MCP feature requests surface?</h3><p>The more context and tools the agent has, the more it can do. In the fourth post, for root cause analysis, we used the SYNQ MCP. We collect all your Git commits and have history, so the agent could correlate recent code changes with incidents. Feature requests depend on the job at hand.</p><h3>Let's move to testing&#8212;why was it the most additive?</h3><p>Testing is hard; many teams don't know how to do it and alert fatigue is common. A huge share of tests we see are <code>not_null</code>/<code>unique</code>, which doesn't reflect real data risks. The first thing I did in Cursor for testing was provide our internal testing philosophy as guidelines: test heavily at the source, don't retest pass-through columns, focus on business and metric anomalies in marts. That worked really well. 
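</p><p>As a concrete sketch, a dbt unit test for the kind of &#8220;2025 Q1&#8221; quarter parsing described earlier takes roughly this shape (the model and column names here are illustrative, not taken from the actual project):</p>

```yaml
# Hypothetical example: assumes a staging model `stg_listings` that parses a
# raw `quarter` string (e.g. "2025 Q1") into a `quarter_start_date` column.
unit_tests:
  - name: quarter_string_parses_to_date
    model: stg_listings
    given:
      - input: source('raw', 'listings')
        rows:
          - {listing_id: 1, quarter: "2025 Q1"}
          - {listing_id: 2, quarter: "2025 Q3"}
    expect:
      rows:
        - {listing_id: 1, quarter_start_date: "2025-01-01"}
        - {listing_id: 2, quarter_start_date: "2025-07-01"}
```

<p>Because the expected rows are pinned down explicitly, a test like this catches regressions in the parsing logic without ever touching production data.</p><p>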
For sources and staging, it generated relevant tests. Then for marts, I asked for unit tests and gave it a thousand sample rows from Snowflake. It wrote very relevant unit tests I&#8217;d otherwise spend a lot of time on.</p><h3>Examples?</h3><p>Simple ones like: when you pass a string value in the date column, does it transform correctly to datetime and match the expected format? These just worked. Then at the metric level, it looked at raw data and proposed assumptions&#8212;like square-meter price should be between X and Y&#8212;sometimes segmenting by postcode. Very thoughtful, though I'd replace static thresholds with anomaly monitors so they don't go stale as prices move.</p><h3>So at least 5&#215; on testing?</h3><p>At least. Apart from swapping static thresholds for anomaly detection, it nailed testing and did so in a lineage-aware, layer-appropriate way.</p><h3>Tell me about the BI layer.</h3><p>Many teams start at the BI layer with a chat interface. I think that's risky because it's used by business users and you only get so many chances before trust drops. I moved into Omni. You create a &#8220;topic&#8221; (a data model you can join with others) and then specify an AI context: instructions for how the LLM should behave. For example: if a user asks about price, always return square-meter price; never make up fields not present in the mart; if asked about provenance, mention the source. Writing AI context is a new skill for our industry.</p><h3>Were you using Omni&#8217;s AI assistant to create assets faster, or to let users self-serve?</h3><p>The latter&#8212;so users could ask questions instead of going to a dashboard. It could have been any BI tool with similar functionality; we just use Omni internally.</p><h3>And how was the experience as a consumer?</h3><p>Amazing when it works, but I'd hesitate to give my VP of Marketing access. It gets things wrong maybe one in five times, and it's not obvious why if you're not a data person. 
For analysts doing exploratory work, it's great&#8212;they can inspect and dig in. I wouldn't replace company-wide dashboards with a chat bot yet. Omni does log freeform queries and feedback, so there's a path to iterate the AI context over time.</p><h3>The last thing you did was use AI plus SYNQ to monitor production infrastructure. What does observability look like in the future? Historically it's looked like dashboards&#8212;Datadog for data pipelines. Is it just more effective monitors, or fundamentally different?</h3><p>Fundamentally different. We&#8217;re heading to a place where observability tools can tell you what's wrong at the right time, with just the right context, delivered to the right person&#8212;inside or outside the data team. Done well, there may be few dashboards; instead you get an LLM-summarized root cause delivered from a monitor that might be auto-created. Less &#8220;active tool you poke at,&#8221; more &#8220;proactive explanation.&#8221;</p><h3>Still technical observability (pipelines/data issues), or business observability?</h3><p>More the former. Teams at the edges&#8212;Sales Ops managing Salesforce, engineering teams creating web events&#8212;often need to be notified about data issues. Business KPI movements require a different experience for marketers, etc.</p><h3>Automated remediation?</h3><p>Gradual. You can imagine an issue occurs without a dedicated test; the system proposes a new test. But 80% of issues come from source systems elsewhere (someone typing in Salesforce), and closing that loop is still hard. In the article&#8217;s fourth part, we had a data issue and I asked the SYNQ MCP through Claude Desktop to do root cause analysis. It walked the same steps a data person would (inspect the model, check errors, examine lineage and upstreams, review recent commits) and documented each step on the way to the root cause. That works now.</p><h3>At the beginning you said there&#8217;s no good reason not to use these tools today. 
What reasons do you hear for not trying?</h3><p>People are busy. But if you look at a risk curve, lowest risk is modeling and testing&#8212;you're in the driver's seat and it boosts productivity. Higher risk is replacing your BI tool with a chat bot; higher still is customer-facing experiences. The first two are hard to argue against.</p><h3>Enterprise IT approvals might be one blocker&#8212;approved models, data access, etc.</h3><p>True. For example, our MCP can query raw data to detect if an issue happens in a segment, and enterprises might hesitate there. Also, &#8220;MCP&#8221; as a term can be confusing. But it's actually simple and explainable, not a black box. Setting up dbt-MCP can still feel hacky in enterprises; if it lived natively in cloud environments, it&#8217;d be easier to adopt.</p><h3>You can set it up locally&#8212;no permissions/procurement&#8212;and just play. We also shipped the MCP server as a remote MCP in cloud, though that introduces auth/permissions considerations.</h3><p>If I had to pick a persona, it's the analyst. Analysts have had a tough decade: more tools, harder workflows, less time to tinker. MCPs and AI workflows are a turning point. At Monzo, we had a philosophy that you should be able to have an idea on your commute and have it implemented by midday. As we grew to 10,000 dbt models and long CI checks, that faded. I can see a world where this returns. MCPs can help. I'm excited.</p><h3>I love that. Analytics engineers think &#8220;infrastructure, correctness.&#8221; Analysts think &#8220;idea to validation fast.&#8221; Excel was always the analyst&#8217;s best friend because it's fast and flexible. MCPs make it easy to plug tools together and get answers quickly again.</h3><p>One company we work with&#8212;Voi, a scooter company out of Sweden&#8212;has a strong data leader, Magnus, who is very bought into metrics. Their data team doesn't produce dashboards; they produce metrics. 
In an AI world with MCPs and agentic workflows, that looks like the right call.</p><h3>I believe there's no such thing as the right BI tool&#8212;different tools have different trade-offs. Probably true for models/IDEs too: Claude Desktop vs. Claude Code vs. Cursor&#8212;no single &#8220;right answer&#8221; as long as the underlying context and metric definitions are shared.</h3><p>Agreed. What really matters across workflows: consistent metric definitions, documentation for columns and fields, and high-quality data. Those foundations matter even more when an LLM is in the loop; you may not have a human sanity-checking every result.</p><h2>Chapters</h2><ul><li><p><strong>00:00</strong> &#8212; Tristan&#8217;s intro</p></li><li><p><strong>01:10</strong> &#8212; Mikkel&#8217;s background: shipping &#8594; Google &#8594; Monzo &#8594; SYNQ</p></li><li><p><strong>03:08</strong> &#8212; What SYNQ does (data observability for business-critical data)</p></li><li><p><strong>04:15</strong> &#8212; Running the experiment</p></li><li><p><strong>06:23</strong> &#8212; Scope: modeling, testing, BI agent, observability</p></li><li><p><strong>07:17</strong> &#8212; Tooling: Cursor + dbt MCP server + Snowflake + Omni</p></li><li><p><strong>09:38</strong> &#8212; Sampling real data into the agent&#8217;s context</p></li><li><p><strong>13:14</strong> &#8212; Modeling workflow: one model at a time</p></li><li><p><strong>15:14</strong> &#8212; Where agents help most: testing &gt; modeling</p></li><li><p><strong>18:10</strong> &#8212; dbt Fusion engine: lineage-aware checks, fewer redundant tests</p></li><li><p><strong>19:50</strong> &#8212; Feature requests and root-cause via commit history</p></li><li><p><strong>20:57</strong> &#8212; Testing philosophy: source-heavy, pass-through aware, metric-level</p></li><li><p><strong>22:49</strong> &#8212; Unit tests from samples; thresholds vs anomaly monitors</p></li><li><p><strong>25:10</strong> &#8212; BI agents: great for analysts, risky for broad 
rollout</p></li><li><p><strong>31:54</strong> &#8212; The future of observability: explain first, dashboards second</p></li><li><p><strong>36:10</strong> &#8212; Adoption curve: safe places to start</p></li><li><p><strong>40:49</strong> &#8212; Analyst superpowers return</p></li><li><p><strong>42:04</strong> &#8212; Metrics over dashboards</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Under the hood of Apache Iceberg (w/ Christian Thiel)]]></title><description><![CDATA[The cofounder of Lakekeeper walks Tristan through the state of the Iceberg ecosystem]]></description><link>https://roundup.getdbt.com/p/under-the-hood-of-apache-iceberg</link><guid isPermaLink="false">https://roundup.getdbt.com/p/under-the-hood-of-apache-iceberg</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 24 Aug 2025 13:03:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/431e88b6-61e9-4287-8806-61a6027eb357_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, 
https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png" width="1456" height="364" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1203274,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/171688472?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!C5Cy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 
424w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 848w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1272w, https://substackcdn.com/image/fetch/$s_!C5Cy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4300a1a6-b64a-467e-b1fd-453de570692d_3168x792.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>If you're a data practitioner, you likely understand Iceberg as a user, why it's important, and how it's changing the way that we build data systems. But you may not know a lot about what's going on beneath the surface.</p><p>There are multiple ways to interface with Iceberg catalogs, and multiple versions of the Iceberg REST spec. There are several leading catalogs that implement that spec. All of this sits in an ecosystem that includes companies of all sizes, in proprietary and open-source code, and in academic and commercial contexts.</p><p>In a few years, all this ambiguity will be behind us, but right now it's very much evolving in real time. To get an update on the status of the Iceberg ecosystem and to walk through all the developments, Tristan talks with Christian Thiel. 
Christian is one of the lead architects of Lakekeeper, one of the most widely used Iceberg catalogs.</p><p><strong><a href="https://coalesce.getdbt.com/event/21662b38-2c17-4c10-9dd7-964fd652ab44/summary/?utm_medium=social&amp;utm_source=substack&amp;utm_campaign=q3-2026_coalesce-2025_aw&amp;utm_content=coalesce____&amp;utm_term=all_all__">To learn more from some of the leaders in the Iceberg ecosystem, join us at Coalesce 2025 in Las Vegas, Oct. 13-16</a></strong>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h2>Walk us through your background</h2><p><strong>Christian Thiel:</strong> I started in natural language 
processing, then moved into machine learning applications in manufacturing. Like many people, I realized that the biggest barrier wasn&#8217;t the algorithms but the data&#8212;its availability, quality, and accessibility. That led me deeper into data architecture and engineering, eventually to building Lakekeeper.</p><h2>What is Lakekeeper, and what are you building now?</h2><p>Lakekeeper is an Iceberg catalog implementation&#8212;a technical requirement for building distributed, composable analytic systems based on Apache Iceberg. But our vision goes beyond that. We see the future in data collaboration and reliable sharing of data, supported by clear contracts.</p><h2>For listeners new to Iceberg, what makes it so important?</h2><p>Iceberg allows organizations to store data once, in an open format, and then use the compute engine best suited for each workload. It&#8217;s a foundation for building modern, composable data platforms while avoiding vendor lock-in. If there&#8217;s one thing that should be open, it&#8217;s the data at the center of your platform.</p><h2>Some folks might say this sounds like Hadoop all over again&#8212;lots of open standards that are hard to integrate. Why is this time different?</h2><p>The ecosystem has matured. Even big vendors like Snowflake and Databricks are embracing Iceberg, which shows there&#8217;s a strong shift toward openness. Plus, the tooling and infrastructure are much easier to deploy today. A modern Iceberg setup is far less complex than a Hadoop environment used to be.</p><h2>Let&#8217;s talk about what&#8217;s happening under the hood. How does Iceberg work?</h2><p>Iceberg organizes data using a metadata hierarchy. At the top, there&#8217;s a JSON file that stores high-level table information: snapshots, schema, and locations. Below that are manifests and other layers that keep track of files. 
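</p><p>Sketched out, those layers look roughly like this (the paths and file names here are illustrative, not real ones):</p>

```yaml
# Illustrative shape of an Iceberg table's metadata tree (example paths only)
table_metadata: metadata/v42.metadata.json   # schema, partition spec, snapshot log
current_snapshot:
  manifest_list: metadata/snap-42.avro       # one entry per manifest in this snapshot
  manifests:
    - metadata/manifest-0.avro               # lists data files with per-column stats
data_files:
  - data/part-00000.parquet                  # the actual rows
```

<p>Each commit writes a new top-level metadata file rather than mutating files in place, which is how readers keep seeing a consistent snapshot while writers work.</p><p>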
This hierarchy is what makes things like time travel, atomic transactions, and schema evolution possible.</p><h2>What about ongoing maintenance?</h2><p>There are two key tasks. First, expiring old snapshots so you don&#8217;t accumulate unnecessary files. Second, compaction&#8212;combining many small files into larger ones.</p><h2>Catalogs are another critical piece. What role do they play?</h2><p>Catalogs manage the top layer of metadata and coordinate transactions. They make atomic updates possible, allow multiple writers, and handle governance&#8212;things like access control and multi-table transactions.</p><h2>How enterprise-ready is Iceberg today?</h2><p>Very ready. A year ago, there were still gaps, but today, performance and feature parity with native tables on platforms like Snowflake and BigQuery are strong. Governance and authorization models are still evolving, and different catalogs implement them differently, but the core functionality is there.</p><h2>Speaking of catalogs, how should someone pick between options like Lakekeeper, Polaris, Unity, AWS Glue, or Gravitino?</h2><p><strong>Christian Thiel:</strong> It depends on priorities. Lakekeeper focuses on performance, extensibility, and ease of use. Polaris is developer-focused but less user-friendly. Unity is tightly integrated into Databricks. Glue now supports the Iceberg REST spec, which makes it more interoperable than before. Gravitino is another option aimed at enterprise-scale environments.</p><h2>Recently, DuckDB announced DuckLake. What&#8217;s your take on that?</h2><p>It&#8217;s interesting, but there are two concerns. First, it uses a database schema directly for the catalog, which creates interoperability issues&#8212;similar to the early JDBC catalog in Iceberg that the community eventually moved away from.
Second, it was built without community involvement, and openness without adoption isn&#8217;t really openness.</p><p>That said, for heavy DuckDB users, it could offer optimizations that make queries extremely fast, and if the broader ecosystem adopts it, it could become a viable open format.</p><h2>What&#8217;s next for Lakekeeper?</h2><p>We&#8217;re continuing to invest in table optimization, enterprise features, and data collaboration tools. Our vision is what we call the &#8220;unbreakable lakehouse,&#8221; where contracts and collaboration guardrails make shared data more reliable. Long-term, we see Lakekeeper as enabling truly collaborative, open data ecosystems.</p><h2>Chapters</h2><ul><li><p><strong>00:00 &#8211; Introduction</strong></p><p>Tristan Handy introduces the episode and the focus on Apache Iceberg.</p></li><li><p><strong>01:40 &#8211; Christian Thiel&#8217;s background</strong></p><p>From natural language processing to data engineering.</p></li><li><p><strong>04:30 &#8211; Introduction to Lakekeeper</strong></p><p>What Lakekeeper is and its role in the Iceberg ecosystem.</p></li><li><p><strong>06:00 &#8211; Why Iceberg matters</strong></p><p>How open table formats enable flexibility and reduce vendor lock-in.</p></li><li><p><strong>11:40 &#8211; How Iceberg works under the hood</strong></p><p>Metadata hierarchy, catalogs, and how state is managed.</p></li><li><p><strong>21:30 &#8211; Maintenance and optimization</strong></p><p>Snapshot expiration, compaction, and keeping tables performant.</p></li><li><p><strong>24:20 &#8211; Catalogs and governance</strong></p><p>Access control, multi-table transactions, and security.</p></li><li><p><strong>31:40 &#8211; Enterprise readiness</strong></p><p>How Iceberg is evolving for production use in large organizations.</p></li><li><p><strong>42:10 &#8211; Choosing the right catalog</strong></p><p>Overview of Lakekeeper, Polaris, Unity, Glue, and Gravitino.</p></li><li><p><strong>47:20 &#8211; DuckLake
discussion</strong></p><p>Pros, cons, and ecosystem adoption challenges.</p></li><li><p><strong>52:00 &#8211; The future of Lakekeeper</strong></p><p>Data contracts, collaboration, and building the &#8220;unbreakable lakehouse.&#8221;</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The pragmatic guide to AI agents in the enterprise (w/ Sean Falconer) ]]></title><description><![CDATA[Demystifying AI agents with Confluent's senior director of AI strategy]]></description><link>https://roundup.getdbt.com/p/the-pragmatic-guide-to-ai-agents</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-pragmatic-guide-to-ai-agents</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 03 Aug 2025 13:02:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/16b80e19-0489-465c-8dba-64088edba31f_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>What does it mean to be agentic? Is there a spectrum of agency? </p><p>In this episode of The Analytics Engineering Podcast, Tristan Handy talks to Sean Falconer, senior director of AI strategy at Confluent, about AI agents. They discuss what truly makes software "agentic," where agents are successfully being deployed, and how to conceptualize and build agents within enterprise infrastructure. </p><p>Sean shares practical ideas about the changing trends in AI, the role of basic models, and why agents may be better for businesses than for consumers. 
This episode will give you a clear, practical idea of how AI agents can change businesses, instead of being a vague marketing buzzword.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3><strong>Sean, can you give us the TLDR on your career and what you're working on today?</strong></h3><p><strong>Sean Falconer: </strong>I've always worked at the intersection of data, engineering, and AI. From academia studying computer science, into industry as a founder, then to Google, I worked on conversational systems and privacy/security in AI. 
Currently, at Confluent, I'm leading our AI product strategy, balancing both technical and go-to-market roles.</p><h3><strong>You moved from being deeply technical into marketing and sales. What drove that transition?</strong></h3><p>I was forced into it as a founder. Initially uncomfortable, but it taught me huge respect for marketing and sales. I had to learn by making many mistakes, eventually building out entire marketing and sales functions. I realized how challenging and critical these roles are.</p><h3><strong>You were at Google before ChatGPT launched. Did you foresee the transformative nature of these technologies?</strong></h3><p>Honestly, no. Having seen earlier disappointments in conversational AI (like Microsoft's Alice), I was skeptical initially, even as ChatGPT emerged. It wasn&#8217;t obvious we'd soon experience this revolution.</p><h3><strong>You&#8217;ve written about three waves of AI. Can you describe these?</strong></h3><p>Yes. Wave one was predictive AI, traditional ML models trained for specific tasks like fraud or spam detection&#8212;effective but rigid. Wave two introduced generative AI, or foundation models, trained on vast general datasets, flexible but lacking specific business context. The third wave, agentic AI, involves AI systems that can reason, dynamically choose tasks, gather information, and perform actions as a more complete software system.</p><h3><strong>Do foundation models replace traditional ML methods?</strong></h3><p>Sometimes they can, but it doesn&#8217;t always make sense. An LLM might do sentiment analysis well enough, but a traditional model may be more efficient and cheaper. Think of using an LLM as cutting steak with a chainsaw&#8212;possible, but unnecessary.</p><h3><strong>Let's clarify "agents." What makes software truly agentic?</strong></h3><p>It&#8217;s software that can dynamically decide its own control flow: choosing tasks, workflows, and gathering context as needed. 
Realistically, current enterprise agents have limited agency to ensure reliability. They're mostly workflow automations rather than fully autonomous systems.</p><h3><strong>You mentioned a spectrum of agency. Is this similar to autonomy in self-driving cars?</strong></h3><p>Exactly. Highly autonomous agents are appealing but not practical yet. Most enterprise success stories involve structured workflows with clearly defined boundaries.</p><h3><strong>Why have agents taken off more in enterprises than consumer apps?</strong></h3><p>Enterprises have many well-defined, high-value tasks perfect for automation. Consumer scenarios demanding high agency&#8212;like planning complex trips&#8212;are still too unreliable. Enterprises can benefit significantly even from limited agentic capability.</p><h3><strong>Is an agent just a microservice?</strong></h3><p>In many ways, yes. An agent functions like a microservice with extra capabilities (using LLMs for decisions). Deployment considerations like state management and long-running tasks differ slightly, but fundamentally it&#8217;s similar.</p><h3><strong>What tools and frameworks help build effective agents?</strong></h3><p>Start with frontier models like GPT-4 or Claude. Frameworks include LangChain, Microsoft Autogen, and CrewAI. But for real-world deployment, treat it as rigorous software engineering with observability, scalability, and robustness in mind.</p><h3><strong>Are organizational barriers bigger than technical challenges?</strong></h3><p>Yes. AI efforts are often mistakenly tasked to data science teams rather than cross-functional software teams. Successful companies create dedicated teams blending software engineering skills and data expertise to build reliable agentic systems.</p><h3><strong>What pitfalls should teams avoid?</strong></h3><p>Avoid building monolithic agents. Break systems into smaller, well-defined units in a multi-agent architecture. 
Use event-driven frameworks to avoid rigid, hard-to-maintain dependencies.</p><h2>Chapters</h2><ul><li><p>[00:00] Introduction: What's all the hype about agents?</p></li><li><p>[01:10] Meet Sean Falconer: A journey from engineer to AI strategist</p></li><li><p>[04:10] Learning marketing as an engineer-founder</p></li><li><p>[05:50] Inside Google's AI efforts before ChatGPT</p></li><li><p>[09:00] What does it mean to run AI strategy?</p></li><li><p>[10:45] Three waves of AI: Predictive, Generative, and Agentic</p></li><li><p>[16:30] Will foundation models replace traditional ML?</p></li><li><p>[18:30] Defining agents clearly: Beyond the buzzword</p></li><li><p>[22:00] The spectrum of agency: From controlled workflows to open-ended tasks</p></li><li><p>[25:30] Why agents fit better in enterprises than consumer apps</p></li><li><p>[28:00] Agents as microservices: A practical view</p></li><li><p>[35:00] What tech stack is needed to build effective agents?</p></li><li><p>[37:50] Organizational challenges in adopting agents</p></li><li><p>[39:30] Models that are favorites for developers</p></li><li><p>[43:30] Why software engineers are best placed to build agents</p></li><li><p>[46:00] The technical stumbling blocks in building agents</p></li><li><p>[48:00] Concluding thoughts: Beyond POCs to production agents</p></li></ul><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 60,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[How Amazon S3 works (w/ Andy Warfield)]]></title><description><![CDATA[Go under the hood of Amazon S3 with AWS engineering leader Andy Warfield&#8212;from virtualization to Iceberg]]></description><link>https://roundup.getdbt.com/p/how-amazon-s3-works-w-andy-warfield</link><guid isPermaLink="false">https://roundup.getdbt.com/p/how-amazon-s3-works-w-andy-warfield</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 20 Jul 2025 12:02:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92b37acf-08ac-4dac-b59f-123b21df7011_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this season of the Analytics Engineering podcast, Tristan is deep into the world of developer tools and databases. If you're following us here, you've almost definitely used Amazon S3 and its Blob Storage siblings at Microsoft and Google. They form the foundation for nearly all data work in the cloud. In many ways, it was the innovations that happened inside of S3 that unlocked all of the progress in cloud data over the last decade. </p><p>In this episode, Tristan talks with Andy Warfield, VP and senior principal engineer at AWS, where he focuses primarily on storage. They go deep on S3, how it works, and what it unlocks.
They close out talking about Iceberg, S3 table buckets, and what this all suggests about the outlines of the S3 product roadmap moving forward.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><h3>Operating systems, garage sales, and Xen</h3><p><strong>Tristan Handy: You&#8217;ve done a lot over the last 20 years. Before we get into specifics, can you just share a little about your journey as a software engineer?</strong></p><p><strong>Andy Warfield:</strong> I just like playing with computers.  I studied computer science in Ontario for undergrad, then moved to Vancouver for grad school, then to the UK for a PhD. 
I worked on operating systems, low-level stuff. I got to work on a hypervisor called Xen, which ended up being used by a lot of cloud providers, including Amazon.</p><p>After that, I did a couple of startups, one around Xen. Then I became a professor at UBC, teaching operating systems, networking, and security. Later, I did another startup in storage, and eventually I joined Amazon.</p><p>Now I have this highfalutin role&#8212;VP and engineer&#8212;working across S3, other storage services, and now a bunch of analytics services too. I get to cause trouble in lots of different parts of the cloud.</p><p><strong>VP slash distinguished engineer&#8212;does that mean you just get to march around telling people how to improve their stuff?</strong></p><p>People love that! I&#8217;d say about half the time I&#8217;m causing trouble&#8212;starting things and encouraging new ideas&#8212;and the other half I&#8217;m helping teams dig out from those ideas. Sometimes I take over a team if we&#8217;re doing something especially interesting or innovative, just so I can be closer to the action.</p><p><strong>That sounds like a pretty good gig if you can get it.</strong></p><p>It&#8217;s amazing. I&#8217;ve been here nearly eight years, and I still love this job.</p><div><hr></div><h3>The rise of virtualization and the origin of Xen</h3><p><strong>I want to talk about Xen. You said you were always interested in operating systems, which is kind of a niche fascination. What drew you in?</strong></p><p>When I was a kid, we didn&#8217;t have much money, so I built computers from garage sale parts in Ottawa. In high school, I found this federal government warehouse that sold off old equipment. I started a little business buying pallets of hardware for cheap, fixing them up, and reselling.</p><p>It was chaotic&#8212;but I learned a lot. I dealt with machines like IBM DisplayWriters with 8-inch floppy disks and massive dot-matrix printers. 
Getting them working meant diving into their software and systems.</p><p>Eventually I played with Linux, hacked on the kernel, and that all led me into OS research and development.</p><p><strong>Tristan: So what is a hypervisor, and why did virtualization become so important in the 2000s?</strong></p><p><strong>Andy:</strong> There were two big drivers: server utilization and isolation.</p><p>Companies had racks full of 1U servers, most of which sat idle most of the time. But they couldn&#8217;t share workloads because apps weren&#8217;t isolated well&#8212;config conflicts, shared resources, etc.</p><p>Virtualization allowed multiple operating systems to run on the same hardware, with isolation. It also let you consolidate servers, which had big cost and efficiency benefits.</p><p>There was also a technical challenge: x86 processors weren&#8217;t designed to be virtualized. That made it a really interesting research problem. We wanted to see if it could even be done&#8212;and done efficiently.</p><p><strong>Tristan: And Intel eventually started building virtualization support into the hardware?</strong></p><p><strong>Andy:</strong> Exactly. Our work on Xen and similar projects showed it was possible. That pushed Intel and AMD to add features like VT-x, which made it easier and more performant to run hypervisors.</p><p><strong>Tristan: How did AWS end up using Xen?</strong></p><p><strong>Andy:</strong> I wasn&#8217;t part of those internal conversations, but the story goes that a small startup in Cape Town, South Africa, was building a control plane for Xen. That team got picked up by AWS and became the basis for EC2.</p><div><hr></div><h3>Understanding Amazon S3</h3><p><strong>Tristan: Let&#8217;s switch to S3. I think a common mental model is that S3 is just a big pool of SSDs. But that&#8217;s clearly not the whole story. 
How do you explain what S3 actually is?</strong></p><p><strong>Andy:</strong> That&#8217;s one of my favorite questions.</p><p>Early on, S3 was like a storage locker. You&#8217;d rent space to stash things you didn&#8217;t need right away&#8212;backups, static files, CDN origins. Latency wasn&#8217;t great, but durability and availability were.</p><p>Things really changed when the Hadoop community built S3A&#8212;an adapter to let Hadoop use S3 instead of HDFS. Suddenly, we had people doing real analytics on S3. The system had enough drives to support massive parallel reads.</p><p>Today, workloads are way more demanding. Performance, consistency, and latency matter. We&#8217;ve been evolving the system constantly to meet those needs.</p><p><strong>Tristan: Are we talking about billions of hard drives?</strong></p><p><strong>Andy:</strong> I can&#8217;t share exact numbers, but yes&#8212;it's a lot of hard drives. Some of our largest customers have data spread across <em>millions</em> of drives. And most drives are shared across multiple customers.</p><p><strong>Tristan: And these aren&#8217;t SSDs?</strong></p><p><strong>Andy:</strong> Mostly spinning disks, actually. Hard drives are terrible at latency, but they&#8217;re cheap and good for bursty workloads. Spreading your data across many disks lets you take advantage of parallelism.</p><div><hr></div><h3>S3&#8217;s durability, performance, and scale</h3><p><strong>Tristan: Let&#8217;s talk about S3&#8217;s durability promise: 11 nines. How do you achieve that?</strong></p><p><strong>Andy:</strong> We use erasure coding&#8212;a form of RAID-like redundancy that lets you split data into parts and parity blocks. Then we store those shards across different availability zones.</p><p>We constantly monitor for failures. Disks die all the time, so we have fleets of processes repairing and maintaining durability. It&#8217;s not static. 
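</p><p><em>As a toy illustration of the idea (not S3&#8217;s actual scheme, which uses wider erasure codes and cross-availability-zone placement): split the data into shards, store an XOR parity shard alongside them, and any single lost shard can be rebuilt from the survivors.</em></p>

```python
from functools import reduce

# Toy erasure coding: k data shards plus one XOR parity shard.
# Illustrative only; production systems use wider Reed-Solomon-style codes.

def make_shards(data: bytes, k: int):
    """Split data into k equal data shards plus one parity shard."""
    if len(data) % k:
        data += b"\x00" * (k - len(data) % k)  # pad to a multiple of k
    size = len(data) // k
    shards = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*shards))
    return shards + [parity]

def rebuild(shards, lost):
    """Reconstruct one missing shard by XOR-ing all surviving shards."""
    survivors = [s for i, s in enumerate(shards) if i != lost]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

shards = make_shards(b"hello iceberg!", k=3)
assert rebuild(shards, lost=1) == shards[1]  # a dead "disk" is recoverable
```

<p><em>Conceptually, the repair fleets are processes running this kind of reconstruction continuously as disks fail, so redundancy is restored faster than it is lost.</em></p><p>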
It&#8217;s a living system.</p><p><strong>Tristan: You must have incredibly precise failure models.</strong></p><p><strong>Andy:</strong> We do. We track failure rates, temperature sensitivity, vendor behavior&#8212;everything. That allows us to be proactive and surgical in how we manage risk.</p><div><hr></div><h3>From Parquet to Iceberg to S3 table buckets</h3><p><strong>Tristan: I want to talk about table formats. Parquet is everywhere now. And then we got Hive Metastore, then Iceberg. Why did S3 launch table buckets?</strong></p><p>Parquet is great, but it&#8217;s just files. Customers kept asking for more structured semantics: schema evolution, upserts, ACID transactions.</p><p>We saw Iceberg adoption grow rapidly&#8212;especially among our largest analytics customers. But they were struggling with operational complexity: too many small files, custom compactors, brittle catalogs.</p><p>So we launched S3 table buckets to bring native Iceberg support to S3. That includes:</p><ul><li><p>Automatic compaction</p></li><li><p>A REST catalog</p></li><li><p>High-performance access</p></li></ul><p>We wanted to make it easier to treat Iceberg as a storage primitive, not just an analytics backend.</p><p><strong>So this is a shift in philosophy&#8212;S3 isn&#8217;t just object storage, it&#8217;s now table-aware?</strong></p><p>Exactly. Historically, S3 was just where you stored objects. Now, we&#8217;re thinking more about what those objects <em>mean</em>.</p><p>We also launched S3 object metadata tables&#8212;a way to semantically describe and query your object store, especially useful for AI workloads using retrieval-augmented generation (RAG).</p><div><hr></div><h3>The future of open data and S3</h3><p><strong>What does the future of S3 look like? Where&#8217;s this going?</strong></p><p>We&#8217;re headed toward more structure, more semantics, and more performance.</p><p>Inference workloads are scaling fast. 
AI models are hitting S3 hundreds of thousands of times per second to do vector lookups. That&#8217;s changing how we think about indexing, metadata, and latency.</p><p>We want to make S3 the best place to do open, flexible, high-scale data work&#8212;from tables to training data to retrieval.</p><h2>Chapters</h2><p><strong>[01:42] Meet Andy Warfield</strong></p><p>Andy shares his background, including startups, professorship, and his current role as VP &amp; Senior Principal Engineer at AWS.</p><p><strong>[05:10] From garage sales to hypervisors</strong></p><p>Andy describes his early passion for hardware, OS development, and the origin story behind the Xen hypervisor.</p><p><strong>[08:50] Why virtualization took off in the 2000s</strong></p><p>Exploring why isolation, utilization, and technical curiosity fueled the rise of hypervisors.</p><p><strong>[14:30] Xen vs. VMware and the road to AWS</strong></p><p>How Xen became the default for EC2 and the technical differences between virtualization approaches.</p><p><strong>[17:35] The origin of EC2 and S3</strong></p><p>How a team from Cape Town helped launch AWS compute&#8212;and the early days of cloud services.</p><p><strong>[20:00] What is S3, really?</strong></p><p>Andy breaks down the mental model behind S3: not just object storage, but a scalable data platform.</p><p><strong>[22:49] How many drives? 
More than you think</strong></p><p>Why S3 storage spans millions of drives&#8212;and how AWS uses scale to deliver performance.</p><p><strong>[28:10] The 11 nines durability model</strong></p><p>Inside S3&#8217;s approach to reliability, failure tolerance, and background repairs using erasure coding.</p><p><strong>[32:00] Tail latency and engineering for bursty workloads</strong></p><p>Why slow requests matter, and how S3 teams optimize for streaming, AI, and analytics use cases.</p><p><strong>[35:20] Iceberg, metadata, and table buckets</strong></p><p>The emergence of Apache Iceberg as a table format&#8212;and AWS&#8217;s new structured storage approach.</p><p><strong>[38:00] Why S3 added a REST catalog and compaction</strong></p><p>How AWS is simplifying the operational burden of working with Iceberg at scale.</p><p><strong>[40:00] A new mental model for object storage</strong></p><p>S3 is no longer just about storing files&#8212;it&#8217;s about managing semantics, lineage, and trust.</p><p><strong>[44:00] Looking ahead: S3, RAG, and semantic metadata</strong></p><p>How S3 is preparing for the next wave of AI, inference, and context-aware applications.</p><p><strong>[47:20] Is Iceberg ready for enterprise?</strong></p><p>Andy shares thoughts on enterprise readiness, performance tradeoffs, and real-world adoption of table formats.</p><p><strong>[49:05] Wrap-up and reflections</strong></p><p>Tristan and Andy reflect on the conversation and where data infrastructure is headed next.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[It is time to take agentic workflows for data work seriously]]></title><description><![CDATA[Mission: 0 to semantic layer in two hours]]></description><link>https://roundup.getdbt.com/p/it-is-time-to-take-agentic-workflows</link><guid isPermaLink="false">https://roundup.getdbt.com/p/it-is-time-to-take-agentic-workflows</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 29 Jun 2025 12:53:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/43df1d45-822b-4677-acfe-e16a64853ca1_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last week, I cleared two hours on my calendar to do a deep dive into the current state of agentic development for data work.</p><p>Specifically, I gave myself a challenge - could I go from a never-before-seen dataset to a production-ready Semantic Layer using a combination of tools:</p><ul><li><p>An agentic coding CLI (I used <a href="https://www.anthropic.com/claude-code">Claude Code</a> for this experiment)</p></li><li><p>The <a href="https://github.com/dbt-labs/dbt-mcp">dbt MCP server</a></p></li><li><p>A terminal interface (in this case <a href="https://www.warp.dev/">Warp</a>)</p></li></ul><p>Before we go any further, if this is at all interesting to you, I suggest that instead of reading my findings here that you sit down and try this yourself. I'm quite confident you'll find it both illuminating and worth your time.</p><p>We'll get to my findings in a bit. 
Long story short - it was successful enough that it shifted my thinking about the near-term trajectory of data work. </p><p>But first, let's talk about why experiments like this matter so much right now.</p><p><strong>Sensemaking in the age of AI</strong></p><p>You've probably been hearing some variant of these takes multiple times a day:</p><p><em>"An agent is just an LLM run in a loop"</em></p><p><em>"AI agents are coming to replace white-collar work"</em></p><p><em>"I don't even know what an AI agent is, this is just marketing hype"</em></p><p>And about a billion more. All of these represent our collective attempts at sensemaking in this unique technological moment. But honestly, the noise can be so overwhelming that it's tempting to just tune it all out and wait for the dust to settle.</p><p>I don't think that's an option for data practitioners. Instead, we need to develop our own internal compass for sensemaking - and that means getting our hands dirty.</p><p>To do great data work is to be a great sensemaker. My theory of sensemaking requires holding two paradoxical skills in tension:</p><ul><li><p>Build strong mental models about the world and use them to take decisive action</p></li><li><p>Constantly scan for misalignments between your models and reality, then adjust accordingly</p></li></ul><p>Organizations and institutions need time to metabolize change and adjust their mental models. There's a physics to it. And <a href="https://roundup.getdbt.com/p/a-new-kind-of-weird">that physics takes time</a>.</p><p>But when the underlying reality is changing rapidly, the best thing you can do is go make direct contact with that reality. 
Don't wait for the consensus to form - go see for yourself.</p><p>Because things are not the same as they were even 6 months ago:</p><ul><li><p>We've gotten the first wave of models optimized for agentic work (OpenAI&#8217;s o3, Claude 4, and Gemini 2.5)</p></li><li><p>We've started building real infrastructure to connect these models to our systems (MCP and other emerging protocols)</p></li><li><p>LLM-based coding has shifted from autocomplete to actual agents (something <a href="https://roundup.getdbt.com/p/should-we-even-care-about-using-llms">longtime Roundup readers saw coming</a>)</p></li></ul><p>That&#8217;s a bunch of big changes! It can sometimes feel like keeping up with everything here is a full-time job. And with my last couple months being pretty tied up with <a href="https://docs.getdbt.com/blog/dbt-fusion-engine">other things</a>, I felt like I owed it to myself to set aside some time and go deep here.</p><p><strong>The experiment: Two hours from zero to Semantic Layer</strong></p><p>I chose the <a href="https://app.snowflake.com/marketplace/providers/GZSOZ1LLBU/Weather%20Source%2C%20LLC">Weather Source</a> dataset on the Snowflake marketplace precisely because it was both interesting and completely unfamiliar to me. I booted up Warp (I already had the dbt MCP server configured - setting that up might add additional time) and got started. </p><p>In two hours, I went from raw data to a <a href="https://github.com/dbt-labs/weather-climate-dbt/tree/main">working dbt project</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> with:</p><ul><li><p>Documented source definitions</p></li><li><p>Tested data models</p></li><li><p>A functional Semantic Layer with queryable metrics</p></li></ul><p>It felt incredible. A bit unbelievable. 
Of course this was just a simple project and nothing in here would be particularly difficult for an experienced analytics engineer - but it would have taken a whole lot of time and effort.</p><p>Some observations from the process:</p><p><strong>The experience was exhilarating.</strong> Watching an abstract goal decompose into concrete tasks, then seeing those tasks execute in real-time feels like witnessing something totally new. It was also addictive - this interface has the &#8220;just one more level&#8221; feeling of a great video game.</p><p><strong>The cognitive load is different.</strong> It was cognitively demanding but not in the same way that coding is cognitively demanding - I have a sense that I&#8217;d be able to sustain longer blocks of &#8220;pairing&#8221; with Claude Code before getting mentally depleted than I can with normal coding.</p><p><strong>The tools aren't optimized for data work yet.</strong> </p><ul><li><p>It first attempted to build out a bunch of models that depended on each other, but didn&#8217;t check if the first model actually <em>ran</em>. Then there was an error partway through its dependency chain and we had to do a bunch of unthreading.</p></li><li><p>It&#8217;s competent at writing SQL (and dbt-style SQL). I don&#8217;t expect this to be the bottleneck for AI-augmented development.</p></li><li><p>It is not very good at understanding what columns or models it has access to at a given time - I expect this to be an area where the models will be most useful when assisted by deterministic tooling.</p></li></ul><p><strong>What this proved (and didn't prove)</strong></p><p>This experiment convinced me that agentic workflows have moved beyond &#8220;pure speculation&#8221; and into &#8220;definitely worth exploring and net useful for many teams today&#8221;. It feels pretty similar to the early days of coding assistants like Copilot. 
Not yet for every team, but definitely for some, and on a steep acceleration curve.</p><p>This was just a simple experiment and I walked away thinking just as much about what I don&#8217;t know as what I learned.</p><ul><li><p>I still don't know if my models are logically sound (validation would take as long as building)</p></li><li><p>Enterprise-scale datasets might break this approach entirely</p></li><li><p>The actual utility of what I built remains untested</p></li><li><p>And even with all of this, there are just as many organizational bottlenecks facing data teams as technical ones. What implications does this have there (if any)?</p></li></ul><p>But here's the thing: in two hours, I accomplished what would have taken me at least a full day manually - not just the modeling, but documentation, testing, and more. That is worth paying attention to.</p><p><strong>Your move</strong></p><p>When facing a question as vast as "How will AI reshape data work?", it's easy to get paralyzed. But the answer isn't in think pieces or Twitter debates - it's in running experiments.</p><p>My mental model shifted because I made contact with reality. Right now, data teams not using agentic workflows are doing just fine. But things are moving fast. It&#8217;s worth it, at the very least, to get a sense of what the state of the world is here and to think about how you might adapt to it. </p><p>So here's my challenge: Block two hours next week. Pick a dataset you don't know. Try to build something real with these tools. Report back and let me know.</p><p>The future of data work is being written right now, in thousands of small experiments by practitioners who refuse to wait for the dust to settle. 
If you're reading this, you have the expertise to contribute to our collective sensemaking.</p><p>What will you discover when you stop reading about AI and start building with it?</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>There&#8217;s a lot that I&#8217;d improve here for a production project - I&#8217;m making this public to show a checkpoint of where I got in a timeboxed experiment.</p></div></div>]]></content:encoded></item><item><title><![CDATA[From Docker to Dagger (w/ Solomon Hykes)]]></title><description><![CDATA[The creator of Docker on how containers changed everything]]></description><link>https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes</link><guid isPermaLink="false">https://roundup.getdbt.com/p/from-docker-to-dagger-w-solomon-hykes</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 22 Jun 2025 13:00:24 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18dacb78-748c-463d-9553-ed6186da36e1_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this season of the Analytics Engineering podcast, Tristan is digging deep into the world of developer tools and databases. There are few more widely used developer tools than Docker. From its launch back in 2013, Docker has completely changed how developers ship applications. </p><p>In this episode, Tristan talks to Solomon Hykes, the founder and creator of <a href="https://www.docker.com/">Docker</a>. They trace Docker&#8217;s rise from startup obscurity to becoming foundational infrastructure in modern software development. Solomon explains the technical underpinnings of containerization, the pivotal shift from platform-as-a-service to open-source engine, and why Docker&#8217;s developer experience was so revolutionary. 
</p><p>The conversation also dives into his next venture <a href="https://dagger.io/">Dagger</a>, and how it aims to solve the messy, overlooked workflows of software delivery. Bonus: Solomon shares how AI agents are reshaping how CI/CD gets done and why the next revolution in DevOps might already be here.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Key takeaways</h2><p><strong>Tristan Handy: I want to get you to give a little background on yourself, where you've been, what you've been up to for the last couple decades. 
I think many people will know you as the person who kicked off an avalanche that changed how we interact with compute environments by inventing Docker.</strong></p><p><strong>Solomon Hykes: </strong>Docker is the thing I'm known for. Pre-Docker, I grew up in France. I studied programming in a French school called Epitech. It was a brand-new, unconventional school where you learned through nonstop programming, which I loved.</p><p>Eventually, I got exposed to startups, despite being a complete outsider. I met someone who told me about them, and it stuck in my mind. Still in France at the time, I moved into my mom's house in the suburbs of Paris and worked out of the basement.</p><p>By complete luck, I got into an early version of Y Combinator in 2010. That got us on the path to what would become Docker three years later. In 2013, we pivoted to Docker from our previous company, dotCloud.</p><p><strong>Tristan Handy: The original thing was called dotCloud, right?</strong></p><p><strong>Solomon Hykes: </strong>Yep. It was about container technology and its potential, but we didn't quite know how to take it to market. DotCloud was about deploying and hosting people's apps&#8212;platform as a service&#8212;competing with Heroku and many clones.</p><p><strong>Tristan Handy: When did Heroku become a thing?</strong></p><p><strong>Solomon Hykes: </strong>I became aware of it in 2009, just as I was struggling in France with container tech. When we joined YC in 2010, we packaged that tech into dotCloud, our hosting platform. Our differentiator was using containers under the hood when others didn&#8217;t. That let us support many language stacks and even run databases in containers&#8212;which was unheard of at the time.</p><p>Platform as a service was a tough business. Most startups went out of business or got acquired early. 
Eventually, we pivoted from selling the car to building an ecosystem around the engine&#8212;that became Docker.</p><p><strong>Tristan Handy: Did you pivot because selling the car wasn't working? Or because people kept pointing at the engine saying, &#8220;Give me that&#8221;?</strong></p><p><strong>Solomon Hykes: </strong>Both. It was hard to market platforms. Developers expected free hosting, and hosting costs money. Margins were tight because of AWS. It always felt like pushing a boulder uphill. Meanwhile, people wanted to run things locally. There was no good ecosystem for that. Docker provided transparency, flexibility, and portability.</p><p><strong>Tristan Handy: Can you define Docker and containerization, and how it differs from virtualization?</strong></p><p><strong>Solomon Hykes: </strong>Sure. Virtualization splits a physical machine into virtual ones using VMs&#8212;each with its own memory, compute, and storage. It gives flexibility, but with overhead.</p><p>Containerization does something similar but at the operating system level. Instead of virtualizing the machine, you split the OS itself. It&#8217;s mostly done with Linux, which can subdivide itself into isolated units. Containers are more lightweight, letting you run hundreds or thousands, unlike VMs where you might manage a handful before hitting limits.</p><p>Docker didn&#8217;t invent this, but we solved new problems with it.</p><p><strong>Tristan Handy: I remember creating my first Docker container around 2015. I expected a slow boot-up like a VM, but it was instantaneous. Where is the OS in that setup?</strong></p><p><strong>Solomon Hykes: </strong>Great question. Docker relies on Linux. When you're on a Mac, it runs Linux behind the scenes&#8212;today via virtualization. Back then, we used lots of early, rough tools and kernel patches to make Linux containers work. 
Docker put all the pieces together in a coherent way.</p><p><strong>Tristan Handy: So containerization wasn&#8217;t new, but Docker made it accessible?</strong></p><p><strong>Solomon Hykes:</strong><br>Exactly. The Linux kernel had features like namespaces and cgroups&#8212;building blocks for containers. But they weren&#8217;t user-friendly. We made a developer-centric abstraction on top of those tools.</p><p>And Linux provided a massive compatibility layer. Unlike Java, which required writing your app in Java, Docker containers could wrap apps written in any language, as long as they ran on Linux.</p><p><strong>Tristan Handy: So Docker is like infrastructure as code&#8212;a primitive that enables the whole concept?</strong></p><p><strong>Solomon Hykes: </strong>Yes! And because we wanted ubiquity, we avoided pushing too many opinions. We let developers build on top of it in many different ways. That&#8217;s what helped Docker become a de facto standard.</p><p><strong>Tristan Handy: How fragmented is the Linux world under the hood? Did you have to do much abstraction work?</strong></p><p><strong>Solomon Hykes: </strong>We were lucky. The Linux kernel is extremely stable and consistent. But everything above it&#8212;distros, package managers, tooling&#8212;was chaotic. That chaos created the opportunity for Docker to provide a consistent experience.</p><p><strong>Tristan Handy: Were there any drawbacks? Like &#8220;Docker sprawl&#8221; the way VMware saw VM sprawl?</strong></p><p><strong>Solomon Hykes: </strong>Definitely. With power comes chaos. Teams would run dozens of Docker containers, each configured differently. Docker doesn&#8217;t enforce opinions&#8212;by design.</p><p><strong>Tristan Handy: And what happened when you left Docker in 2018?</strong></p><p><strong>Solomon Hykes: </strong>I took time off, became a full-time dad. But I also realized how many unsolved problems remained. 
Especially around CI/CD pipelines and software delivery&#8212;what we now call the software factory.</p><p>That led me to start Dagger.</p><p><strong>Tristan Handy: So Dagger is like &#8220;containers for pipelines&#8221;?</strong></p><p><strong>Solomon Hykes: </strong>Yes. Just as Docker standardized app deployment, Dagger aims to standardize and containerize software delivery. CI/CD pipelines today are often duct-taped together with YAML and bash scripts. We&#8217;re bringing consistency and modularity to that space.</p><p><strong>Tristan Handy: Will there be a &#8220;Daggerfile&#8221; like there&#8217;s a Dockerfile?</strong></p><p><strong>Solomon Hykes: </strong>Sort of. But this time, we&#8217;re opinionated. Dagger is narrowly focused on CI/CD. That lets us provide APIs, SDKs, and a deeper abstraction stack. We give platform engineers a DAG-based system to define repeatable, containerized steps.</p><p><strong>Tristan Handy: And what&#8217;s the role of AI and agents in all this?</strong></p><p><strong>Solomon Hykes: </strong>Great question. We didn&#8217;t plan for it, but our community showed us the way. People started building AI agents that run in Dagger pipelines&#8212;automating things like writing tests, submitting PRs, and optimizing builds.</p><p>That blew our minds. Agents blur the line between development and delivery. They need programmable environments. Dagger is becoming an ideal platform for that.</p><h2>Chapters</h2><p><strong>01:30 &#8211; Early Days: From France to dotCloud</strong></p><p>Solomon shares how his early programming experience and startup journey led to the creation of dotCloud.</p><p><strong>04:00 &#8211; The PaaS Struggle and Birth of Docker</strong></p><p>The team pivots from platform-as-a-service to focusing on the container engine itself&#8212;what would become Docker.</p><p><strong>07:00 &#8211; What Is a Container, Really?</strong></p><p>Solomon explains containerization vs. 
virtualization in plain terms and why it changed the game for developers.</p><p><strong>11:00 &#8211; The Developer Experience That Won the World</strong></p><p>The magic of fast, lightweight Docker containers&#8212;and how that first &#8220;wow&#8221; moment felt.</p><p><strong>14:00 &#8211; Building a Ubiquitous Standard</strong></p><p>Why Docker stayed narrow by design, resisting feature bloat to maximize compatibility.</p><p><strong>18:00 &#8211; DevOps Before DevOps</strong></p><p>How Docker avoided language tribalism and achieved mass developer adoption by choosing Go and CLI-first tooling.</p><p><strong>21:00 &#8211; Complexity and Container Sprawl</strong></p><p>Docker made infrastructure easy&#8212;but created new operational challenges at scale.</p><p><strong>24:30 &#8211; Why CI/CD Pipelines Are Still Broken</strong></p><p>Solomon outlines the gap Docker never got to fix: modern software delivery remains brittle and ad hoc.</p><p><strong>27:00 &#8211; Enter Dagger: DevOps for the Modern Age</strong></p><p>How Solomon&#8217;s new company is treating pipelines as composable software, not brittle scripts.</p><p><strong>30:00 &#8211; Building an OS for the Software Factory</strong></p><p>Dagger helps platform teams manage the complexity of software delivery with reusable, testable components.</p><p><strong>33:00 &#8211; Agent-Native Workflows: A Surprise Use Case</strong></p><p>AI agents begin using Dagger to reason about pipelines, generate code, and submit pull requests autonomously.</p><p><strong>37:00 &#8211; Reimagining the Dev Loop with AI</strong></p><p>Why the boundary between development and CI/CD is collapsing&#8212;and how Dagger fits the agent-powered future.</p><p><strong>41:00 &#8211; Scaling Trust in Delivery</strong></p><p>Tristan and Solomon reflect on how developer tooling evolves and what a stable, fast delivery layer enables.</p><p><strong>45:00 &#8211; Final Thoughts: What&#8217;s Next for DevOps</strong></p><p>The conversation closes 
with predictions on intelligent automation, composability, and the future of platform engineering.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The history and future of the data ecosystem (w/ Lonne Jaffe)]]></title><description><![CDATA[Mainframes, relational databases, ETL, Hadoop, the cloud, and all of it]]></description><link>https://roundup.getdbt.com/p/the-history-and-future-of-the-data</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-history-and-future-of-the-data</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 08 Jun 2025 13:02:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2a174d40-0d03-4fc7-a541-830573130b6e_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this decades-spanning episode, Tristan talks with Lonne Jaffe, Managing Director at Insight Partners and former CEO of Syncsort (now Precisely), to trace the history of the data ecosystem&#8212;from its mainframe origins to its AI-infused future.</p><p>Lonne reflects on the evolution of ETL, the unexpected staying power of legacy tech, and why AI may finally erode the switching costs that have long protected incumbents. The future of the AI and standards era is bright. 
</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h2>Episode chapters</h2><p><strong>00:46 &#8211; Meet Lonne Jaffe: background &amp; career journey</strong></p><p>Lonne shares his career highlights from Insight Partners, Syncsort/Precisely, and IBM, including major acquisitions and tech focus areas.</p><p><strong>04:20 &#8211; The origins of Syncsort &amp; sorting in mainframes</strong></p><p>Discussion on why sorting was a critical early problem in hierarchical databases and how early systems like IMS worked.</p><p><strong>07:00 &#8211; M&amp;A as innovation strategy</strong></p><p>How Syncsort used inorganic growth to modernize its 
platform, including an early example of migrating data from IMS to DB2 without rewriting apps.</p><p><strong>09:35 &#8211; Technical vs. strategic experience</strong></p><p>Tristan probes Lonne&#8217;s technical depth despite his business titles; Lonne shares his background in programming and a fun fact about juggling.</p><p><strong>11:55 &#8211; Why this history matters</strong></p><p>Tristan sets up the key question: what lessons from 1970s-2000s ETL tooling still shape the modern data stack?</p><p><strong>13:00 &#8211; Proto-ETL: The real OGs</strong></p><p>Lonne traces the origins of ETL to 1970s CDC, JCL, and early IBM tools. Prism Solutions in 1988 gets credit as the first real ETL startup.</p><p><strong>15:40 &#8211; Rise of the ETL market (1990s)</strong></p><p>From Prism to Informatica and DataStage&#8212;early 90s vendors brought visual development to what was once COBOL-heavy backend work.</p><p><strong>18:00 &#8211; Why people offloaded Teradata to Hadoop</strong></p><p>Exploring how cost, contention, and capacity drove ETL out of the warehouse and into Hadoop in the 2000s.</p><p><strong>20:00 &#8211; Performance vs. 
price: Jevons Paradox in ETL</strong></p><p>Why lower compute and storage costs led to <em>more</em> ETL, not less&#8212;and how parallelization changed the game.</p><p><strong>22:30 &#8211; Evolution of data management suites</strong></p><p>How ETL expanded into app-to-app integration, catalogs, metadata management, and why these bundles got bloated.</p><p><strong>25:00 &#8211; Rise of data prep &amp; self-service analytics</strong></p><p>Tools like Kettle, Pentaho, and Tableau mirrored ETL for business users&#8212;spawning a whole &#8220;data prep&#8221; category.</p><p><strong>27:30 &#8211; Clickstream, logs &amp; big data chaos</strong></p><p>How clickstream and log data changed the ETL landscape, and the hope (and letdown) of zero-copy analytics.</p><p><strong>29:10 &#8211; Why is old software so sticky?</strong></p><p>Tristan and Lonne explore the economics of switching costs, the illusion of freedom, and whether GenAI could break the lock-in.</p><p><strong>33:30 &#8211; Are old tools actually&#8230; good?</strong></p><p>Defending mainframes and 30-year-old databases like Cache. Sometimes the mature option is better&#8212;just not sexy.</p><p><strong>36:00 &#8211; The new vs. the durable</strong></p><p>Modern tools must prove themselves against decades of reliability and robustness in finance, healthcare, and compliance.</p><p><strong>38:20 &#8211; GenAI in data: The early movers</strong></p><p>Lonne highlights why companies like Atlan and dbt Labs are in the best position to win&#8212;distribution, trust, and product maturity.</p><p><strong>41:00 &#8211; TAM and the Jevons Paradox, again</strong></p><p>Revisiting how price drops expand TAM. 
Some categories vanish, others explode&#8212;depending on elasticity of demand.</p><p><strong>43:15 &#8211; Unlocking new personas with LLMs</strong></p><p>Structured data access for non-technical users is finally viable, but &#8220;it has to be right&#8221;&#8212;trust and quality remain the barrier.</p><p><strong>46:00 &#8211; Real-world examples: dbt&#8217;s MCP server win</strong></p><p>Tristan shares how dbt&#8217;s Metadata API became a catalog replacement for a traditional financial institution&#8212;an unplanned AI GTM success.</p><p><strong>48:30 &#8211; Agents, not interfaces</strong></p><p>New pattern: LLMs as agents interacting directly with infrastructure via APIs. Tool use is becoming table stakes for AI integration.</p><p><strong>50:30 &#8211; Are LLMs birthright tools yet?</strong></p><p>Discussion around adoption of ChatGPT Enterprise, Claude, etc. Lonne suggests adoption is accelerating fast&#8212;and the usage model matters.</p><p><strong>52:00 &#8211; Looking ahead</strong></p><p>The conversation ends with a reflection on GenAI&#8217;s near future in data workflows, TAM expansion, and what the next episode might tackle.</p><div><hr></div><h2>Key takeaways from this episode</h2><p><strong>Tristan Handy: You've had a long career in tech. Maybe start by giving us the 30,000-foot view of what you've been up to over the last couple decades?</strong></p><p><strong>Lonne Jaffe:</strong> I&#8217;ve been at Insight Partners for about eight years now, working mostly on deep tech investments&#8212;AI infrastructure companies like Run AI and <a href="http://Deci.ai">deci.ai</a>, both acquired by Nvidia. I&#8217;ve also done work with data infrastructure companies like SingleStore. Before Insight, I was CEO of a portfolio company called Syncsort, now Precisely. It was founded in 1968.</p><p>Prior to that, I was at IBM for 13 years, working in middleware and mainframe technologies. 
Products like WebSphere, CICS, and TPF&#8212;foundational systems for enterprise computing.</p><p><strong>Tristan Handy: And Syncsort's origin was in sorting, right? Literally sorting files?</strong></p><p><strong>Lonne Jaffe:</strong> Exactly. In the early days of computing, sorting was a huge part of what you did. Much of the data was hierarchical&#8212;stored in IMS&#8212;and had to be flattened into files to process. The algorithms were optimized to run in extremely resource-constrained environments.</p><p><strong>Tristan Handy: Fascinating. And I assume as compute and storage improved, the data integration landscape evolved?</strong></p><p><strong>Lonne Jaffe:</strong> Yes. We saw a move from hierarchical to relational databases, then toward ETL tools in the 80s and 90s. The first real ETL startup was probably Prism Solutions in 1988. Informatica and DataStage showed up in the early 90s, followed by Talend and others.</p><p><strong>Tristan Handy: It seems like we got a whole bundle of tools over time&#8212;ETL, CDC, app integration, metadata, and so on.</strong></p><p><strong>Lonne Jaffe:</strong> Yes, often bundled together, even though data prep and app integration were treated separately. That persisted for longer than you'd expect. At Syncsort, we acquired a company with a "transparency" solution that allowed IMS applications to use data stored in DB2 without rewriting code&#8212;a clever way to manage switching costs.</p><p><strong>Tristan Handy: Speaking of switching costs&#8212;why are these legacy tools so sticky?</strong></p><p><strong>Lonne Jaffe:</strong> Great question. In many cases, no customer loves the product. They&#8217;d switch in a heartbeat&#8212;if it were easy. But rewriting jobs and ensuring reliability is a heavy lift. The best outcome is a new system that replicates old functionality. 
And for many organizations, that&#8217;s not worth the risk.</p><p><strong>Tristan Handy: But if generative AI could reduce those switching costs?</strong></p><p><strong>Lonne Jaffe:</strong> That&#8217;s the potential. Code generation, agents that explore and iterate&#8212;those could erode the moat that&#8217;s protected these incumbents for decades. Not tomorrow, but it&#8217;s a real possibility.</p><p><strong>Tristan Handy: It also seems like some of these systems are more robust than people give them credit for.</strong></p><p><strong>Lonne Jaffe:</strong> Absolutely. Mainframes are IO supercomputers. Products like InterSystems Cache, used by Epic, are incredibly performant. But new systems must match or exceed those capabilities in reliability and scale, which is a high bar.</p><p><strong>Tristan Handy: As you look at the evolution of the modern data stack, how do you think about its impact on the market?</strong></p><p><strong>Lonne Jaffe:</strong> In the 2010s, we saw disaggregation&#8212;tools like Fivetran, dbt, and Snowflake each tackled a slice of the old enterprise bundle. But the TAM isn&#8217;t infinite. Some categories may compress or vanish entirely if price drops aren&#8217;t offset by new demand.</p><p><strong>Tristan Handy: Do you think AI expands or compresses the data stack?</strong></p><p><strong>Lonne Jaffe:</strong> It depends. High elasticity of demand&#8212;like with dashboards or analytics&#8212;can drive massive TAM expansion. But some categories, like logo redesign or simple data movement, might get commoditized. For more complex workflows, AI agents accessing platforms like dbt or Atlan could dramatically increase value by automating common tasks and enabling new personas.</p><p><strong>Tristan Handy: We&#8217;ve seen an example already&#8212;a customer replaced their data catalog with our dbt Cloud metadata server and AI interface.</strong></p><p><strong>Lonne Jaffe:</strong> That&#8217;s telling. 
If AI interfaces can connect to tools like dbt and generate value&#8212;self-service, documentation, lineage&#8212;it changes the game. Especially for organizations already standardized on those platforms.</p><p><strong>Tristan Handy: What&#8217;s your view on how these AI interfaces get distributed?</strong></p><p><strong>Lonne Jaffe:</strong> ChatGPT Enterprise, Claude, and others are spreading fast. Eventually, you&#8217;ll want those tools to search files, access internal metadata, and interact with your data stack&#8212;not just answer questions from the open web.</p><p><strong>Tristan Handy: It makes a lot of sense. If AI is going to serve enterprise users, it needs access to the real data. Otherwise, it&#8217;s just a toy.</strong></p><p><strong>Lonne Jaffe:</strong> Exactly. A model that can&#8217;t query or verify against your actual environment won&#8217;t be reliable. And data quality and observability&#8212;something dbt Cloud is already good at&#8212;become foundational.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Everything terminals (w/ Zach Lloyd)]]></title><description><![CDATA[The universal integration layer...the command line? 
Tristan talks terminals with Zach Lloyd, the founder of Warp]]></description><link>https://roundup.getdbt.com/p/everything-terminals-w-zach-lloyd</link><guid isPermaLink="false">https://roundup.getdbt.com/p/everything-terminals-w-zach-lloyd</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Sun, 25 May 2025 13:01:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d276a840-287d-4ae3-882b-42115f46cfc5_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this episode, Tristan talks with Zach Lloyd, founder of <a href="https://www.warp.dev/">Warp</a>&#8212;a terminal built for the modern era, including for AI agents. They explore the history of terminals, differences between terminals and shells, and what the future might look like. In a world driven by generative AI, the terminal could once again be the control center of computer usage.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>Join Tristan May 28 at the <strong><a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___">2025 dbt Launch Showcase</a></strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1929918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/163234704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p><strong>01:00 &#8211; Introducing Warp and Zach Lloyd</strong></p><ul><li><p>Zach Lloyd explains Warp's origin, mission, and initial vision.</p></li></ul></li><li><p><strong>02:40 &#8211; Why redesign the terminal?</strong></p><ul><li><p>Zach describes why traditional terminal UX was ripe for reinvention.</p></li></ul></li><li><p><strong>04:43 &#8211; Enter LLMs: A new direction for Warp</strong></p><ul><li><p>Warp evolves into a natural language interface for developer workflows.</p></li></ul></li><li><p><strong>06:34 &#8211; What is a shell?</strong></p><ul><li><p>Zach defines shells, how they process 
text, and their role in the CLI ecosystem.</p></li></ul></li><li><p><strong>07:58 &#8211; Shells vs programs vs built-ins</strong></p><ul><li><p>Distinguishing between shell commands and standalone programs.</p></li></ul></li><li><p><strong>10:00 &#8211; Why do developers debate shells?</strong></p><ul><li><p>Features, syntax, and licensing behind the Bash vs Z Shell discussion.</p></li></ul></li><li><p><strong>12:17 &#8211; Why terminals still matter</strong></p><ul><li><p>The enduring power of text-based computing and scripting.</p></li></ul></li><li><p><strong>16:40 &#8211; What is a terminal, really?</strong></p><ul><li><p>Clarifying the difference between terminal hardware, emulators, and modern terminal apps.</p></li></ul></li><li><p><strong>20:13 &#8211; The Warp interface</strong></p><ul><li><p>Zach breaks down Warp&#8217;s UI: input editor, output blocks, and mouse support.</p></li></ul></li><li><p><strong>22:48 &#8211; Will Warp replace your IDE?</strong></p><ul><li><p>The vision of AI-driven development and the convergence of terminal, editor, and chat.</p></li></ul></li><li><p><strong>27:20 &#8211; Rethinking development interfaces</strong></p><ul><li><p>Finding the ideal hub for AI-native software development.</p></li></ul></li><li><p><strong>35:00 &#8211; Why the terminal has an edge</strong></p><ul><li><p>Advantages of the terminal for cross-project, full-lifecycle developer tasks.</p></li></ul></li><li><p><strong>37:10 &#8211; Bottom-up adoption strategy</strong></p><ul><li><p>How Warp approaches growth: focus on individual developers, not top-down mandates.</p></li></ul></li><li><p><strong>39:50 &#8211; Is Warp redefining the terminal?</strong></p><ul><li><p>The challenges of innovating in a legacy-dominated space and creating a new category.</p></li></ul></li><li><p><strong>42:45 &#8211; Developer control &amp; context in Warp</strong></p><ul><li><p>Customization, context-awareness, and MCP integration in Warp&#8217;s AI 
tooling.</p></li></ul></li><li><p><strong>46:32 &#8211; Closing reflections</strong></p><ul><li><p>Zach and Tristan wrap up their thoughts on the future of terminals, AI, and developer tools.</p></li></ul></li></ul><h2>Key takeaways from this episode</h2><p><strong>Tristan Handy: Can you tell us about Warp, where the idea came from, and where you&#8217;re at today?</strong></p><p><strong>Zach Lloyd:</strong> Warp reimagines the command line to make it more approachable, powerful, and useful for developers. I've been a software engineer for over 20 years and always used the terminal, but never understood why it worked the way it did. I used to learn the minimum I needed and rely on team members when I ran into issues.</p><p>After my last startup, I looked at tools I used frequently that could have a big impact if improved. The terminal stood out. I realized better UX&#8212;like being able to use a mouse to position the cursor or select output for copy-paste&#8212;could unlock a lot of productivity. That was the initial idea about five years ago.</p><p>We spent the first couple of years redesigning the interface. Today, Warp is more than a terminal&#8212;it's a natural language interface to the command line, powered by large language models (LLMs). You can use it to set up projects, write code, debug production, and more.</p><p><strong>Tristan: I want to dig into fundamentals. Can you define what a shell is?</strong></p><p><strong>Zach:</strong> A shell is a program that parses text input, runs commands, and returns text output. You can run it interactively or through scripts. Terminals, by contrast, are the graphical layer that displays text and captures keyboard input. Shells like Bash, Z Shell, and Fish offer different features, syntaxes, and configurations. 
Some commands, like <code>cd</code>, are shell built-ins and don&#8217;t require forking new processes; others, like <code>cp</code>, are standalone programs.</p><p><strong>Tristan: Why do terminals persist in a GUI-dominated world?</strong></p><p><strong>Zach:</strong> A few reasons. First, it&#8217;s easier to write command-line apps than GUI apps. Second, the interface is infinitely flexible&#8212;you can pass endless flags and parameters. Third, command-line programs interoperate cleanly via text streams. And lastly, they&#8217;re scriptable. Developers can automate repetitive workflows easily, which is powerful.</p><p><strong>Tristan: So a terminal just runs a shell. But I never think of terminals as having features. What makes a terminal more than a simple interface?</strong></p><p><strong>Zach:</strong> Terminals emulate old hardware&#8212;keyboards and text displays. Today&#8217;s terminal apps are GUI shells that simulate this behavior. Most are "dumb terminals," just rendering characters. But they can support features like theming, control characters for advanced UI (e.g., in Vim), and even bitmap rendering.</p><p><strong>Tristan: Warp looks very different. Can you describe it?</strong></p><p><strong>Zach:</strong> Warp looks more like a chat or notebook interface. Each command's output is grouped in a logical block instead of being dumped in a scroll. The input area behaves more like a code editor, with syntax highlighting and first-class mouse support. We're aiming for modern UX.</p><p><strong>Tristan: So you're blending terminal, editor, and chat. Will people eventually write all their code in Warp?</strong></p><p><strong>Zach:</strong> My vision is that developers will increasingly describe what they want in natural language, and agents will do the work. Developers supervise the results. That interface needs to support managing many tasks at once. That&#8217;s what we&#8217;re building towards. 
It won&#8217;t even be called a terminal&#8212;it&#8217;s a new category of software.</p><p><strong>Tristan: The boundaries between these tools are blurring. And maybe the best interface for AI-assisted development isn't an IDE or chat app&#8212;it could be the terminal.</strong></p><p><strong>Zach:</strong> The terminal spans all phases of development&#8212;from setup to deployment and debugging. It also supports cross-project work, which IDEs don&#8217;t. That&#8217;s a huge strength.</p><p><strong>Tristan: But terminals are a personal choice. How do you think about adoption and your business model?</strong></p><p><strong>Zach:</strong> Like editors, terminals are developer-choice tools. We don&#8217;t go top-down. Our motion is bottoms-up: get individuals to love Warp, then expand into teams and enterprises for security, privacy, and data controls.</p><p><strong>Tristan: Are you trying to reset the baseline for what a terminal is?</strong></p><p><strong>Zach:</strong> We're not open source, though we&#8217;ve considered it. It&#8217;s risky. But our focus isn&#8217;t on redefining "the terminal." It&#8217;s on building the best tool for developers to ship software. That might require a new category name.</p><p><strong>Tristan: What&#8217;s the dev experience in Warp like? Is it customizable?</strong></p><p><strong>Zach:</strong> We support theming and shortcuts. But the most important part is AI context. Warp can use any CLI tool to gather context&#8212;GitHub CLI, GCloud, etc. We&#8217;re also implementing the Model Context Protocol (MCP) and plan to better support custom/internal tools as well.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Why compilers matter (w/ Lukas Schulte)]]></title><description><![CDATA[We continue our season on developer experience by looking at compilers with the SDF Labs cofounder.]]></description><link>https://roundup.getdbt.com/p/why-compilers-matter-w-lukas-schulte</link><guid isPermaLink="false">https://roundup.getdbt.com/p/why-compilers-matter-w-lukas-schulte</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Mon, 12 May 2025 12:02:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/fab3c2ea-0b19-4b35-a887-c779cff0e8d3_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Tristan Handy dives deep into the world of compilers in this episode of The Analytics Engineering Podcast with Lukas Schulte, cofounder of SDF Labs (not to be confused with <a href="https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram">last episode&#8217;s guest&#8212;Lukas&#8217; dad and fellow SDF cofounder Wolfram Schulte</a>). Tristan and Lukas discuss what compilers are, how they work, and what they mean for the data ecosystem. 
SDF, which was <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">recently acquired by dbt Labs</a>, builds a world-class SQL compiler aimed at abstracting away the complexity of warehouse-specific SQL.</p><p>The conversation covers the evolution of compiler technology, what software engineering has gotten right over the past several decades, and <a href="https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering">why the data ecosystem is poised for similar transformation</a>. Lukas and Tristan explore why SQL has lagged behind other programming ecosystems, and how new compiler infrastructure could lead to package management, interoperability, and greater innovation across data platforms. It&#8217;s a fascinating (and timely) episode: <a href="https://www.getdbt.com/blog/how-to-get-ready-for-the-new-dbt-engine">Get ready for the new dbt engine</a>.</p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p>Join Tristan May 28 at the <strong>2025 dbt Launch Showcase</strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png" width="1456" height="364" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:364,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1929918,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase/?utm_medium=event&amp;utm_source=podcast&amp;utm_campaign=q2-2026_dbt-launch-showcase-2025_aw&amp;utm_content=____&amp;utm_term=all___&quot;,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/163234704?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fIuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 424w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 848w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1272w, https://substackcdn.com/image/fetch/$s_!fIuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8da12175-c97f-4e70-abf7-6f1c3d887f40_2400x600.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a></figure></div><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:false,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p>02:40 The vision behind SDF Labs</p></li><li><p>04:00 What is a compiler?</p></li><li><p>05:00 Components of a compiler: frontend, IR, backend</p></li><li><p>08:00 Syntax vs. semantics and the role of parsing</p></li><li><p>10:00 Logical vs. 
physical plans in SQL compilers</p></li><li><p>13:00 Historical context: mainframes to LLVM</p></li><li><p>16:00 Cross-architecture portability in Rust &amp; other compilers</p></li><li><p>18:00 What is LLVM and why it matters</p></li><li><p>20:00 Bootstrapping and the self-recursive nature of compilers</p></li><li><p>21:00 Compilers in Java, TypeScript, and dbt</p></li><li><p>23:00 Why compilers are foundational to software ecosystems</p></li><li><p>26:00 The SQL dialect problem in data warehouses</p></li><li><p>29:00 Can SQL get its own LLVM?</p></li><li><p>31:00 How Substrate and DataFusion aim to standardize SQL</p></li><li><p>35:00 Package management and the path toward SQL abstractions</p></li><li><p>38:00 The future of the data ecosystem with a common SQL compiler</p></li></ul><h2>Key takeaways from this episode</h2><h3>What is a compiler?</h3><p><strong>Tristan Handy:</strong> What is a compiler?</p><p><strong>Lukas Schulte:</strong> It's something that takes higher-level human-readable code and translates, compiles, rewrites it into lower-level machine code that is much harder for humans to understand and much easier for machines to understand.</p><p>Compilers typically have phases. They have a frontend that deals with the language you're working with, a middle component&#8212;usually called an IR or intermediate representation&#8212;and a backend that takes that IR and compiles it into machine code.</p><h3>Compiler phases: frontend, IR, backend</h3><p><strong>Tristan Handy:</strong> How does it all come together?</p><p><strong>Lukas Schulte:</strong> There&#8217;s a preprocessor that handles macros, removes comments, and prepares the text. Then a lexer converts it into tokens. These tokens get assembled into a tree that the compiler can understand. That&#8217;s where syntax validation and semantic analysis happen.</p><p>From there, we build a logical representation of the operations we want to perform. 
That transitions to a physical plan, which starts considering the hardware: how many cores, how much memory, which files we&#8217;re accessing. After that, optimizations are applied and it compiles to actual machine code using a toolchain like LLVM.</p><h3>Syntax vs. semantics</h3><p><strong>Lukas Schulte:</strong> Let&#8217;s break down syntax vs. semantics.</p><p>Imagine the code<code> x = x + 1</code>. That has valid syntax. Its meaning&#8212;its semantics&#8212;is that we&#8217;re incrementing <code>x</code> by 1.</p><p>Now, you could also write <code>x += 1</code>. Different syntax, same semantics. So syntax defines structure, and semantics define meaning. That distinction is important when you&#8217;re analyzing or transforming code.</p><h3>LLVM and portability</h3><p><strong>Tristan Handy:</strong> Have we been building abstraction layers like this for decades?</p><p><strong>Lukas Schulte:</strong> Absolutely. That&#8217;s what LLVM does. It provides a consistent intermediate representation that compilers can use to target multiple backends&#8212;Intel, ARM, different OSes. Apple invested early in LLVM to support custom chips.</p><p>With Rust, for example, LLVM is what lets us build binaries that behave the same on macOS, Windows, and Linux with relatively little effort.</p><h3>Bootstrapping compilers</h3><p><strong>Tristan Handy:</strong> So there&#8217;s this recursive loop&#8212;compilers being built with other compilers?</p><p><strong>Lukas Schulte:</strong> Exactly. Rust wasn&#8217;t always written in Rust&#8212;it started in C++. Eventually, the compiler was rewritten in Rust itself. Now, Rust compiles Rust. It&#8217;s fully self-hosted. That&#8217;s common with mature languages&#8212;it shows the compiler ecosystem is stable and powerful enough to sustain itself.</p><h3>Why compilers matter</h3><p><strong>Tristan Handy:</strong> You said once that compilers are the foundation of every software ecosystem. 
What did you mean?</p><p><strong>Lukas Schulte:</strong> There are two big drivers in software: abstractions and standards. You want one way to interface with a USB device&#8212;not ten. Same for software. You want one standard way to express a Python program, a JavaScript app, etc.</p><p>Compilers enforce those standards and make sure the same code works across platforms. That consistency powers things like package managers, shared libraries, and open ecosystems.</p><h3>SQL dialects and fragmentation</h3><p><strong>Tristan Handy:</strong> Are there ecosystems that are doing worse than others?</p><p><strong>Lukas Schulte:</strong> SQL does a particularly bad job. Anyone who's used more than one data warehouse knows you can't take the same SQL statement and expect it to work the same way. Casting, case sensitivity, functions&#8212;every engine handles these things differently.</p><h3>Toward a universal SQL compiler</h3><p><strong>Tristan Handy:</strong> Can you convince me this problem is solvable?</p><p><strong>Lukas Schulte:</strong> Yes. That's what we're working on with SDF&#8212;creating a shared intermediate representation for SQL. If we can express SQL logic in a unified form, we can compile it to any dialect&#8212;BigQuery, Snowflake, Redshift, and so on.</p><p>That allows developers to build reusable libraries, just like in other languages. It also makes governance, validation, and testing easier.</p><h3>Future of data ecosystems</h3><p><strong>Tristan Handy:</strong> What would that future look like for practitioners?</p><p><strong>Lukas Schulte:</strong> One major change would be the emergence of robust SQL libraries. Today, there&#8217;s no <code>import</code> system for SQL. 
Everyone writes similar logic over and over.</p><p>A shared compiler abstraction would let us reuse components, collaborate across companies, and build an ecosystem of packages for transformations, metrics, and validations&#8212;similar to how we use NPM or PyPI.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The evolution of databases (w/ Wolfram Schulte)]]></title><description><![CDATA[In the first episode of our season on developer experience, the cofounder and CTO of SDF Labs, now a part of dbt Labs, discusses databases, compilers, and dev tools.]]></description><link>https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram</link><guid isPermaLink="false">https://roundup.getdbt.com/p/the-evolution-of-databases-w-wolfram</guid><dc:creator><![CDATA[Dan Poppy]]></dc:creator><pubDate>Mon, 28 Apr 2025 12:02:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e4839270-d17c-40d0-94d3-06ac3a969b0f_1680x1200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Summary</h3><p>Welcome to our new season of The Analytics Engineering Podcast. This season, we&#8217;re focusing on developer experience. We&#8217;ll explore the developer experience by tracing the lineage of foundational software tools, platforms, and frameworks. From compilers to modern cloud infrastructure and data systems, we&#8217;ll unpack how each layer of the stack shapes the way developers build, collaborate, and innovate today. It&#8217;s a theme that lends itself to a lot of great conversations on where we&#8217;ve come from and where we&#8217;re headed.</p><p>In our first episode of the season, Tristan talks with Wolfram Schulte. Wolfram is a distinguished engineer at dbt Labs. 
He joined the company via the <a href="https://www.getdbt.com/blog/dbt-labs-acquires-sdf-labs">acquisition of SDF Labs</a>, where he was <a href="https://www.getdbt.com/blog/building-the-next-gen-dbt-engine">co-founder and CTO</a>. He spent close to two decades at Microsoft Research and several years at Meta building their data platform.</p><p>One of the amazing things about Wolfram is his love of teaching others the things that he's passionate about. In this episode, he discusses the internal workings of data systems. He and Tristan talk about <a href="https://docs.getdbt.com/blog/the-levels-of-sql-comprehension">SQL parsers</a>, <a href="https://roundup.getdbt.com/p/the-power-of-a-plan-how-logical-plans">compilers</a>, <a href="https://docs.getdbt.com/blog/sql-comprehension-technologies">execution engines</a>, <a href="https://www.getdbt.com/resources/guides/the-analytics-development-lifecycle">composability</a>, and the world of heterogeneous compute that we're all headed towards. While some of this might seem a little sci-fi, it&#8217;s likely right around the corner. And Wolfram is inventing some of the tech that's going to get us there.</p><div><hr></div><p>Join Tristan May 28 at the <strong><a href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase">2025 dbt Launch Showcase</a></strong> for the latest features landing in dbt to empower the next era of analytics. 
We'll see you there.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase&quot;,&quot;text&quot;:&quot;Register now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase"><span>Register now</span></a></p><p><em>Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.</em> </p><div><hr></div><p><strong>Listen &amp; subscribe from:</strong></p><iframe class="spotify-wrap podcast" data-attrs="{&quot;image&quot;:&quot;https://i.scdn.co/image/ab6765630000ba8a2f8724bfe318715a7c00c406&quot;,&quot;title&quot;:&quot;The Analytics Engineering Podcast&quot;,&quot;subtitle&quot;:&quot;dbt Labs, Inc.&quot;,&quot;description&quot;:&quot;Podcast&quot;,&quot;url&quot;:&quot;https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE&quot;,&quot;belowTheFold&quot;:true,&quot;noScroll&quot;:false}" src="https://open.spotify.com/embed/show/4BKMMeVXk4jJnAQSqGSJvE" frameborder="0" gesture="media" allowfullscreen="true" allow="encrypted-media" loading="lazy" data-component-name="Spotify2ToDOM"></iframe><ul><li><p><a href="https://open.spotify.com/show/4BKMMeVXk4jJnAQSqGSJvE">Spotify</a></p></li><li><p><a href="https://podcasts.apple.com/us/podcast/the-analytics-engineering-podcast/id1574755368">Apple Podcasts</a></p></li><li><p><a href="https://music.amazon.com/podcasts/333fe811-1b14-499c-b609-9bfb8f06d1ae/the-analytics-engineering-podcast">Amazon Music</a></p></li><li><p><a href="https://tunein.com/podcasts/Technology-Podcasts/The-Analytics-Engineering-Podcast-p1466362/">TuneIn</a></p></li><li><p><a href="https://analyticsengineeringroundup.libsyn.com/rss">RSS feed</a></p></li></ul><h3>Chapters</h3><ul><li><p>01:35 Introduction to dbt Labs and SDF Labs collaboration </p></li><li><p>04:42 Wolfram's journey from monastery to tech 
innovator </p></li><li><p>07:55 The role of compilers in database technology </p></li><li><p>11:05 Building efficient engineering systems at Microsoft </p></li><li><p>14:13 Navigating data complexity at Facebook </p></li><li><p>18:51 Understanding database components and their importance </p></li><li><p>24:44 The shift from row-based to column-based storage </p></li><li><p>27:40 Emergence of modular databases </p></li><li><p>28:44 The rise of multimodal databases </p></li><li><p>30:45 The role of standards in data management </p></li><li><p>35:04 Balancing optimization and interoperability </p></li><li><p>36:38 Conceptual buckets for database engines </p></li><li><p>38:46 DataFusion compared to DuckDB</p></li><li><p>40:44 ClickHouse </p></li><li><p>44:20 Bridging the gap between SQL and new technologies </p></li><li><p>50:55 The future of developer experience</p></li></ul><h2>Key takeaways from this episode</h2><h3>From monastery to Microsoft: Wolfram&#8217;s journey</h3><p><strong>Tristan Handy: Can you walk us through the Wolfram Schulte origin story?</strong></p><p><strong>Wolfram Schulte: </strong>I was born in rural Germany&#8212;Sauerland&#8212;and ended up in a monastery boarding school after my father passed away. Their goal was to train monks and priests, but that didn&#8217;t stick for me.</p><p>Later I went to Berlin&#8212;back then you had to cross East Germany to get there&#8212;and began studying physics. But I realized everyone else understood physics better than I did! One day I walked past a lecture on data structures and algorithms, and I was hooked. 
I hadn&#8217;t written a line of code at that point, but I switched to computer science immediately.</p><p>After my PhD in compiler construction, I joined a startup, then landed at Microsoft Research in 1999 thanks to a chance encounter with the logician Yuri Gurevich.</p><h3>Inside Microsoft Research and Cloud Build</h3><p>At Microsoft Research, we were like Switzerland&#8212;neutral across teams like Office, Windows, and Bing. We&#8217;d invent tools and ideas, but often the business units didn&#8217;t trust them. That changed when I was asked to build an engineering org.</p><p>We created <strong>Cloud Build</strong>, a distributed build system like Google&#8217;s Bazel. It reduced build times from hours to minutes and had a huge impact on iteration speed, productivity, and even morale. People stayed in flow. Builds were faster, cheaper, and smarter&#8212;running mostly on spare capacity.</p><h3>Janitorial work at Meta: cleaning up big data</h3><p><strong>You later joined Facebook (Meta). What was that like?</strong></p><p>A different world. No titles for engineers. Egalitarian, fast-moving. I joined to clean up the data warehouse&#8212;what they called &#8220;janitorial work.&#8221; At Meta, each type of workload had its own engine: time-series, batch, streaming, etc. This made understanding lineage and dependencies across systems extremely hard.</p><p>We responded by building UPM, a SQL pre-processor that stitched metadata across engines. It became part of Meta&#8217;s privacy infrastructure and compliance tooling, especially after the fallout from Cambridge Analytica.</p><h3>Databases as compilers</h3><p><strong>Let&#8217;s shift gears. Can you walk us through how analytical databases actually work&#8212;like a professor at a whiteboard?</strong></p><p>Sure. Think of a database like a compiler:</p><ol><li><p><strong>Parsing &amp; analysis:</strong> Is the SQL valid? 
Are the types correct?</p></li><li><p><strong>Optimization:</strong> SQL is declarative, so you can reorder joins, push down filters&#8212;based on algebraic laws like associativity.</p></li><li><p><strong>Execution:</strong> Often done in parallel, especially in modern warehouses.</p></li><li><p><strong>Storage:</strong> Columnar vs. row-based; optimized formats like Parquet or ClickHouse&#8217;s custom format.</p></li></ol><p>Historically, storage and compute were bundled. Now they&#8217;re decoupled. But when the engine understands the format deeply, performance is much better.</p><h3>The rise of modular and composable data platforms</h3><p><strong>How did we get from monolithic systems to the composable database architectures we have today?</strong></p><p>It started with the rise of big data&#8212;Hadoop, HDFS, MapReduce. That decoupled compute from storage. Columnar formats like Parquet enabled analytical workloads. Then came Iceberg, Delta Lake, and similar standards that enabled multiple engines to share data.</p><p>Modern databases are modular. For example, Postgres is transactional, but you can bolt on an OLAP engine for analytical queries. You can mix and match based on your workload. The result is a data ecosystem that&#8217;s far more flexible&#8212;but also more complex.</p><h3>Engine families: Snowflake, DuckDB, ClickHouse</h3><p><strong>Can you help us bucket the different kinds of engines out there?</strong></p><p>Totally. Here are three buckets:</p><ul><li><p><strong>Cloud-native engines:</strong> Snowflake, BigQuery. They&#8217;re optimized for massive scale, often with their own proprietary storage.</p></li><li><p><strong>Embedded/single-node engines:</strong> DuckDB, DataFusion. Great for local dev or embedded analytics. DuckDB is for users; DataFusion is for database builders.</p></li><li><p><strong>Real-time/high-throughput engines:</strong> ClickHouse, Druid. 
Tuned for streaming and extremely fast aggregations.</p></li></ul><p>Each has its trade-offs. Increasingly, projects are combining these. For example, you can plug DuckDB or DataFusion into Spark to speed up leaf-node execution. The whole engine space is getting more composable&#8212;and more interchangeable.</p><h3>The role of SDF in dbt&#8217;s future</h3><p><strong>If you think about the future where SDF is fully integrated into dbt Cloud, what does that enable?</strong></p><p>Initially, it might feel the same&#8212;but faster, smarter. Longer-term, we can give developers superpowers.</p><p>Imagine your dev environment proactively surfaces:</p><ul><li><p>&#8220;This data looks different than yesterday&#8212;want to investigate?&#8221;</p></li><li><p>&#8220;You&#8217;re missing a metric that&#8217;s often used alongside this one.&#8221;</p></li><li><p>&#8220;This join will behave differently on engine X&#8212;here&#8217;s what to change.&#8221;</p></li></ul><p>That&#8217;s the kind of intelligent, predictive developer experience we&#8217;re building. We&#8217;re catching SQL up to what IDEs have done for code. And if we can make logical plans portable across engines, dbt becomes the consistent interface across heterogeneous compute.</p><div><hr></div><p><em>This newsletter is sponsored by dbt Labs. 
Discover why more than 50,000 companies use dbt to accelerate their data development.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___&quot;,&quot;text&quot;:&quot;Book a demo&quot;,&quot;action&quot;:null,&quot;class&quot;:&quot;button-wrapper&quot;}" data-component-name="ButtonCreateButton"><a class="button primary button-wrapper" href="https://www.getdbt.com/resources/dbt-cloud-demos-with-experts/?utm_medium=email&amp;utm_source=hs-email&amp;utm_campaign=__&amp;utm_content=biweekly-demos____&amp;utm_term=___"><span>Book a demo</span></a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://roundup.getdbt.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Analytics Engineering Roundup! 
Subscribe for free to receive new posts.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A New Kind of Weird]]></title><description><![CDATA[Reflections on Data Council 2025]]></description><link>https://roundup.getdbt.com/p/a-new-kind-of-weird</link><guid isPermaLink="false">https://roundup.getdbt.com/p/a-new-kind-of-weird</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 27 Apr 2025 11:52:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7b3c7b00-7933-40bf-8f72-d7af7c575fd0_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I did something wrong.</p><p>I try really hard to go into every conference with an open mind about what I&#8217;m going to learn. <em>Tabula rasa. Blank Slate. Beginner&#8217;s Mind.</em> This is actually a really important part of being able to continually grow and develop your analysis of the industry rather than getting stuck in familiar mental grooves.</p><p>But for this year&#8217;s Data Council, I have to admit I went in with a preconceived take on the newsletter I wanted to be sending out today.</p><p><em>&#8220;I&#8217;ve been to a whole lot of data conferences that talk about the intersection of data and generative AI&#8221;</em>, I&#8217;d write triumphantly, <em>&#8220;but this was the first one I&#8217;ve been to where data and AI felt <strong>truly</strong> integrated, where the worlds <strong>finally</strong> converged</em>&#8221;.</p><p>And you know what? It was true. 
You couldn&#8217;t throw a stone in the convention hall without hitting a booth for AI-assisted data development or for using your data in agent systems.</p><p>GenAI applications, after all, aren&#8217;t just running on models trained on massive datasets built and maintained with many of the tools and open source libraries created by the people and organizations at Data Council. Their usage and utility also depend on strong infrastructure, as Martin has told us.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aKSl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aKSl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 424w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 848w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aKSl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png" width="1456" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:322377,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/162200569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aKSl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 424w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 848w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1272w, https://substackcdn.com/image/fetch/$s_!aKSl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd09d0df-ebbe-44f4-9d94-68853542a21c_2064x1126.png 1456w" 
sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>We saw a lot of very cool data + AI infrastructure at Data Council!</p><ul><li><p><a href="https://www.bauplanlabs.com/">Bauplan</a>, fresh off their recent fundraise, walked us through the minimum viable data platform</p></li><li><p>The Snowflake booth showed how <a href="https://docs.snowflake.com/en/user-guide/snowflake-cortex/cortex-agents">Cortex Agents</a> can sit in your database and perform useful work</p></li><li><p>Lloyd Tabb gave a great walkthrough of <a href="https://www.malloydata.dev/">Malloy</a> and repeatedly emphasized the 
benefits of writing LLM-based analytics queries with a Semantic Layer as opposed to going straight to SQL</p></li><li><p>Jacob ran a session on <a href="https://x.com/matsonj/status/1898504109193613667">vibe-coding your data engineering workflows</a></p></li><li><p>MCP was the talk of the town, with notable MCP servers being discussed by <a href="https://clickhouse.com/blog/agenthouse-demo-clickhouse-llm-mcp">ClickHouse</a>, <a href="https://github.com/motherduckdb/mcp-server-motherduck">MotherDuck</a> and <a href="https://docs.getdbt.com/blog/introducing-dbt-mcp-server">yours truly</a>.</p></li></ul><p>And then of course we had <a href="https://www.linkedin.com/in/eliasdefaria/">Elias</a> discussing SDF + dbt and walking through a new bit of data infrastructure that I believe is going to play a significant role in the story of how data + Gen AI fit together: the <a href="https://www.getdbt.com/blog/building-the-next-gen-dbt-engine">new dbt engine</a>, which is Rust-based, type-aware, and ready to validate that your SQL queries are dialect-accurate and governed, whether they are written by a human or a machine.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SHnt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SHnt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 424w, 
https://substackcdn.com/image/fetch/$s_!SHnt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 848w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1272w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SHnt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png" width="3024" height="1816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1816,&quot;width&quot;:3024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8022998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/162200569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8a6e56ae-0e62-4e20-b69e-9cf799f6733f_3024x4032.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!SHnt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 424w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 848w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1272w, https://substackcdn.com/image/fetch/$s_!SHnt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F833ccdd3-c8d7-42af-87d3-e2d857c5e80b_3024x1816.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>So in a certain sense, I <em>am</em> walking away from this Data Council feeling like the worlds of generative AI and traditional data infra are closer together than ever.</p><p>But in another, deeper sense, I&#8217;m not.</p><h3>A familiar kind of weird and a new kind of weird</h3><p>Three years ago, in his reflections on Data Council, Drew had one request: &#8220;<a href="https://roundup.getdbt.com/p/keep-data-council-weird">Keep Data Council Weird</a>&#8221;. At the time, we were wondering if the ecosystem was becoming too vendor+VC driven and hoping that we&#8217;d still maintain our spunky outsider energy.</p><p>Well, I have to be honest with you, this Data Council felt pretty darn weird.</p><p>Partly, it felt weird in a familiar way. I asked Drew if this year felt weird and here&#8217;s what he told me:</p><blockquote><p>The venue - a masonic temple - was gorgeous and unlike any conference venue I&#8217;ve been to before. My legs hurt from walking up and down 4 flights of carpeted stairs. I watched Elias&#8217;s talk from a parapet (is that even the word?)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> in a column adorned theater. I think I saw a crucifix. The bathrooms had couches in them. Scott B and I talked about our skincare routines. I saw a lot of old friends and former coworkers. I befriended [redacted]. My beef with [redacted] grew even deeper. I had a <a href="https://trueburgeroakland.com/">top 3 all-time cheeseburger</a> and a bottom 3 all-time dessert (Mango Piggy). 
Pete and the whole Data Council team put on one hell of an event this year!</p></blockquote><p>If you&#8217;ve been around the block enough times, this is a familiar kind of weirdness. Comforting.</p><p>It also felt weird in a different way though:</p><p>Because fundamentally, even though data infra + AI are moving ever closer together, there are <em>big</em> differences in how each side moves and progresses.</p><p>The reason boils down to this:</p><p><strong>Data Infra is heavily engineered, based on building well-understood systems and standards</strong>.</p><p>It <em>moves</em> at the speed of ecosystems and standards. Three years ago at Data Council I&#8217;m sure there were people talking about Apache Iceberg and wondering whether it would be adopted across the industry. We&#8217;re big believers in Iceberg at dbt Labs and I expect to see strong and meaningful adoption of Iceberg over the next three years. I think an 80th percentile good outcome for Iceberg adoption looks like a world where organizations are not meaningfully constrained by their choice of data platform and are able to use Iceberg to avoid vendor lock-in and have true cross-platform control of how they operate on their data.</p><p><strong>Generative AI is built differently, and it moves at a different speed.</strong></p><p>The folks at Anthropic like to say that LLMs are <a href="https://www.youtube.com/watch?v=TxhhMTOTMDg">grown, not built.</a> Three years ago when Drew said that we should keep Data Council Weird, we were about 9 months out from the release of ChatGPT, and a year away from GPT-4. </p><p>Since then, the price of a query to GPT-4 has fallen by somewhere around 100x. OpenAI is <a href="https://www.theinformation.com/articles/openai-forecasts-revenue-topping-125-billion-2029-agents-new-products-gain">projecting $125 billion in revenue by 2029</a>. The latest paradigm shift, reasoning models, is only around six months old. 
</p><p>I don&#8217;t know what an 80th percentile &#8220;good&#8221; (meaning fast) outcome looks like here, but there are people a lot closer to this than me who are saying we&#8217;re going to be <a href="https://ai-2027.com/">deploying bio-engineered algae nanobots to fuel the data centers</a> doing recursively self-improving AI by the time we hit Data Council three years from now.</p><p>That, to me, is pretty weird. </p><p>The weirdness of two worlds, closer than ever before but apparently moving at blindingly different speeds. </p><p>The weirdness of sitting in a talk and getting legitimately excited by the idea that we as an ecosystem can robustly adopt the nearly-decade-old <a href="https://arrow.apache.org/">Apache Arrow</a> and then going into the hall to talk to someone who had just walked out of a talk on <a href="https://x.com/BEBischof">Bryan&#8217;s</a> Foundation Models track and was wondering to what extent two-year-old LLM-based coding workflows are going to change whether any of these questions are still relevant.</p><p>So what do we do with this?</p><p>Look, maybe one day soon, we&#8217;ll pinch ourselves, bolt awake and think &#8220;man that whole AI thing was crazy&#8221;. I&#8217;ll look back on this newsletter, cringe a bit about my prognostication and sheepishly admit that maybe I got carried away by drawing out lines on a curve. God knows it&#8217;s happened before.</p><p>But &#8230; maybe not. And in that world, what relevance does data infra have?</p><p>I think it means that all of this matters a lot - even more so in this world. It means that pretty soon, the data systems and data infrastructure we build are going to be powering a whole lot of systems that interface more directly with the world than we are used to.</p><p>Because my prewritten take about data systems and AI workflows becoming increasingly intertwined and dependent on each other <em>was right</em>. 
And now we need to figure out how to make engineered data infrastructure that moves at human speed support LLMs that look like they are moving much faster and are still <a href="https://www.darioamodei.com/post/the-urgency-of-interpretability">fundamentally mysterious to us</a>.</p><p>The real world, and the data we represent it with, have a lot of complexity. And if we&#8217;re about to have AI systems that are 100x cheaper and 100x more powerful than what we have today operating on the tools, systems and standards we build, then they&#8217;d better be really good.</p><p>I don&#8217;t have an exact answer to how we should approach this. I don&#8217;t think anyone does.</p><p>I do know that I&#8217;m looking forward to next year&#8217;s Data Council and the one after that and the one after that too. I&#8217;m hoping that alongside the new weirdness, we keep the familiar weirdness and that we all continue to share our knowledge, our expertise and perhaps most importantly our mango piggies.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lk6_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lk6_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 424w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 848w, 
https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png" width="1250" height="614" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:614,&quot;width&quot;:1250,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1482330,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/162200569?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lk6_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 424w, 
https://substackcdn.com/image/fetch/$s_!Lk6_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 848w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Lk6_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe152151f-ad89-425c-a36e-1933b90c7d1b_1250x614.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" style="height:20px;width:20px" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" 
x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Anders eating a well deserved Mango Piggy</figcaption></figure></div><h3>Appendix</h3><p>As I was writing this, the ever thoughtful Benn Stancil released a post touching heavily on MCP and the dbt MCP.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:162134831,&quot;url&quot;:&quot;https://benn.substack.com/p/a-new-invisible-hand&quot;,&quot;publication_id&quot;:23588,&quot;publication_name&quot;:&quot;benn.substack&quot;,&quot;publication_logo_url&quot;:null,&quot;title&quot;:&quot;A new invisible hand&quot;,&quot;truncated_body_text&quot;:&quot;&quot;,&quot;date&quot;:&quot;2025-04-25T16:26:33.799Z&quot;,&quot;like_count&quot;:16,&quot;comment_count&quot;:4,&quot;bylines&quot;:[{&quot;id&quot;:5667744,&quot;name&quot;:&quot;Benn Stancil&quot;,&quot;handle&quot;:&quot;benn&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a317e60a-9bd1-4c75-bb54-66d517f735dc_1100x1100.jpeg&quot;,&quot;bio&quot;:&quot;Working at benn.company. Tweeting at benn.chat. Posting pictures at benn.photos. Networking with professionals at benn.work.&quot;,&quot;profile_set_up_at&quot;:&quot;2021-04-27T23:00:23.729Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-10-21T19:27:33.368Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:254785,&quot;user_id&quot;:5667744,&quot;publication_id&quot;:23588,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:23588,&quot;name&quot;:&quot;benn.substack&quot;,&quot;subdomain&quot;:&quot;benn&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;A weekly Substack on data and technology, with some occasional conversations about culture, sports, and politics. 
&quot;,&quot;logo_url&quot;:null,&quot;author_id&quot;:5667744,&quot;primary_user_id&quot;:5667744,&quot;theme_var_background_pop&quot;:&quot;#FF6B00&quot;,&quot;created_at&quot;:&quot;2019-12-15T21:00:48.339Z&quot;,&quot;email_from_name&quot;:&quot;Benn Stancil&quot;,&quot;copyright&quot;:&quot;Benn Stancil&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;disabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;bennstancil&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://benn.substack.com/p/a-new-invisible-hand?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><span></span><span class="embedded-post-publication-name">benn.substack</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">A new invisible hand</div></div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">10 months ago &#183; 16 likes &#183; 4 comments &#183; Benn Stancil</div></a></div><p>As with basically everything Benn writes - it&#8217;s worth your time. 
The post probably deserves a full response, so I&#8217;ll save commentary for another day, but I recommend you check it out.</p><p><em>The analytics engineering roundup is sponsored by dbt Labs.</em></p><p><em>If you want to see what the big kerfuffle about dbt + SDF is all about, plus a whole lot more, join Elias and the dbt team for our Cloud Launch Showcase on 5/28 (parapet not included).</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase&quot;,&quot;text&quot;:&quot;Sign up for the Cloud Launch Showcase&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.getdbt.com/resources/webinars/2025-dbt-cloud-launch-showcase"><span>Sign up for the Cloud Launch Showcase</span></a></p><p></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Editor&#8217;s note: That is not the word</p></div></div>]]></content:encoded></item><item><title><![CDATA[How AI will Disrupt BI As We Know It]]></title><description><![CDATA[A continuation-in-spirit from my recent post &#8220;How AI will Disrupt Data Engineering As We Know It.&#8221;]]></description><link>https://roundup.getdbt.com/p/how-ai-will-disrupt-bi-as-we-know</link><guid isPermaLink="false">https://roundup.getdbt.com/p/how-ai-will-disrupt-bi-as-we-know</guid><dc:creator><![CDATA[Tristan Handy]]></dc:creator><pubDate>Sun, 06 Apr 2025 11:02:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0coS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0coS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0coS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!0coS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!0coS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!0coS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0coS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp" width="728" height="416" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:598210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/160597435?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0coS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!0coS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!0coS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!0coS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26fee9e7-b4fc-4367-a8cb-ceea08c7f970_1792x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" style="height:20px;width:20px" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Credit: DALL-E</figcaption></figure></div><p>Business intelligence is on a collision course with AI.</p><p>The collision itself hasn&#8217;t happened yet, but it&#8217;s clearly coming. The inevitability of this has been clear roughly since the launch of ChatGPT, but no one knew exactly what shape that would take.</p><p>Today I want to propose how that collision is going to happen and what will happen in its aftermath.</p><p>I think it will be a very good thing for data practitioners of all stripes&#8212;those who officially have the word &#8216;data&#8217; in their title but also everyone else who simply uses data in the service of their larger job. 
So: I&#8217;m all for it.</p><p>Before getting into the AI part of the story, I need to introduce two specific mental models.</p><p>Let&#8217;s go.</p><h2>BI is a Portfolio of Stuff</h2><p>We all use the term &#8220;BI&#8221; but have become inured to what an Orwellian term it is. &#8220;Business intelligence&#8221; isn&#8217;t descriptive; it is industry-speak for a bunch of stuff glued together in order to achieve a desired user outcome: know facts about a business using tabular data.</p><p>For a long time, BI included a bunch of stuff that it no longer does. Like: data processing. Pre-cloud, BI tools processed data locally and often had proprietary processing engines. They competed on being fast.</p><p>With the cloud, that evaporated. Local data processing was anathema. BI tools got easier to build but gave up a part of their value proposition.</p><p>In today&#8217;s post-cloud world, I would suggest that BI tools have three jobs:</p><ol><li><p><strong>Modeling</strong>: Define the semantic concepts behind your structured data: metrics, dimensions, joins, etc. Think: LookML.</p></li><li><p><strong>Exploratory data analysis (EDA)</strong>: The process of exploring data in search of useful insights. Highly iterative, flow-state, and unpredictable. Think: Looker explore window.</p></li><li><p><strong>Presentation</strong>: The aggregation of multiple data artifacts together to present a single cohesive narrative that can be shared out to potentially many others within an organization, all governed by a permission model. 
Think: Looker dashboard.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xC29!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xC29!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 424w, https://substackcdn.com/image/fetch/$s_!xC29!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 848w, https://substackcdn.com/image/fetch/$s_!xC29!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 1272w, https://substackcdn.com/image/fetch/$s_!xC29!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xC29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png" width="562" height="211.90796703296704" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:549,&quot;width&quot;:1456,&quot;resizeWidth&quot;:562,&quot;bytes&quot;:63228,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/160597435?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xC29!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 424w, https://substackcdn.com/image/fetch/$s_!xC29!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 848w, https://substackcdn.com/image/fetch/$s_!xC29!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 1272w, https://substackcdn.com/image/fetch/$s_!xC29!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F36c7dda7-38ab-4619-88f0-7019877a7f6a_1558x587.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Some tools skip modeling and just allow users to do EDA without a model. EDA and presentation are the most core jobs of any BI tool and every BI tool I&#8217;m familiar with does both. 
And it is the fact that BI tools facilitate the EDA process that enables them to govern and share the presentation of that analysis.</p><h2>Scaling the Criticality of an Analysis</h2><p><em>All credit to my collaborator Dave Connors for this mental model &#129438;</em></p><p>Generally speaking, artifacts pass through a few lifecycle stages as they mature into data products supporting production use cases. Think about these stages as the &#8216;production line&#8217; of BI.</p><h3>Phase 0: Exploratory Analysis</h3><p>The first thing data practitioners do when faced with a business question is to start developing low-fidelity sketches to try to answer it.</p><p>The vast majority of the work generated here will be thrown away, so there are low expectations for code quality and governance. The primary goals of the best EDA experiences are iteration speed, flow state, and flexibility.</p><h3>Phase 1: Personal Reporting</h3><p>At a certain point, some exploratory analysis will cross over into a true insight; your question is answered, your curiosity sated. The question is important enough that you want to make sure you can return to it later. But it is not yet &#8220;ready for prime time&#8221;&#8212;you&#8217;re not ready to share it with others and have it be a part of someone&#8217;s operating cadence.</p><p>Some BI tools have a separate section for your &#8220;personal space&#8221;&#8212;think about your personal folder in Looker.</p><h3>Phase 2: Shared Reporting</h3><p>The moment that a report gets shared with another person, the required governance characteristics of a data artifact increase significantly. When you create a report you understand its context; when someone else starts using it they just expect it to be correct.</p><p>In phases 0 and 1, there may not be any governance applied&#8212;all governance may be applied at the compute layer with grants. But once you share an artifact, it is the governance at the BI layer that determines who gets to see what. 
This is simply because <em>most data consumers don&#8217;t have accounts within the data platform</em> and so the BI tool takes over as the arbiter.</p><p>In phases 0 and 1, there is also no auditability requirement. Auditability, change tracking, and general data ops best practices are introduced when artifacts are shared with others in Phase 2.</p><h3>Phase 3: Production Artifact</h3><p>When shared reporting reaches a very high level of criticality (frequent access by a large number of end users, agreed-upon SLAs, support for a critical business process, dynamic features), it&#8217;s officially &#8220;in production&#8221; and needs to be owned and operated like any other production data asset.</p><p>===</p><p>If you think about these stages as the &#8216;production line&#8217; of BI, the most important job of a BI tool is to be the conveyor belt through all of these stages. Start with raw materials, end with a production data product. At each phase of maturity, it&#8217;s easy to extend the product to support the next set of capabilities: governance, dynamic filters, SSO, etc. You never think about those things during Phase 0, but as your work progresses, the BI tool makes it straightforward to progressively add those capabilities.</p><p>But for this all to work, you gotta start the process inside the BI tool all the way back from Phase 0. You can&#8217;t do your EDA in Jupyter &amp; Pandas and expect to ship it to users in Tableau&#8230;that&#8217;s not how that works.</p><p>So: you gotta do your EDA in a BI tool to take advantage of the &#8220;production line&#8221;. But&#8230;are BI tools typically the best way to do EDA? We&#8217;ll return to that later.</p><h2>MCP and AI-as-Aggregator</h2><p>The final thing we need to understand is the impact of a <em>context protocol</em>. 
I wrote about this <a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering">a few weeks ago</a>:</p><blockquote><p>The easiest thing to do for any technology vendor at the very onset of the AI era was to take all of the domain-specific context that you had and surface it to users in a chat interface. And we did the same thing. It was (and is) quite good&#8212;it does a great job of allowing users to ask business questions and answering them with semantic-layer-governed responses.</p><p>The problem with this approach is that users don&#8217;t actually want to interact with dozens of chat interfaces. They don&#8217;t want to remember to go to a given tool to get one type of answer and another tool for another type of answer. There will not be 30 chat experiences all with different context. There will be one&#8230;or maybe just a few. But likely a single dominant one.</p><p>This is how <a href="https://stratechery.com/aggregation-theory/">aggregators</a> work. You likely don&#8217;t use a bunch of different search engines&#8212;you probably just use one, and it is probably Google. This is how chat will go as well.</p><p>The problem is, Google could scrape the web and respond to all queries based on that knowledge. But ChatGPT cannot know all of the information you want to ask it questions about (at least, yet). That lack of business context is the problem.</p><p>That&#8217;s where a <em>context protocol</em> comes in. A context protocol&#8212;a somewhat new topic in the public AI conversation&#8212;is a standardized way for services to provide additional context to models via an open protocol. 
The most promising one today is called <a href="https://modelcontextprotocol.io/introduction">MCP</a>, but whether or not MCP wins, the awareness/excitement/support for this idea has developed a ton of momentum and I am fairly convinced that <em>something like this</em> will become real and widely supported.</p><p>There will be a large number of context providers (every source of valuable enterprise context) and a large number of context consumers (different products with AI capabilities). There is no way to create point-to-point integrations to facilitate this. A protocol will be needed if we are going to see the right type of advancements, and I think it will happen.</p><p>Imagine that your license to ChatGPT Enterprise or Claude Desktop or whatever <em>already came with</em> a connection to all of the metadata about every piece of structured data you had access to. What was there, how trustworthy it was, how suitable it was for the analysis you were describing, etc.</p></blockquote><p>Well, in the intervening weeks since I wrote this, a couple of things have happened. 
First, this:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WIrJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4155a5-353e-4231-ac09-9a0259256bfb_846x696.png"><img src="https://substackcdn.com/image/fetch/$s_!WIrJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0c4155a5-353e-4231-ac09-9a0259256bfb_846x696.png" width="846" height="696" alt="" loading="lazy"></a><figcaption class="image-caption"><a href="https://x.com/sama/status/1904957253456941061">https://x.com/sama/status/1904957253456941061</a></figcaption></figure></div><p>&#8230;then this:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ToFw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40f29e8-6bef-42b8-a608-1f77c938ffee_854x508.png"><img src="https://substackcdn.com/image/fetch/$s_!ToFw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa40f29e8-6bef-42b8-a608-1f77c938ffee_854x508.png" width="854" height="508" alt="" loading="lazy"></a><figcaption class="image-caption"><a href="https://x.com/sundarpichai/status/1906484930957193255">https://x.com/sundarpichai/status/1906484930957193255</a></figcaption></figure></div><p>Clearly this thing is going somewhere.</p><p>Just as momentously, I have gotten access to an internal-only dbt/MCP-powered experience in Claude Desktop. In it, I can ask every type of metadata question I might want (powered by our Metadata API) and I can also ask questions about all of our business metrics (powered by our Semantic Layer API).</p><p>It is incredible. I don&#8217;t want to share too much right now, but &#8230; having your data and metadata available in the context of a modern reasoning model is incredible.</p><h2>BI in an AI-First World</h2><p>Ok, we now understand: the jobs of a BI tool, BI conveyor belt, and how to get structured data context into your AI-of-choice.
We&#8217;re finally in position to tackle the coming collision.</p><p>Here it is, plain and simple:</p><ol><li><p>AI is going to be meaningfully better at exploratory data analysis than any BI tool.</p></li><li><p>If you take away EDA from BI, the &#8216;conveyor belt&#8217; model breaks down. And the conveyor belt model is the primary reason you use your current BI tool.</p></li><li><p>It is not yet clear how the BI ecosystem will adapt to this new reality.</p></li></ol><p>That&#8217;s it. That&#8217;s my entire argument. Let&#8217;s see if it holds up.</p><h3>Artificial Intelligence will Far Outstrip Business Intelligence for Exploratory Data Analysis</h3><p>There are a lot of data tasks that AI is good at. I&#8217;ve talked about a lot of these in the context of data engineering <a href="https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering">here</a>. But the area of data analysis that will benefit <em>most</em> from AI is EDA.</p><p>I am confident about that for two reasons. First, I have empirically validated this first-hand. The dbt + MCP + Claude 3.7 combo that I outlined earlier is just dramatically better at EDA than anything I&#8217;ve experienced in my life, and it&#8217;s getting better fast. But I am not ready to show you that (it&#8217;s single-digit weeks away from a public demo!), so you may not believe me. Fair.</p><p>The second reason I&#8217;m confident about this is that most time spent in EDA is spent writing code (whether done by hand or via a GUI). And we now know how good leading-edge models are at writing code when supplied with the right context.
Whether you want to reference <a href="https://www.techsistence.com/p/up-to-90-of-my-code-is-now-generated">individual developer testimonials</a> or <a href="https://www.cnbc.com/2025/03/15/y-combinator-startups-are-fastest-growing-in-fund-history-because-of-ai.html?utm_source=tldrnewsletter">the head of YC</a> or <a href="https://x.com/karpathy/status/1886192184808149383">Andrej Karpathy</a> or <a href="https://www.forbes.com/sites/jackkelly/2024/11/01/ai-code-and-the-future-of-software-engineers/">Google</a>, it all lines up. And it just so happens that the two software engineers whose opinions I trust most in the world&#8212;my cofounders Drew and Connor&#8212;have gone all in on Cursor over the last 3 months and are not-quite-but-almost religious about the experience.</p><p>If you find yourself skeptical of this, here are a few things to keep in mind.</p><ol><li><p>You don&#8217;t need the LLM to answer &#8216;why&#8217; questions, or generate hypotheses, for it to be far superior to your current workflow. Rather&#8212;it just makes you <em>a lot faster</em> because it can write EDA code a whole lot faster than you can (whether you&#8217;re writing Excel formulas or dataframe operations).</p></li><li><p>Accuracy is a non-issue as long as you ask a question that can be governed by a semantic layer. The code written tends to be: get data from the SL, manipulate it in Python, generate a chart using some JavaScript charting library. If you can&#8217;t get a dataset governed by an SL query, text-to-SQL continues to improve with sufficient context.</p></li></ol><p>Just imagine: an interface that simply gets your questions answered far faster. You remain the objective function and the creative drive behind the process; AI is simply better and faster than you are at writing analytical code.</p><p>IMO that shouldn&#8217;t feel threatening, <strong>that should feel empowering</strong>.
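To make that loop concrete: the glue code the model writes is usually tiny. Here is a runnable toy sketch of the pattern (get governed data from the SL, manipulate it in Python, hand the result to a chart). Note that <code>fetch_metric</code> is a hypothetical stand-in stub that fabricates a small result set, not dbt&#8217;s actual Semantic Layer API:

```python
import pandas as pd

def fetch_metric(metric: str, group_by: str) -> pd.DataFrame:
    """Hypothetical stand-in for a semantic-layer query.

    A real SL client would return governed, pre-modeled metric data;
    here we fabricate a tiny frame so the sketch runs on its own."""
    return pd.DataFrame(
        {
            "region": ["NA", "NA", "EMEA", "EMEA", "APAC"],
            "revenue": [120.0, 80.0, 95.0, 45.0, 60.0],
        }
    )

# 1. get data from the SL (stubbed above)
df = fetch_metric("revenue", group_by="region")

# 2. manipulate it in Python
rollup = (
    df.groupby("region", as_index=False)["revenue"]
    .sum()
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)

# 3. pass `rollup` to whatever charting layer the interface uses
print(rollup)
```

In the real workflow the LLM writes this code for you; the point is how little of it there is once the metric definitions live in the semantic layer.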
I seriously lost it the first time I interacted with our internal data in this type of experience. The primary value prop of a data analyst shouldn&#8217;t be writing code; it should be analytical problem solving and generating action.</p><h3>The Conveyor Belt Model Breaks Down</h3><p>You don&#8217;t use your BI tool because it is the fastest or most delightful EDA experience. You use it because, when you have something to publish to your coworkers, you know exactly how to do that.</p><p>But what if another tool were <em>so much better</em> at EDA that <em>you would be handicapping yourself if you didn&#8217;t use it?</em> What would you do?</p><p>There are likely three answers.</p><p>First, you could go back to publishing one-off assets. If you ask any AI experience to &#8220;give me that in an Excel file,&#8221; most of them have no problem doing that. So maybe you just go back to shipping attachments. But that doesn&#8217;t feel like progress.</p><p>Second, having iterated and found the insight you were looking for, you now have to reconstitute that analysis inside of your BI tool of choice. In practice this will likely only happen rarely; it is not a stable equilibrium because every human hates double work.</p><p>Third, and hopefully preferable, is that we find some way to pull the results of an exploration back into the governed framework of the BI tool. Imagine asking &#8220;make a Power BI worksheet out of this analysis.&#8221; We will need to get deeper into the MCP era to see exactly how this will play out, but I&#8217;m optimistic that it will be possible.</p><p>The third option still sees the BI tool as an important governance and presentation layer, but pulls the most strategic responsibility (EDA) out of its portfolio.</p><h3>A Very Different BI Tool</h3><p>BI tools used to ship with compute engines.
Today they do not.</p><p>What if BI tools were no longer the primary way EDA was done?</p><p>What if their primary job was to render data artifacts in a governed, interactive environment?</p><p>That is still an incredibly valuable thing, and needed for as long as humans continue to interact with structured data (IMO: a long time). But it&#8217;s not what BI tools look like today.</p><p>Most BI tool vendors want to pull this new EDA experience <em>inside their chrome</em>&#8212;exposing AI-powered interfaces inside their products. I don&#8217;t believe this will be how most users do EDA, for three reasons:</p><ol><li><p><strong>User behavior</strong> <br>Aggregation theory will dominate: every knowledge worker inside a company needs access to this functionality, and they&#8217;re not all going to think to go to a specific tool first. They&#8217;re going to prefer to simply ask data questions in the same place they ask all of their other questions: Claude, ChatGPT Enterprise, whatever.</p></li><li><p><strong>Tool combinations</strong> <br>MCP is not powerful only because it lets you use a single tool; it is powerful because it is a pluggable framework for pulling in all kinds of tools for the model to use. You&#8217;ll be able to ask a BI question (&#8220;Show me our most important renewals for the coming quarter&#8221;) and then immediately act on it in another tool (&#8220;Email the main point of contact on the account to set up a check-in meeting&#8221;). Having all of these tools interact together inside of a single interface is combinatorially powerful. There is already a large ecosystem of tooling available and community-driven innovation is happening <em>fast</em>.</p></li><li><p><strong>Tech</strong> <br>Except for MSFT, current BI vendors are not AI research labs.
They are just not going to create better models or be the primary destination for all AI interactions within a company.</p></li></ol><h2>My Predictions</h2><p>I think that the BI workflow that has dominated for the past ~15 years is going to change significantly over the next two. EDA will migrate substantially to AI interfaces, enabled by MCP.</p><p>I think this will be incredibly positive for all knowledge workers throughout a company. It will enable more users to create sophisticated analytics and will enable existing data practitioners to move significantly faster.</p><p>I think this will be a headwind for many current BI vendors. BI is extremely sticky, and this change isn&#8217;t going to happen overnight, but it will be a headwind.</p><p>I think there is likely space for new players to innovate: to be the best place to aggregate and govern all of the artifacts built in this new workflow.</p><p>I&#8217;ll return to this post in six months and see how my predictions are faring!</p>]]></content:encoded></item><item><title><![CDATA[Iceberg?? Give it a REST!]]></title><description><![CDATA[The new abstraction that changes nothing... and everything]]></description><link>https://roundup.getdbt.com/p/iceberg-give-it-a-rest</link><guid isPermaLink="false">https://roundup.getdbt.com/p/iceberg-give-it-a-rest</guid><dc:creator><![CDATA[Anders Swanson]]></dc:creator><pubDate>Sun, 30 Mar 2025 11:01:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TBhF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9219149c-3e08-4f73-a68a-4ca508a025a1_500x560.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The analytics engineering landscape is shifting beneath our feet as our familiar data warehouse coalesces into the data engineer&#8217;s lakehouse&#8212;all thanks to a powerful new abstraction.
For us SQL lovers, the future paradoxically resembles both the present and the past, yet the opportunity ahead is simply too compelling to ignore.</p><p>Today, I&#8217;m going to sketch out for you:</p><ol><li><p>what exactly this abstraction of abstractions at the heart of this sea change is</p></li><li><p>the lay of the land today: how far things have come, what&#8217;s still holding us back, and open questions</p></li><li><p>[EXTRA CREDIT]<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> technical weeds: table format convergence, S3 tables, vended credentials, and more</p></li></ol><p>To not bury the lede any further, I&#8217;ll be talking about <a href="https://iceberg.apache.org/">Apache Iceberg&#8482;&#65039;</a>, and a further abstraction: the <a href="https://iceberg.apache.org/terms/#catalog">Iceberg REST Catalog Specification</a> (IRC).</p><p>The current state of Iceberg isn&#8217;t easy to navigate. Despite all the buzz, the technology is still young. The ecosystem changes quickly&#8212;each day brings something new, from proposals to private previews to updates in <code>pyiceberg</code>.</p><p>So what&#8217;s really going on here? What matters most? Why should you care? And if even Iceberg&#8217;s creator says we shouldn&#8217;t have to think about it (more below), why is everyone talking about it?</p><p>Over the past year, I&#8217;ve been working with many data teams to learn about and implement Iceberg in production. I&#8217;m convinced of Iceberg&#8217;s potential to impact many more analytics engineering teams. True Iceberg adoption will happen once it is robustly integrated with all major data platforms, but even where it has been integrated, a last-mile user experience is still missing, and that&#8217;s dampening the adoption curve.
But it&#8217;s improving every day!</p><p>So let&#8217;s get into it!</p><h1>Iceberg: A tough nut to crack</h1><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TBhF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9219149c-3e08-4f73-a68a-4ca508a025a1_500x560.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!TBhF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9219149c-3e08-4f73-a68a-4ca508a025a1_500x560.jpeg" width="500" height="560" alt="r/dataengineering - Apache Iceberg: SQL and ACID semantics in the front, scalable object storage in the back" loading="lazy"></a><figcaption class="image-caption">all credit to the great Brian Olsen for this one</figcaption></figure></div><p>Understanding Apache Iceberg is a &#8220;tough nut to crack&#8221; because it&#8217;s easy to get lost in the technical weeds and miss the big picture<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. Ironically, Iceberg exists so that most people don&#8217;t need to think about it at all!
In the first five minutes of his talk at Data Council last year, Ryan Blue, a creator of Apache Iceberg, says almost exactly that:</p><blockquote><p><em>Iceberg should be invisible [in that it should aim to]:</em></p><ul><li><p><em>avoid unpleasant surprises</em></p></li><li><p><em>don&#8217;t steal attention and reduce context switching</em></p></li></ul><p>Ryan Blue, <a href="https://www.youtube.com/watch?v=_GW3GYZK66U">"Why You Shouldn't Care about Iceberg"</a></p></blockquote><p>This sounds a lot to me like a powerful abstraction that lets you focus on the task at hand without getting bogged down in details.</p><p>&#8220;Bogged down in details&#8221; is an apt description for data engineering until recently. MapReduce, Hadoop, Hive, and Spark were all powerful tools that got the job done, but no one will claim that they were easy to use. You could never just write SQL &#8212; a portion of your brain was always reserved for reasoning about where and how the data was written, and for avoiding unpleasant surprises and edge cases. Your resulting pipeline could process petabytes of data, and you had the sweat to show for it.</p><p>&#8220;Bigger data &#8594; more work&#8221; is a reasonable heuristic, but the impetus for Iceberg was an attempt to minimize that cognitive burden with a new abstraction: a table that just works like a database&#8217;s table (e.g. Postgres or SQL Server)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.
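One way to see why this abstraction holds up is a deliberately simplified toy model of the core idea: a table is just a pointer to the latest immutable snapshot, and a commit is a single atomic pointer swap. This is a sketch only; real Iceberg persists snapshots as metadata and manifest files on object storage and delegates the atomic swap to the catalog:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: which data files belong to it."""
    snapshot_id: int
    data_files: tuple

@dataclass
class ToyIcebergTable:
    """Toy model of an Iceberg table (not the actual spec).

    Readers pin a snapshot; writers build a new snapshot and then
    'commit' it with one pointer update, so readers never observe a
    half-finished write."""
    snapshots: list = field(default_factory=list)

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1] if self.snapshots else Snapshot(0, ())

    def commit(self, new_files) -> Snapshot:
        # writers derive a brand-new snapshot from the current one...
        snap = Snapshot(self.current.snapshot_id + 1,
                        self.current.data_files + tuple(new_files))
        # ...and publishing it is a single pointer update
        self.snapshots.append(snap)
        return snap

t = ToyIcebergTable()
t.commit(["orders-00.parquet"])
reader_view = t.current            # a reader pins this snapshot
t.commit(["orders-01.parquet"])    # a concurrent write lands
# the pinned reader still sees one file; new readers see two
```

The same pointer-swap idea is also what makes features like time travel cheap: old snapshots are still sitting there, immutable.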
Iceberg isn&#8217;t a silver bullet that solves all problems with large analytic data, but it&#8217;s a stronger, empowering abstraction.</p><h2>The IRC (no not that IRC)</h2><h3>The Summer of <s>Love</s> Iceberg Catalogs</h3><p>Ten months ago now, in June 2024, during what we colloquially refer to as &#8220;Summit Season&#8221;, two hallmark announcements were made within 24 hours of each other.</p><p>&#8220;Iceberg steals the Summits&#8217; spotlight&#8221; &amp; &#8220;Iceberg wins the table format war!&#8221; comprise the gist of many folks&#8217; reactions. I largely agree, with a small tweak: the real winner was the Iceberg REST Catalog.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OTnF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8369ec18-6c29-4d64-b5b5-a11e1189b431_1342x448.png"><img src="https://substackcdn.com/image/fetch/$s_!OTnF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8369ec18-6c29-4d64-b5b5-a11e1189b431_1342x448.png" width="1342" height="448" alt="" loading="lazy"></a></figure></div><h3>What does an IRC do for me?</h3><p>If you want to know what an IRC is and does, I&#8217;ve put that section at the bottom of this post in hopes of avoiding the technical weeds and staying high-level.</p><p>The IRC reminds me of when I first started using Dropbox in college. Countless times, I&#8217;d stay up all night before a deadline writing in Microsoft Word on my MacBook Pro. In the morning I&#8217;d run across campus to a desktop PC at the computer lab.
In a minute, I could pull my paper off the internet and open it in Word again so I could print it.</p><p>It&#8217;s easy now to take for granted the power of the abstraction that Dropbox represented, even at a time when the Internet already existed. There was complexity behind the scenes, but the core UX was magic in its simplicity: it was a folder of files that was where you wanted it to be, and it just worked like a folder should. This is what I feel the IRC represents for us in data.</p><p>Coming back to data warehousing, we can replace a &#8220;folder of a dozen files&#8221; with a &#8220;schema of a dozen tables&#8221;.</p><p>So imagine that you have this schema of tables and access to many query engines: Databricks, Snowflake, Redshift, DuckDB, Trino, and Spark.</p><p>How powerful would it be if you could connect any of these engines to that schema to read those tables, modify their data, and make new tables &#8212; and others using other query engines could see those changes?</p><p>On top of that, this system involves no expensive copying of data, no FTP servers, no Google Drive, and no direct interaction with Azure Blob storage. You just connect your SQL engine to an Iceberg catalog to read and write your data. This is the promise of the IRC in conjunction with your data platform.</p><p>How might your data team operate differently in this world? This is why <a href="https://www.getdbt.com/blog/introducing-cross-platform-dbt-mesh">we&#8217;re launching cross-platform Mesh</a> to support these exact multi-engine scenarios that more than 50% of our Cloud customers already find themselves in today.</p><h1>A Year of Progress</h1><p>So where are we today with Iceberg, and where are we going as we enter summit season 2025 and beyond? The threads I want to pull on are end-user adoption, data platform vendor integrations, and open source catalogs. I don&#8217;t have a crystal ball, but I&#8217;ll prognosticate a smidge.</p><h3>curiosity?
high! adoption? lukewarm (but growing!)</h3><p>At my Iceberg breakout session at Coalesce in Las Vegas last October, I asked the analytics engineers in attendance to raise their hand if they&#8217;d heard of Iceberg &#8212; all of the hands went up. When I asked those with their hands up to keep them up if they felt they could explain Iceberg to the person sitting next to them, nearly all of the hands went down.</p><p>Tellingly, this included the folks who said their teams were already using Iceberg in production. This isn&#8217;t a problem: it&#8217;s Ryan Blue&#8217;s vision in action! More so, this is the opportunity of Iceberg via IRCs: understanding the technology isn&#8217;t necessarily a prerequisite for adoption. Maybe one person on the team sets it up. For everyone else, it&#8217;s business (analytics) as usual.</p><h3>Data Platforms are showing up in a big way for the IRC</h3><p>So what have the data platforms and other independent software vendors (ISVs) been up to in the past year?</p><p>HOLY COW &#8212; SO MUCH!</p><p>It&#8217;s remarkable to see the entire ecosystem embrace an open-source Apache project as the foundation for their products. The vendors that have integrated deserve a huge round of applause. Yes, they&#8217;re just responding to customer demand, and yes, a real reason to invest in Iceberg is that you can reallocate engineers away from maintaining proprietary table formats and toward work that drives more revenue.</p><p>Still, the industry&#8217;s investment deserves praise, especially since taking a more self-interested and cynical approach would have been easier, at least in the short-term.</p><p>Six months ago, we predicted internally that most vendors would support the IRC spec within 6&#8211;12 months.</p><p>Today, after evaluating more private previews than I could possibly count, what progress can we observe?</p><p>If we can interpret &#8220;Iceberg support&#8221; as being compliant with the spec as of six months ago, then our prediction is looking good.
The only major outstanding work is something known as &#8220;external writes&#8221;.</p><p>However, as I&#8217;ve mentioned above, Iceberg itself is still evolving, so our prediction was poorly framed in the first place.</p><p>Maybe the right question to ask is:</p><blockquote><p>When will IRCs be a stable abstraction such that:</p><ul><li><p>end users have a stable, fully-featured interface</p></li><li><p>the Iceberg spec can continue to evolve under-the-hood without heavily burdening data teams using Iceberg?</p></li></ul></blockquote><p>Perhaps this moment comes when data platform catalogs support external writes, and perhaps that will be true in six months. Time will tell!</p><h3>OSS catalogs: important but not for end users</h3><p>Databricks and Snowflake also deserve credit for open sourcing their catalogs: <a href="https://github.com/unitycatalog/unitycatalog">Unity Catalog</a> and <a href="https://github.com/polaris-catalog/polaris">Polaris</a>, respectively. <a href="https://github.com/lakekeeper/lakekeeper">Lakekeeper</a> is another worth calling out for being written in Rust and improving quickly.</p><p>When data teams ask if I recommend self-hosting a catalog, my answer is largely &#8220;No!&#8221;. The exceptions here are teams that have either or both of</p><ul><li><p>enterprise security requirements (think: on-prem, self-managed data centers)</p></li><li><p>a dedicated data platform team with the know-how to deploy critical data infrastructure.</p></li></ul><p>The challenge here is that of uptime and availability. If the IRC is unresponsive, you can&#8217;t query the tables any more. A minority of teams will sign themselves up for this challenge. For most, I think your time is better invested elsewhere.</p><p>Beyond this small minority of data teams, the real value of these projects is for:</p><ul><li><p>data SaaS vendors, who need some catalog functionality</p></li><li><p>prospective data platform customers,
who need help committing to use a proprietary catalog (&#8220;worst case we migrate away and run the OSS catalog ourselves!&#8221;)</p></li></ul><p>I don&#8217;t say this to cast doubt on the technology; in fact, quite the opposite. All of these projects are being used today in production and are &#8220;battle-tested&#8221;. This usage serves to further refine the IRC as a standard. Everyone benefits from this, even users of proprietary catalogs.</p><h2>what might data platforms do differently?</h2><p>IRCs are the clearest option for making Iceberg truly an implementation detail, but adoption is hindered when data platforms don&#8217;t truly integrate the concept into their products. Some examples of this include requiring users to:</p><ul><li><p>create a second catalog within the data platform to make data available elsewhere</p></li><li><p>choose a unique object store path for the data when creating an Iceberg table</p></li><li><p>mount tables individually and manage their refresh</p></li></ul><p>Some data platforms are taking a cautious approach to Iceberg and REST catalogs, worrying that these might create a disjointed experience alongside their native, proprietary table formats. These platforms are instead focused on streamlining their lakehouse experience within their own product suite. While this concern is understandable, this becomes a game of chicken: customers want interoperability, so platforms risk losing customers by maintaining a walled garden. Iceberg has fundamentally changed how data teams evaluate tools&#8212;any platform without a clear Iceberg strategy now receives a "lock-in" red flag during vendor evaluations, even if said team has yet to start using Iceberg.</p><h1>what questions are on my mind for this Summer&#8217;s Summits and beyond?</h1><p><a href="https://www.icebergsummit2025.com/">Iceberg Summit</a> is happening next week, both IRL in SF and virtually.
You should check it out!</p><p>As far as what Iceberg announcements I&#8217;m hoping for and expecting come June, here&#8217;s a list of things that, if announced, would be leading indicators for accelerated Iceberg adoption:</p><ul><li><p>support for query engines writing directly to external Iceberg REST catalogs</p></li><li><p>support for mounting a schema&#8217;s worth of Iceberg tables</p></li><li><p>full support for catalog-vended credentials</p></li><li><p>any differentiated features that go beyond the scope of the actual Iceberg spec and are focused on UX and developer productivity</p></li></ul><p>If we get all of this and more, I still have some open questions:</p><ul><li><p><strong>what&#8217;s the multi-region and/or multi-cloud story of Iceberg catalogs?</strong> Right now everything presumes the same cloud and same region, or you suffer painful egress and latency costs</p></li><li><p><strong>how to federate RBAC across query engines?</strong> we still heavily rely upon databases to <code>GRANT</code> access to data.
If the data and its RBAC are managed in the IRC catalog, how is the query engine configured?</p></li><li><p><strong>what are best practices for working with multiple catalogs?</strong> more on that in a future post &#128521;</p></li></ul><p>Thanks so much for reading &#8212; as always the comments and my DMs are open.</p><p> Should you be left wanting more, there are four more sections that shy away less from the technical weeds.</p><h1>Technical Weeds</h1><h2>What about Delta Lake?</h2><p>Some of you will be frustrated that I didn&#8217;t bring up Delta Lake.</p><p>At the time of the Tabular acquisition I remember some people speculating things like this:</p><blockquote><p>Databricks acquired Tabular to squash Iceberg in favor of their open table format Delta Lake.</p></blockquote><p>It was refreshing to see that cynical take be put to rest so soon when <a href="https://vimeo.com/1012543474">this interview</a> was posted between Michael Armbrust and Ryan Blue (creators of Delta and Iceberg, respectively). I love this quote so much:</p><blockquote><p>It was never our intention to start a "format war" and have people spend so much time thinking about storage. It should just work and very few people should have to think about it. You should be able to focus on doing analytics.</p></blockquote><p>To achieve this north star of "you don't have to think about it," they aim to standardize the two projects as much as possible. This isn't just lip service! One example touched on was their plan to standardize the <code>VARIANT</code> type implementation by <a href="https://github.com/apache/parquet-format/blob/master/VariantEncoding.md">pushing it upstream into parquet itself</a>.</p><p>Another great example came through <a href="https://docs.delta.io/latest/deletion-vectors.html">Deletion Vectors</a> (DVs)&#8212;a feature that Delta tables had but Iceberg lacked.
While Iceberg had a comparable feature called "equality deletes," it wasn't nearly as performant.</p><p>Now this work has been merged into the spec, slated for release with the Iceberg V3 table spec. This work represents a true data industry team effort, with contributions from engineers at Databricks, Snowflake, Netflix, Google, and more. If you're feeling brave and curious, and reading &#8220;roaring bitmap&#8221; doesn&#8217;t send you running for the hills, check out <a href="https://github.com/apache/iceberg/pull/11238">the PR</a> and click around!</p><p>There's been much discussion about technology that converts between table formats, like Databricks' UniForm and Apache XTable. While these tools are essential in the short term, they'll ultimately become redundant. I'm seeing strong signals that the Delta and Iceberg teams agree not only on what the most important problems are, but also on how they should be solved. But maybe I&#8217;m being overly optimistic!</p><h2>What about S3 tables?</h2><p>I&#8217;ve long been bullish on the IRC, but the <a href="https://aws.amazon.com/blogs/aws/new-amazon-s3-tables-storage-optimized-for-analytics-workloads/">announcement of S3 Tables Buckets</a> and <a href="https://meltware.com/2024/12/04/s3-tables.html">Nikhil Benesch&#8217;s analysis</a> made me question that assumption.</p><p>I had been thinking of the IRC as an abstraction over object storage, i.e. the REST Catalog would deal with creating, naming, and finding Iceberg files without you having to think about it.</p><p>With Table Buckets it&#8217;s the converse: you think about S3, but don&#8217;t have to create, manage, or reason about an IRC. This isn&#8217;t necessarily a bad thing for either query engine developers or end users.</p><p>For a query engine developer, you could argue that it&#8217;s easier to integrate with S3 than it is to integrate with a still-evolving OpenAPI spec.
They&#8217;re all already familiar with object storage!</p><p>For end users analytics engineers like us, IRCs can be a hurdle to initial Iceberg adoption because you have to set one up before you can create a single table. S3 Table buckets radically simplifies this, in that they have their own catalog behind the scenes. Not only is this catalog wildly performant like many products out of AWS, it also automatically handles maintenance tasks like file compaction. This approach has already borne fruit imho given there&#8217;s a plethora of Iceberg quickstart tutorials out there now (<a href="https://aws.amazon.com/blogs/storage/connect-snowflake-to-s3-tables-using-the-sagemaker-lakehouse-iceberg-rest-endpoint/">Snowflake</a>, <a href="https://duckdb.org/2025/03/14/preview-amazon-s3-tables.html">DuckDB</a>)</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bo9x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bo9x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 424w, https://substackcdn.com/image/fetch/$s_!bo9x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 848w, https://substackcdn.com/image/fetch/$s_!bo9x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 1272w, 
https://substackcdn.com/image/fetch/$s_!bo9x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bo9x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png" width="1422" height="296" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:296,&quot;width&quot;:1422,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/160067584?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bo9x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 424w, https://substackcdn.com/image/fetch/$s_!bo9x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 848w, https://substackcdn.com/image/fetch/$s_!bo9x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 
1272w, https://substackcdn.com/image/fetch/$s_!bo9x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F718b1791-62da-4b45-8b20-6f64c8e97ecf_1422x296.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I&#8217;m still very wary about asking analytics engineers to think about S3 paths when writing SQL to the extent that I still think it is an anti-pattern. This is why <a href="https://docs.getdbt.com/reference/resource-configs/snowflake-configs#base-location">dbt by default will manage the path for you when materializing a Snowflake-managed Iceberg table</a>. With S3 Table buckets there&#8217;s not a clear notion of a namespace to hierarchically organize tables (think: <code>database.schema.table</code>).</p><p>However, just a few months later AWS S3 announced that Table buckets are available via the IRC API, so S3 Table Buckets have proven not to be opinionated about the API. Perhaps their approach is the correct one in providing both UXes.</p><p>That said, there&#8217;s an opportunity to simplify. While it is only natural that the S3 team would collaborate with the AWS Data Catalog team, the result is a rather disjointed end user experience.</p><p>It should not surprise us that the AWS S3 team wants to bring their expertise to making data lake management easier and cheaper. I&#8217;d count on the team to continue evolving this product over time, so you should keep your eyes peeled as well.</p><h2>IRCs: what specifically do they do?</h2><p>At the risk of oversimplifying, what the IRC does is close some remaining gaps that kept SQL on data lakes from feeling like the SQL you&#8217;d expect.</p><h3><s>Attention</s> Naming is all you need</h3><p>One powerful abstraction of traditional SQL databases: all you need to query a table is its name, and you never have to think about where the table&#8217;s data is stored.
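</p><p>To make this concrete, here&#8217;s a toy sketch in Python (all names are hypothetical; this is an illustration, not a real catalog client) of what the catalog does for you: it maps a three-part name to the table&#8217;s current metadata location, so the path stays an implementation detail.</p>

```python
# Toy illustration of "naming is all you need" (hypothetical names throughout;
# not a real catalog client). The catalog maps a three-part table name to the
# table's current metadata location in object storage.

class ToyCatalog:
    def __init__(self):
        self._tables = {}  # "db.schema.table" -> metadata path

    def register(self, name, metadata_path):
        self._tables[name] = metadata_path

    def resolve(self, name):
        # A query engine asks by name; the object store path never
        # surfaces in the SQL a user writes.
        return self._tables[name]


catalog = ToyCatalog()
catalog.register(
    "my_db.my_schema.my_table",
    "s3://my-data-lake/some/folders/my-table/metadata.json",
)
path = catalog.resolve("my_db.my_schema.my_table")
```

<p>A real IRC does far more than a dictionary lookup, of course, but the user-facing contract is the same: you hand it a name, and it hands the engine what it needs to find the data.</p><p>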
You&#8217;ve likely never even thought about how much easier this makes your life until you don&#8217;t have it anymore. But in data lakes, you often need to know the table&#8217;s path in the object store (e.g. S3) to find its data.</p><p>For example, compare a normal SQL three-part name, <code>my_db.my_schema.my_table</code>, with a data lake object store path, <code>S3://my-data-lake/some/folders/my-table/metadata.json</code>.</p><p>I believe that asking analytics engineers to think about S3 paths when writing SQL is an anti-pattern. This is why <a href="https://docs.getdbt.com/reference/resource-configs/snowflake-configs#base-location">dbt by default will manage the path for you when materializing a Snowflake-managed Iceberg table</a>.</p><h3>sir, were you aware that was a red light you just drove through?</h3><p>The other problem that the IRC solves is more behind-the-scenes. When I run a query in Postgres, I never think:</p><ul><li><p>I hope this file lands on disk successfully</p></li><li><p>I hope no one else is trying to write to this table right now</p></li><li><p>What if someone else deletes the files I'm writing?</p></li></ul><p>We SQL users take this all for granted, but this isn&#8217;t possible with a data lake unless you have a catalog! Postgres and many other DBs play &#8220;traffic cop&#8221; for you so you don&#8217;t have to. The IRC fills this role for you on the lake.</p><h3>one API to rule them all</h3><p>The last problem relates to simplifying how data platforms and query engines integrate with Iceberg. Spark has never had a problem integrating with Iceberg because Iceberg is implemented in Java.
But how do you</p><ul><li><p>integrate the Iceberg Java library if your database is written in Python?</p></li><li><p>read from an Iceberg catalog written in Go with a query engine written in Rust?</p></li></ul><p>The IRC solves this problem by proposing a language-agnostic API and a spec for a backend service that does some work that a query engine developer previously would have had to build. This is great because it lowers the barrier to adoption by reducing the required engineering effort to integrate.</p><h2>What about IRC&#8217;s vended credentials?</h2><p>Once you already have an IRC set up and configured (non-trivial work in its own right), the next step is to give a query engine access to it. To do so, by default the query engine must authenticate to two things in order to be able to read and write to the IRC:</p><ul><li><p>the IRC itself (typically with a personal access token)</p></li><li><p>the object store that has the files associated with the Iceberg table</p></li></ul><p>Not only is this a high-friction set-up, the experience isn&#8217;t very intuitive. For example, in this setup, when you ask the IRC for a particular table that you&#8217;d like to read, it will return to you an object store path for a file that has more info. If you don&#8217;t have access to this file in the object store (e.g. S3), you&#8217;re SoL. That&#8217;s why this pattern also requires that the query engine have direct access to the object store.</p><p>However, it doesn&#8217;t have to be this hard! With Vended Credentials, you only need to authenticate to the IRC, and the IRC will provision you access to the files in the object store.
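</p><p>Here&#8217;s a rough sketch of that handoff in Python (all names are hypothetical; the catalog is mocked, not a real IRC client): the engine presents a single credential to the catalog, and the response carries both the metadata location and short-lived storage credentials scoped to the table.</p>

```python
# Sketch of the vended-credentials flow (hypothetical names; a stand-in for a
# real IRC). The client authenticates to the catalog only; the catalog vends
# temporary object store credentials along with the table metadata.

class FakeRestCatalog:
    """Illustrative stand-in for an Iceberg REST catalog."""

    def load_table(self, name, token):
        if token != "my-irc-token":  # the one credential the engine holds
            raise PermissionError("not authorized to the IRC")
        return {
            "metadata_location": "s3://lake/db/tbl/metadata/v3.metadata.json",
            # short-lived credentials scoped to this table's files
            "storage_credentials": {"access_key": "TEMP-KEY", "expires_in": 3600},
        }


def read_table(catalog, name, irc_token):
    """One auth step: no separate object store credentials needed."""
    response = catalog.load_table(name, token=irc_token)
    return response["metadata_location"], response["storage_credentials"]


metadata_location, creds = read_table(FakeRestCatalog(), "db.tbl", "my-irc-token")
```

<p>Contrast that with the default flow above, where the engine would also need its own object store credentials just to follow the metadata pointer.</p><p>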
This is a much simpler workflow than what I experienced my first time using IRCs over a year ago.</p><p>Vended credentials have been in the Iceberg spec since last June, but only recently have they been supported in platforms like Snowflake, Databricks, and SageMaker Lakehouse, after a number of preview periods.</p><p>One query engine writing directly to an external IRC is also vastly simplified by vended creds. You just connect to their IRC and write the table directly, without ever knowing where the data is stored.</p><p>How great to live in a world where, when another team needs data from you, you never have to connect to their FTP server, Google Drive, or Azure blob storage account to put the data; you just write to their IRC.</p><p>A consequence of vended credentials is that the IRC becomes the critical path for accessing data: you&#8217;ll have to refactor your connection later should you decide to stop using an IRC or select a different one. However, the abstraction is simpler because you only need to tell your query engine about the IRC and not about object storage anymore.</p><p>The bear case for vended credentials is that they introduce a third access model, distinct from the native RBAC of storage (i.e. IAM Policy) and the query engine (think database roles and privileges). However, you can&#8217;t have a catalog without RBAC, and the closer that RBAC lives to the data the better. It doesn&#8217;t make sense that a query engine should have roles for accessing the data, especially in a world where multiple query engines will access it!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>this post is long enough as is!
but if you&#8217;re mostly up-to-speed, this section might be for you</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>https://xkcd.com/2501/</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>&#8220;Works like a table&#8221; effectively means ACID transactions</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Why the modern data stack matters in the AI age]]></title><description><![CDATA[I'm at the Modern Data Stack. I'm at the Intelligence Explosion. I'm at the combination Modern Data Stack Intelligence Explosion.]]></description><link>https://roundup.getdbt.com/p/why-the-modern-data-stack-matters</link><guid isPermaLink="false">https://roundup.getdbt.com/p/why-the-modern-data-stack-matters</guid><dc:creator><![CDATA[Jason Ganz]]></dc:creator><pubDate>Sun, 23 Mar 2025 14:04:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/98967b44-da1d-4e5b-9fe7-9c8bef8b2100_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s a mystery that&#8217;s been rattling around in my head.</p><p>In 2021 through early 2023, the data space and specifically &#8220;The Modern Data Stack&#8221; was arguably the highest energy, most dynamic area of the tech sector, and certainly dominated the discourse.</p><p>In late 2022, the ChatGPT moment happened and all of the oxygen, immediately, became sucked up into AI.</p><p>The mystery on my mind has been - what is the connection here? 
Is it just an accident of history that right before we got AI systems that actually work at scale, we were focusing on centralizing and modeling our data?</p><p>It&#8217;s completely possible that the answer is yes. It&#8217;s felt that way at times, but over the past year and particularly over the past few months, as AI systems move outside of narrow chat windows and become more integrated into our workflows, two things are becoming clear:</p><ol><li><p>Complex AI workflows are going to draw many of the learnings from data engineering</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!He7-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!He7-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 424w, https://substackcdn.com/image/fetch/$s_!He7-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 848w, https://substackcdn.com/image/fetch/$s_!He7-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 1272w, https://substackcdn.com/image/fetch/$s_!He7-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!He7-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png" width="1456" height="522" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:522,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!He7-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 424w, https://substackcdn.com/image/fetch/$s_!He7-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 848w, https://substackcdn.com/image/fetch/$s_!He7-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 1272w, https://substackcdn.com/image/fetch/$s_!He7-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F386f7a9e-5294-40fb-80ed-f42df2c86b7c_1820x652.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" style="height:20px;width:20px" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol start="2"><li><p>LLMs will need access to the data produced by your analytics workflows in order to be truly useful for many use cases</p></li></ol><p>Both of these facts feel relatively obvious now and immediately valuable in the near term, compared to other ideas that felt more &#8230; grasp-at-strawsy like &#8220;data teams will be the keepers of your organizational data to custom fine tune models for your business&#8221;.</p><p>So what changed? 
Why do these feel practical and useful now compared to even a year ago?</p><p>Because we&#8217;re actually starting to roll LLM systems out in the real world - and <em>quickly.</em> Even in their nascent state it&#8217;s clear that this is not hype, that there&#8217;s real value here, today. But it&#8217;s also becoming clear that the problems and lessons that brought us to the modern data stack haven&#8217;t gone away in this brave new world - although the ways those problems are solved and the systems they are being solved within may change dramatically.</p><p>At this point, you are probably familiar with the frustrating experience of going to a new website and being presented with some sort of chatbot interface and not being entirely certain what questions you can ask it. </p><p>The chatbot probably works extremely well for the set of context it has access to. But you probably don&#8217;t know what exactly it has access to, and that underlying guessing game means that what should be (and often are!) incredibly useful interfaces end up feeling slapped on and piecemeal.</p><p>The problem is largely not the models, which have gotten extremely good for most questions you might ask of them. The problem is that they often literally don&#8217;t have the right information to give you the correct answer. </p><p>It does not matter how smart a model is, or how good it is at in-context learning, if the only answer or a path to the answer <em>can&#8217;t be added into its context</em> because it&#8217;s locked somewhere in a single Reddit post from last week, a proprietary document, or your data warehouse. The good news is that we&#8217;re <a href="https://www.anthropic.com/news/model-context-protocol">quickly moving towards</a> a world where that context isn&#8217;t locked up anymore and there are protocols and standards for accessing it. </p><p>But what context should be provided? How do you know it&#8217;s right? Is it accurate?
Who has access to it?</p><p>These are all questions we&#8217;re going to have to learn to answer in our AI system. And it&#8217;s gonna be a doozy.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fwP5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fwP5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 424w, https://substackcdn.com/image/fetch/$s_!fwP5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 848w, https://substackcdn.com/image/fetch/$s_!fwP5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 1272w, https://substackcdn.com/image/fetch/$s_!fwP5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fwP5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png" width="1456" height="363" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:363,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119766,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/159638568?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fwP5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 424w, https://substackcdn.com/image/fetch/$s_!fwP5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 848w, https://substackcdn.com/image/fetch/$s_!fwP5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 1272w, https://substackcdn.com/image/fetch/$s_!fwP5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b35d49-5629-4c9a-b361-bcd467ecea94_1726x430.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p></p><h3>It is extremely non-trivial to feed the right context to LLMs at the right time</h3><p>Probably the most interesting thing about adding new knowledge sources to LLM workflows is 
the way that LLMs magnify both the best and worst aspects of working with a particular system, almost to a fault. The time to answers is quicker, you can pull threads together - you&#8217;re always feeling movement. But any cracks in the system you&#8217;re using to feed the model context become immediately obvious.</p><p>We talk about &#8220;context&#8221; like it&#8217;s a monolith, but the underlying context we&#8217;re feeding is ultimately going to look something like &#8220;all of the mechanisms that humans have created for storing and conveying information&#8221;.</p><p>Let&#8217;s look at how this is going in practice - the good and the challenges:</p><p><strong>LLMs + Internet search: </strong></p><ul><li><p>What it is: Integrate public web data into LLM queries</p></li><li><p>Why it&#8217;s great: This massively broadens the ability of LLMs to pull in information. This is really useful when you require specific, granular information pulled from the real world (I like to use this for finding restaurants).</p></li><li><p>The challenges: SEO-bait works extremely well on LLMs. Under the hood, it&#8217;s still performing some sort of traditional web search - the same tools and tactics that people use to climb to the top of Google searches work on an LLM. Try an experiment right now - go do a Google search for &#8220;Best Pillow 2025&#8221;. Do you have any reasonable way of breaking down the answers and getting to ground truth?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> If it&#8217;s hard for you, it&#8217;s going to be hard for the LLM.</p></li></ul><p><strong>LLMs + Internal company documentation: </strong></p><ul><li><p>What it is: Search over internal, unstructured data in tools like Notion and Slack</p></li><li><p>Why it&#8217;s great: I want to be incredibly clear - I <strong>love</strong> NotionAI. 
It is the single closest I&#8217;ve ever felt to being able to fully wrap my head around a complex organization of hundreds of people and be able to learn what other teams are working on. It fundamentally broadens my aperture in knowing what is going on at dbt Labs and why - from answering quick policy questions to making sure I can keep track of the latest company objectives.</p></li><li><p>The challenges: Unless you have incredibly strong document hygiene, you&#8217;re going to find messy and conflicting information. A simple question like &#8220;when is Coalesce 2025&#8221; can sometimes end up with 3 results - maybe we were initially thinking a different date and that date still lives in a document somewhere. Maybe someone just accidentally typed in the wrong date and left it there. The models need, as part of their context, not just all of your documentation, but signals as to which documentation is up to date, correct, and organizationally approved. </p></li></ul><p><strong>LLMs + Metrics and structured data: </strong></p><ul><li><p>What it is: All of the data that lives in your data platforms, your key metrics, customer information, business entities</p></li><li><p>Why it&#8217;s great: I don&#8217;t want to be too dramatic here, but being able to analyze a complex dataset using conversational analytics on top of a <a href="https://www.getdbt.com/blog/introducing-dbt-for-snowflake">trustworthy interface</a> feels like magic. Especially for data that you know well, it honestly feels like you are getting superpowers, with the answer to any question you might have available at your fingertips.</p></li><li><p>The challenges: What could possibly go wrong giving an LLM access to your data warehouse? Text-to-SQL is good and getting better, and I have no doubt that we will get to the point where, given a well-structured problem and sufficient context about the underlying data, LLMs are going to be very successful at getting an answer that is reasonable and correct. 
But will it:</p><ul><li><p>Be consistent across an organization? Will it be a single source of truth, based on vetted and well-understood business concepts? Or will it be a very clever, vibe-coded 1,200-line SQL script that no one is realistically ever going to read?</p></li><li><p>Can you put it in the hands of your executives and have them <strong>trust the output?</strong> Not just as interesting anecdata - but as something that they can make decisions and take action off of?</p></li><li><p>Is it going to be governed? Will it know who the end user behind the query is, what data they should have access to and what they shouldn&#8217;t?</p></li></ul></li></ul><p>Each of these information sources adds impressive and interesting capabilities to the underlying power of LLMs. When combined in a single interface, there is a combinatorial explosion of usefulness<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> as the different information sources each unlock new capabilities.</p><p>Imagine you are the CEO of a restaurant chain planning whether to expand into a new territory. With access to these three information sources, you could prepare a deep-research-style report that:</p><ul><li><p>Understands the business context and strategy for a potential expansion based on your internal company documentation</p></li><li><p>Has the ability to query and understand the actual data and financial metrics available to you via your data platform</p></li><li><p>Searches the web for macro-level data and benchmarks about the location you&#8217;re planning to expand into</p></li></ul><p>Sounds incredible, right? This is totally doable today - although it requires a bit of duct-taping systems together to make it work.</p><p>But it also sounds &#8230; daunting. Because each of the failure modes described above can cascade throughout this system. 
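</p><p>To make the governance question above concrete: one cheap guardrail is to check, before a generated query ever runs, that the person behind the request is allowed to see every table it touches. Here is a deliberately naive sketch - the regex-based table extraction and the hard-coded per-user allow-lists are stand-ins for a real SQL parser and your warehouse&#8217;s actual grants:</p><pre><code class="language-python">import re

# Hypothetical per-user allow-lists. In a real system these would come
# from the warehouse's own grants or an access-control service.
ACCESS = {
    "ceo": {"finance.revenue", "ops.locations"},
    "analyst": {"ops.locations"},
}

def referenced_tables(sql):
    # Naive extraction: grab schema.table names that follow FROM or JOIN.
    return set(re.findall(r"(?:from|join)\s+([a-z_]+\.[a-z_]+)", sql, re.I))

def authorize(user, sql):
    # The generated query may run only if every table it references
    # is on the requesting user's allow-list.
    return referenced_tables(sql).issubset(ACCESS.get(user, set()))

query = ("select region, sum(amount) from finance.revenue "
         "join ops.locations on revenue.loc_id = locations.id "
         "group by region")
assert authorize("ceo", query)
assert not authorize("analyst", query)</code></pre><p>A check like this is table stakes rather than a solution, but it illustrates the point: the context an LLM needs includes who is asking, not just what the data means.</p><p>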
What if the model YOLOs a revenue definition that&#8217;s reasonable, but based on information from an out-of-date company document? What if it pulls benchmarked financial data from the web that turns out to be wrong? The real world contains a whole lot of complexity, and our systems need to be designed in a way that captures the benefits here while having the appropriate guardrails to establish some sort of ground truth. </p><h3>So why now?</h3><p>I want to return now to the question that I posed at the start of this - why were we all convinced that building systems for centralizing and managing your company data at scale was the right problem to solve directly before LLMs started to soar?</p><p>The answer lies in recognizing that what seemed like an accident of timing was actually a foundation being laid. The Modern Data Stack is not just about better dashboards&#8212;it is about creating standardized and reliable workflows and interfaces across your entire data ecosystem that can power increasingly sophisticated use cases, at scale. It turns out this is just as necessary for AI as it is for humans.</p><p>We built the modern data stack to address fragmentation, improve data governance, and ensure consistent, reliable data. What we didn&#8217;t fully realize at the time was that this was an essential piece of context for LLM applications as well.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>We finally have both the data foundations and the AI models to make genuinely useful, reliable AI-driven and data-enriched workflows possible. 
The explosion of interest in AI didn&#8217;t displace the need for the modern data stack&#8212;it just took some time for these systems to begin to speak to each other.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p>Moving forward, we&#8217;re going to increasingly see the role of the data practitioner as providing governed, trustworthy data for LLMs and, potentially, as using many of those same systems to enable safe, reliable deployment of AI systems.</p><p>I feel huge opportunity in this area. I also feel a lot of responsibility for us, as a community, to get this right. These systems are being experimented with, and in some cases deployed <strong>today</strong>, and there is real institutional heft behind their rollout. We have the tools and the (forgive the pun) agency to make an impact on how that happens.</p><p>There is a tremendous amount of brainpower that reads this newsletter - people orchestrating the most complex data flows at the largest organizations on the planet and building the tooling that will enable it. There are a lot of unknown unknowns in terms of how we build the bridge from LLMs to our structured data, but I believe we&#8217;ve got the right set of humans in place here to begin meaningfully answering this question.</p><p>Let&#8217;s get after it. Want to talk about any of this? 
<a href="https://www.getdbt.com/community/join-the-community">You know where to find me</a>.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Ok, for this one ground truth is &#8220;it&#8217;s a pillow, stop thinking about it so hard&#8221; but you get the point</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>And danger!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Actually <a href="https://roundup.getdbt.com/p/analytics-intelligence-everywhere">we did realize it</a>, it just took some time to connect the threads</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>After we ran our first experiment showing the value of a Semantic Layer in natural language questioning - Benn Stancil raised the question of whether we were likely to get &#8220;<a href="https://www.cs.utexas.edu/~eunsol/courses/data/bitter_lesson.pdf">bitter lessoned</a>&#8221; as models improved and cut out the need for intermediary interfaces. It&#8217;s an important question and one we&#8217;ll dive into more in the future. But even systems of arbitrary intelligence need tools! It doesn&#8217;t matter how smart an LLM is, if you want an LLM to unload your dishwasher, you&#8217;re going to have to put in a robot. Will the same hold true for the three methods of gathering information listed above? 
Time will tell, but signs point to yes in the near term.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[How AI will Disrupt Data Engineering As We Know It]]></title><description><![CDATA[It will be hard to compare data engineering in 2024 and data engineering in 2028 and say &#8220;those are the same things.&#8221;]]></description><link>https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering</link><guid isPermaLink="false">https://roundup.getdbt.com/p/how-ai-will-disrupt-data-engineering</guid><dc:creator><![CDATA[Tristan Handy]]></dc:creator><pubDate>Sun, 16 Mar 2025 11:02:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!j9HW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!j9HW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!j9HW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 424w, https://substackcdn.com/image/fetch/$s_!j9HW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 848w, 
https://substackcdn.com/image/fetch/$s_!j9HW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!j9HW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!j9HW!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp" width="1200" height="685.7142857142857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:661200,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://roundup.getdbt.com/i/159149000?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!j9HW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 424w, 
https://substackcdn.com/image/fetch/$s_!j9HW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 848w, https://substackcdn.com/image/fetch/$s_!j9HW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 1272w, https://substackcdn.com/image/fetch/$s_!j9HW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1a805f3-2215-4c7a-a4b9-9b0b4aea3642_1792x1024.webp 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" style="height:20px;width:20px" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line 
x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Credit: Dall-E</figcaption></figure></div><p><a href="https://roundup.getdbt.com/p/a-year-of-innovation-in-ai-part-1">Last time I wrote</a> I dove into a bunch of AI advancements that have happened over the past year. Reasoning models, chain of thought, inference-time compute, etc. And there&#8217;s more to explore there and I need to return to that series.</p><p>But for this week&#8217;s issue I want to pause on that and talk about AI from a different perspective. I want to think, as rationally as we can about an uncertain future, about how the job of the data engineer will change over the coming 1-2-3 years as a result of AI.</p><p>I am quite confident these changes will be massive. I think the word <em>disrupt</em> is not at all hyperbole&#8212;<strong>I think it will be hard to compare data engineering in 2024 and data engineering in 2028 and say &#8220;those are the same things.&#8221;</strong></p><p>It just turns out that many of the tasks that data engineers do every day are tasks that AI can provide tremendous leverage in. I don&#8217;t know what the % efficiency metric will be&#8212;20%? 50%? 80%?&#8212;but I think it&#8217;s totally possible that it&#8217;s on the higher end of that range.</p><p>I think that will be both good for data engineers and good for the companies they work for. Data engineers will have more work to do than ever (<a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a> at work), but it will be more strategic, add more value to companies, and will likely see them get raises. 
Companies will get the higher-functioning, higher-ROI, more accessible data systems that have always seemed out of reach.</p><p>In this post I want to look at the specific tasks that data engineers spend their time on, and at how addressable-or-not these tasks are with AI.</p><p>Let&#8217;s dive in.</p><h2>The Tasks of a Data Engineer</h2><p>AI doesn&#8217;t replace jobs; it automates tasks. So let&#8217;s look at the tasks that someone leveled as a Senior Data Engineer most commonly spends time on today. Of course, it should go without saying, YMMV: there is no single canonical job description for a data engineer. But I think we can still get close enough to reason about.</p><p><strong>What does a Senior Data Engineer spend their time on?</strong></p><ul><li><p>Create technical artifacts</p><ul><li><p>Landing new data. Building and maintaining automated data ingestion pipelines.</p></li><li><p>Transforming raw data into bronze, then silver, then gold layers. Includes authoring brand-new pipelines as well as refactoring existing pipelines to handle new business requirements.</p></li><li><p>Defining metrics on top of transformed data.</p></li><li><p>Writing tests and documentation.</p></li><li><p>Monitoring costs of data infrastructure and refactoring code to optimize performance characteristics.</p></li><li><p>Reviewing pull requests from peers.</p></li><li><p>Monitoring production jobs and declaring incidents related to either pipeline failures or observability / quality issues. Investigating and resolving those incidents.</p></li></ul></li><li><p>Liaise with stakeholders and peers</p><ul><li><p>Answering questions about currently-available data assets like &#8220;which data set should I use?&#8221; and &#8220;can I trust this?&#8221;</p></li><li><p>Collaboratively designing changes to existing data assets to accommodate new requirements. 
Conversations like &#8220;what are the edge cases I need to know about when calculating cost of goods sold?&#8221;.</p></li><li><p>Stakeholder enablement and education.</p></li><li><p>Designing the overall architecture of the DAG, including modularization, team boundaries and ownership, modeling best practices, etc.</p></li></ul></li></ul><p>I&#8217;m sure you could find some other things to put on these lists, but I feel like they&#8217;re pretty representative. Feel free to tell me what I&#8217;m forgetting.</p><h2>The Role of Frameworks and Tooling in an AI-centric World</h2><p>Many of the above tasks are already doable with AI. And I want to talk more about that. But before I get there, it&#8217;s important to talk about frameworks, and how important frameworks are to an AI-centric world.</p><p>Claude 3.7 will write you almost any kind of code you could want. You can absolutely build a pipeline from the ground up, building ingestion, transformation, testing, etc. in Lisp. In Assembly. In the style of Guido van Rossum. Whatever. You could even imagine a world in which you had 1,000 distinct pipelines and every one was written in a different language or framework or set of conventions. All reading from and writing to a shared corpus of tabular data.</p><p>But: just because it is now conceivable to create such a codebase, <em>is it a good idea?</em></p><p>The answer is: no. 
<strong>Obviously not.</strong> Just as a team of humans would find it nearly impossible to maintain such a Frankenstein, the heterogeneity would make it intractable for LLMs as well.</p><p>This <em>intuition pump</em> is helpful to get us to an important conclusion: AI will be more effective as an accelerant when:</p><ul><li><p>a code base is fewer lines of code (less room for error)</p></li><li><p>a code base is more consistent rather than less consistent: in languages, in coding conventions, in design</p></li><li><p>a code base uses consistent CI/CD and other developer tooling</p></li><li><p>a code base uses consistent and well-documented logging / observability</p></li><li><p>a code base uses well-documented best practices also employed by a large community of users.</p></li></ul><p>In general: code bases that are more concise, more homogeneous, and built on standard tools that are well-documented in the model training data (i.e. the public internet) will be more comprehensible to AI systems.</p><p>One of the best ways to make all of these things true at the same time is to use frameworks and open standards. Claude 3.7 knows how to reliably build Airbyte ingestion pipelines because the framework is well documented and there are a lot of examples published. It&#8217;s also fantastic at writing dbt code for the same reasons. If you&#8217;re able to give it an environment where it can test its own code and validate downstream models as a part of its CoT, code quality goes up even further. Standardized frameworks also emit well-understood error messages, which pushes code quality up further still.</p><p>In short: good frameworks, tooling, and standards are <em>just as important</em> for AI as they are for humans. And the wonderful thing about AI is: it is infinitely adaptable to whatever frameworks, tooling, and standards you want to use. No learning curves. 
Finally the promise of a consistent code base.</p><h2>How many of these tasks are already doable?</h2><p>Got it, frameworks are powerful in an AI world. Now let&#8217;s look at the individual tasks that data engineers spend time on and try to figure out how tractable they are.</p><p>In answering this question I am <em>not</em> going to assume massive improvements in model capability. Even with modest improvements I believe all of this will become true. What is fundamentally needed is productization of currently-available models directed at the specific needs of data engineers, not the invention of new frontier tech.</p><h3>Creating Technical Artifacts</h3><ul><li><p><strong>Ingestion pipelines</strong> With nothing but Cursor you can already <a href="https://en.wikipedia.org/wiki/Vibe_coding">vibe code</a> your way to a working ingestion pipeline from basically any data source with a publicly-available API. You can already add pagination and solve edge cases and inject instrumentation. It&#8217;s unclear, though, if this is actually what is needed. I still fundamentally don&#8217;t think most data movement code should be written and maintained within the walls of an individual company&#8212;AI or no, I still want to hire a vendor or support a community project. Data engineers shouldn&#8217;t be spending a lot of time on this problem today and likely shouldn&#8217;t be in the future either. When a custom build is required, AI can already do it well; try it yourself in Cursor today.</p></li><li><p><strong>Authoring new data transformation assets</strong> If you&#8217;re using dbt, data transformation is very soon to become <em>heavily</em> AI-enabled. Whether you&#8217;re building models, writing documentation and tests, or defining metrics, this is coming to you <em>very soon</em>. 
We demoed some of these capabilities at Coalesce and will have more to share on Wednesday at our <a href="https://www.getdbt.com/resources/webinars/dbt-developer-day">dbt Developer Day</a>. While we are certainly still in the early stages of where we ultimately want to get to, dbt Copilot is already <em>very</em> good at all of these authoring tasks, and there is a very clear path to getting even better. Nick Schrock, in one of his best posts ever, called dbt and tools like it <a href="https://dagster.io/blog/the-rise-of-medium-code">medium-code frameworks</a>. It turns out that medium-code frameworks are extremely well-suited for AI. Having personally used dbt Copilot, I anticipate that the time required to author new transformation code will drop very significantly for data engineers.</p></li><li><p><strong>Multi-file refactoring</strong> One thing that Cursor now does super-well is stage multi-file edits as a result of a single prompt. You could imagine a similar prompt in dbt: &#8220;refactor code in these two parts of the DAG to minimize duplication; combine models where appropriate.&#8221; Or: &#8220;A new field was added in this data source. Please pull that field all the way through the DAG into [X] final model.&#8221; These types of refactoring tasks are low-creativity but highly time-intensive. Implementing them is product work, not research. The opportunity to get a handle on tech debt with tooling like this makes me giddy.</p></li><li><p><strong>Automated incident resolution</strong> Imagine feeding the entire log output of a <code>dbt run</code> and the associated project code into a context window and getting back a diagnosis and proposed resolution. While we haven&#8217;t productized this experience yet, it&#8217;s not hard to experiment with this yourself hackathon-style. Imagine a world in which, following a pipeline failure, a full PR was queued up and run through CI, with a full report waiting for you and just ready to hit the merge button. 
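</p><p>None of this requires frontier models to prototype. A hackathon-style sketch of the first step - scraping the failure out of a <code>dbt run</code> log and assembling a diagnosis prompt - might look like the code below. Everything here is invented for illustration: the log lines, the model names, and the idea of keying on &#8220;Error in model&#8221; are assumptions, not the real dbt log contract:</p><pre><code class="language-python">def build_diagnosis_prompt(run_log, model_sql):
    # Pull out the lines that report a failing model (format assumed
    # for illustration; real dbt logs carry more structure than this).
    errors = [ln for ln in run_log.splitlines() if "Error in model" in ln]
    # Attach the source of each model named in an error line, so the
    # LLM can propose a concrete fix rather than restate the error.
    failing = [m for m in model_sql if any(m in ln for ln in errors)]
    sources = "\n\n".join(f"-- {m}.sql\n{model_sql[m]}" for m in failing)
    return ("A dbt run failed. Diagnose the failure and propose a fix.\n\n"
            "Errors:\n" + "\n".join(errors) +
            "\n\nRelevant models:\n" + sources)

log = """2 of 2 ERROR creating table model analytics.fct_revenue
Database Error in model fct_revenue (models/marts/fct_revenue.sql)"""
prompt = build_diagnosis_prompt(log, {
    "stg_orders": "select * from raw.orders",
    "fct_revenue": "select sum(order_totl) from stg_orders",
})
# `prompt` now holds the error plus only the failing model's SQL, ready
# to hand to whatever model and PR automation you have wired up.</code></pre><p>The productized version is mostly plumbing around exactly this loop: collect the failing context, ask for a fix, open the PR, run CI.</p><p>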
We should anticipate this type of experience for data engineers in the not-too-distant future. How much time are you currently spending on break/fix? Slash it significantly.</p></li></ul><p>I&#8217;m going to pause there because I&#8217;m at risk of boring you. Suffice it to say that I truly believe that a) much data engineering work has already been framework-ized, and b) AI will now make creation of, iteration on, and maintenance of these technical artifacts <em>far more efficient.</em> And for the aspects of data engineering that are not yet framework-ized (dbt or otherwise), there will be tremendous gravity towards pulling them into a framework because of the leverage that these types of high-quality AI experiences will provide.</p><h3>Liaising with stakeholders and peers</h3><p>There are countless people throughout the business who use data as a core part of their jobs, and data engineers are <em>constantly</em> fielding questions from them. I won&#8217;t re-list them all here, but if you&#8217;re a data engineer you know the drill. Forever, the hope of &#8220;self-service&#8221; has been that these data users would not need to lean on data engineers in this way&#8212;these interactions inject friction and slowness that neither side wants.</p><p>This fully actualized self-service has never materialized, and the status quo has been frustratingly persistent. But I&#8217;m optimistic that we have more of a path today than ever.</p><p>The easiest thing for any technology vendor to do at the very onset of the AI era was to take all of the domain-specific context that you had and surface it to users in a chat interface. And we did the same thing. It was (and is) quite good&#8212;it does a great job of allowing users to ask business questions and answering them with semantic-layer-governed responses.</p><p>The problem with this approach is that users don&#8217;t actually want to interact with dozens of chat interfaces. 
They don&#8217;t want to remember to go to a given tool to get one type of answer and another tool for another type of answer. There will not be 30 chat experiences all with different context. There will be one&#8230;or maybe just a few. But likely a single dominant one.</p><p>This is how <a href="https://stratechery.com/aggregation-theory/">aggregators</a> work. You likely don&#8217;t use a bunch of different search engines&#8212;you probably just use one, and it is probably Google. This is how chat will go as well.</p><p>The problem is, Google could scrape the web and respond to all queries based on that knowledge. But ChatGPT cannot know all of the information you want to ask it questions about (at least, yet). That lack of business context is the problem.</p><p>That&#8217;s where a <em>context protocol</em> comes in. A context protocol&#8212;a somewhat new topic in the public AI conversation&#8212;is a standardized way for services to provide additional context to models via an open protocol. The most promising one today is called <a href="https://modelcontextprotocol.io/introduction">MCP</a>, but whether or not MCP wins, the awareness/excitement/support for this idea has developed a ton of momentum, and I am fairly convinced that <em>something like this</em> will become real and widely supported.</p><p>There will be a large number of context providers (every source of valuable enterprise context) and a large number of context consumers (different products with AI capabilities). There is no way to create point-to-point integrations to facilitate this. A protocol will be needed if we are going to see the right type of advancements, and I think it will happen.</p><p>Imagine that your license to ChatGPT Enterprise or Claude Desktop or whatever <em>already came with</em> a connection to all of the metadata about every piece of structured data you had access to. What was there, how trustworthy it was, how suitable it was for the analysis you were describing, etc. 
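</p><p>Mechanically, a context provider in this mold is not exotic. MCP, for example, is built on JSON-RPC-style requests and responses. The toy dispatcher below mimics that shape to serve dataset trust metadata - the method names echo the public MCP spec, but treat every detail here (the catalog, the tool name, the fields) as illustrative rather than as the real protocol:</p><pre><code class="language-python"># A toy, MCP-flavored "context provider": a JSON-RPC-style dispatcher
# serving trust metadata about datasets. Illustrative only; a real MCP
# server also handles initialization, transports, schemas, and auth.
CATALOG = {
    "analytics.fct_revenue": {"owner": "data-eng", "certified": True},
    "scratch.rev_v2_tmp": {"owner": "unknown", "certified": False},
}

def handle(request):
    method = request["method"]
    params = request.get("params", {})
    if method == "tools/list":
        result = {"tools": [{"name": "describe_dataset",
                             "description": "Return trust metadata for a dataset"}]}
    elif method == "tools/call" and params.get("name") == "describe_dataset":
        dataset = params["arguments"]["dataset"]
        result = CATALOG.get(dataset, {"error": "unknown dataset"})
    else:
        result = {"error": "unsupported method " + method}
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

response = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/call",
                   "params": {"name": "describe_dataset",
                              "arguments": {"dataset": "analytics.fct_revenue"}}})
assert response["result"]["certified"] is True</code></pre><p>The particular shape matters less than the consequence: once a protocol like this wins, trust signals - ownership, certification, freshness - become context that any chat surface can consume on your behalf.</p><p>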
I think that, very quickly, you would find yourself asking questions of your friendly AI rather than shoulder-tapping your colleague in data engineering.</p><p>That&#8217;s not to say that the existing relationship would <em>go away</em>, but I do think that this would represent a true reset of the working relationship between data engineers and downstream business stakeholders&#8212;one that both sides would benefit from.</p><h2>Where does that leave us?</h2><p>Over the past two years, critical innovations have been made in foundational AI technology. Chain of thought, reasoning models, inference-time compute, agentic workflows. These are the ingredients needed to build the AI-enabled data engineering future. They are now here.</p><p>And open frameworks&#8212;from dbt to Spark to Airbyte to others&#8212;have become widely deployed. This makes it possible to create great framework-specific AI tooling, both by the commercial stewards of those frameworks (including us) and by any other vendor.</p><p>The commercial incentive to innovate here is high, and there could not be more attention on delivering these types of benefits within companies of all sizes. This is going to happen, and data engineering as a profession is never going to be the same.</p><p>So what? Time to get a new job? Data engineers are obsolete?</p><p>Hardly. Data engineers, one of the hottest jobs of the last decade, will stay hot. But practitioners will be pushed in one of three directions: towards the business domain, towards automation, or towards the underlying data platform.</p><ul><li><p><strong>Data Platform Engineers</strong> will become ever more important. They don&#8217;t spend their time building pipelines, but rather on the infrastructure that pipelines are built on. 
They are responsible for performance, quality, governance, uptime.</p></li><li><p><strong>Automation Engineers</strong> will sit side-by-side with data teams, taking the insights coming out of the data and building business automations around them. As a data leader recently told me: &#8220;I&#8217;m no longer in the business of insights. I&#8217;m in the business of creating action.&#8221;</p></li><li><p><strong>Data Engineers</strong> who are primarily obsessed with business outcomes will have ample opportunity to act as enablement and support for the insight-generation process, from owning and supporting datasets to liaising with stakeholders. The value to the business won&#8217;t change, but the way the job is done will.</p></li></ul><p>You&#8217;ll hear a lot more from us <a href="https://www.getdbt.com/resources/webinars/dbt-developer-day">on Wednesday</a> about how we&#8217;re making this future a reality for dbt users. I&#8217;m excited to disrupt the decade-long status quo and build something better.</p><p>- Tristan</p>]]></content:encoded></item></channel></rss>