Building a multimodal lakehouse for AI (w/ Chang She)
LanceDB CEO Chang She and Tristan go deep into the bridge between analytics and AI engineering
Welcome back to The Analytics Engineering Podcast! Last season, we explored a host of topics on the developer experience (something the dbt Labs crew has been pretty vocal on recently). This season, we’re expanding that theme to look at how the current data landscape is impacting the developer experience. Open data infrastructure is on the rise; AI is pushing teams to rethink how data is modeled, governed, and scaled; and the developer experience is evolving.
In this episode, Tristan Handy sits down with Chang She—a co-creator of Pandas and now CEO of LanceDB—to explore the convergence of analytics and AI engineering.
The team at LanceDB is rebuilding the data lake from the ground up with AI as a first principle, starting with a new AI-native file format called Lance and building upward from there.
Tristan traces Chang’s journey from early contributor to the pandas library to building a new infrastructure layer for AI-native data. Learn why vector databases alone aren’t enough, why agents require new architecture, and how LanceDB is building an AI lakehouse for the future.
Please reach out at podcast@dbtlabs.com for questions, comments, and guest suggestions.
Listen & subscribe from:
Key takeaways
Tristan Handy: You’re the founder and creator of the Lance file format and LanceDB. Before diving into vector search and vector databases, tell us about your background.
Chang She: I love talking to analytics engineers because that’s my background. I started about 20 years ago in quantitative finance. As a junior analyst, you do a lot of data engineering and analytics, which got me into open-source Python. I became one of the co-authors of the pandas library—initially to solve my own problem of not wanting to do analytics engineering in Java or VBScript.
You worked for a hedge fund?
Yes, AQR.
Did they know you were contributing to pandas? Hedge funds aren’t known for open source.
My roommate and colleague at the time was Wes McKinney. He showed me a proprietary Python library he was working on. It was life-changing. I started using and contributing. He spent about six months convincing the fund to open-source it. This was around 2010, and they were ahead of the industry in that respect.
I didn’t know pandas started at AQR. That’s fascinating. So much of your circa-2010 analytics work was done in early pandas?
Exactly. We went through several iterations, even debated the name. Because it was a hedge fund, there was a lot of econometrics and “panel data,” so Wes named it “pandas” for panel data analysis.
That origin story isn’t widely known. You then founded two companies, sold one to Cloudera, and were there during an interesting time.
Wes and I created DataPad—cloud BI before cloud BI really took off—and sold it to Cloudera. I spent about four and a half years in the Hadoop “big data” world, where I met my co-founder. He worked on HDFS at Cloudera, and several ex-Cloudera folks are at LanceDB today. After that I moved into machine learning at Tubi TV, working on recommender systems, ML serving, and experimentation/AB testing. That exposed me to embeddings. We dealt with videos, poster art images, and synopses—data that doesn’t fit neatly into pandas or even Spark data frames. That inspired me to build better infrastructure for these data types—what we now call “classical” machine learning—which led to LanceDB.
So that’s our bridge to vectors. You experienced these problems at Tubi, then founded the company. And Tubi used dbt?
Heavily. Thank you for creating it—it was critical to our stack.
Give us a non-technical intro: what are vectors used for?
Many people focus on the latest models and techniques. My perspective: everyone has access to similar models—your differentiation comes from your data and how effectively you connect data to AI. Vectors are a way to represent any kind of data in a form models understand: high-dimensional arrays of floating-point numbers—1,500, 3,000 dimensions, etc. Early statistical models might have a few interpretable dimensions; now you can have thousands where individual dimensions aren’t necessarily interpretable, but the space captures semantics.
Beyond RAG, vectors power internal model representations, recommender systems, and personalization—the original mainstream use case.
Search is also a good use case. How is vector search different from full-text search or Command-F?
Full-text search (e.g., Elasticsearch) returns documents containing the exact terms you searched. If you search for “customer,” it finds “customer/customers,” but might miss “user,” “adopter,” “organization,” etc. Vector search uses dense representations where semantically similar words and documents live near each other in high-dimensional space. Search for “customer,” and you get results that include semantically related terms.
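The contrast can be sketched in a few lines. This is a toy illustration, not a real search engine: the three-dimensional "embeddings" are hand-picked so that "customer" and "user" land near each other, whereas production models use thousands of dimensions learned from data.

```python
import math

# Hand-picked toy embeddings (real models use ~1,500-3,000 learned dims)
embeddings = {
    "customer": [0.9, 0.8, 0.1],
    "user":     [0.85, 0.75, 0.15],  # semantically close to "customer"
    "banana":   [0.1, 0.05, 0.95],   # unrelated
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Full-text search: exact lexical match only
def keyword_match(query, term):
    return query == term

# Vector search: rank every term by similarity in embedding space
query = "customer"
ranked = sorted(embeddings,
                key=lambda t: cosine(embeddings[query], embeddings[t]),
                reverse=True)

print(keyword_match("customer", "user"))  # False: no lexical overlap
print(ranked[:2])                         # ['customer', 'user']: semantic neighbors
```

Keyword matching misses "user" entirely, while the vector ranking surfaces it as the nearest semantic neighbor.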
Would you combine vector and full-text search?
Yes—hybrid search. Early RAG demos often used pure vector search for speed. Now enterprises need production-grade relevance. Many combine keyword and vector search with a re-ranking step to reach higher precision/recall.
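One common way to merge the two result lists before the re-ranking step is reciprocal rank fusion (RRF); the sketch below uses made-up document IDs and is just one of several fusion strategies teams use in practice.

```python
# Reciprocal rank fusion: fuse a keyword ranking and a vector ranking
# into one list a re-ranker can then refine.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-id lists; returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["d3", "d1", "d7"]   # BM25-style exact-term ranking
vector_hits  = ["d1", "d5", "d3"]   # nearest-neighbor ranking
fused = rrf([keyword_hits, vector_hits])
print(fused[0])  # "d1": strong in both lists rises to the top
```

Documents that appear high in both rankings accumulate the most score, which is exactly the behavior hybrid search wants before a final precision-oriented re-rank.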
Early RAG pipelines often chunk text, embed, and call it done. But more thoughtful pipelines do something closer to feature engineering, right?
Absolutely. Thought goes into what you feed the embedding model. For example: add a document- or section-level summary alongside each chunk before embedding; include multimodal features—artistic descriptions, literal captions, tags; create multiple embedding columns (e.g., different prompts/modalities) and search across them with re-ranking. High-quality retrieval requires feature-engineering-like decisions before embedding.
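The ideas above can be sketched as a tiny preprocessing step. The `embed` function here is a throwaway stand-in for a real embedding model, and the column names are hypothetical; the point is the shape of the pipeline, not the model.

```python
# Feature engineering before embedding: each chunk gets a document-level
# summary prepended, and the table carries multiple embedding columns.
def embed(text: str) -> list[float]:
    # Toy bag-of-characters "embedding" (stand-in for a real model)
    vec = [0.0] * 8
    for ch in text.lower():
        vec[ord(ch) % 8] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

doc_summary = "Quarterly churn analysis for enterprise accounts."
chunks = [
    "Churn rose 2% among accounts with fewer than five seats.",
    "Renewal rates improved after onboarding changes.",
]

rows = []
for chunk in chunks:
    rows.append({
        "text": chunk,
        # Column 1: chunk embedded with the summary prepended for context
        "vector_with_summary": embed(doc_summary + " " + chunk),
        # Column 2: chunk embedded alone; searched separately, then re-ranked
        "vector_plain": embed(chunk),
    })

print(len(rows), len(rows[0]["vector_with_summary"]))
```

At query time you can search each embedding column and re-rank across them, as described above.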
Let’s talk vector file formats (Lance) and vector databases (LanceDB). My crude belief: a vector database is a standard database with additional indexes. True?
Not wrong, but my hot take: with Lance and LanceDB, we’re building a lakehouse for multimodal data that includes vectors. Many “vector databases” are optimized only for vectors and struggle with other data types and workloads. The category needs to evolve—either toward new-generation search engines or new-generation lakehouses. We set out from day one to build the broader lakehouse, not just a vector index.
Outline your AI-enabled data lake vision. I’m familiar with Snowflake and Databricks’ lakehouse. How do you see the world differently?
We assumed everyone would use Parquet and tried for months to support AI workloads—search, training, preprocessing—on it. We couldn’t make it work well. Talking to computer-vision and ML practitioners, no one had something effective. That gave us confidence to build a new format.
In AI you manage vectors, long documents, images, and videos. The first problem is storage. With Parquet, mixing wide blob columns with narrow metadata columns leads to out-of-memory issues due to row-group design. If you shrink row groups to fit blobs, read performance tanks.
Even once data is in Parquet, AI needs random access and secondary indexes. Parquet doesn’t support efficient random row access: retrieving scattered rows forces reading entire row groups. With media, that’s prohibitively expensive—both for search and for training (e.g., global shuffle). Data evolution is also hard: with table formats like Iceberg, backfills often mean copying entire datasets. Copying petabytes of media is a non-starter. These issues motivated Lance.
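The read-amplification problem can be put in back-of-envelope terms. The numbers below are illustrative, not measurements of any particular Parquet file: fetching a handful of scattered rows forces decoding every row group they fall into.

```python
# Model of Parquet random access: scattered rows force whole-row-group reads.
ROWS = 1_000_000
ROW_GROUP_SIZE = 100_000                 # rows per row group
wanted = list(range(0, ROWS, 50_000))    # 20 rows scattered across the file

groups_touched = {r // ROW_GROUP_SIZE for r in wanted}
rows_decoded = len(groups_touched) * ROW_GROUP_SIZE

print(len(wanted), rows_decoded)  # 20 rows wanted, 1,000,000 rows decoded
```

A 50,000x decode amplification is tolerable for narrow metadata columns; with multi-megabyte image or video blobs per row, it is prohibitive, which is the motivation for Lance's random-access design.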
I have a good mental model of Parquet with structured data. With images or video, do you put them in blob columns?
Yes. We use Apache Arrow types. Images/audio/video are large binary columns. Vectors are fixed-width list columns (e.g., 1,536-dimensional). But Parquet’s row-group mechanics and lack of random access make these workloads painful.
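The advantage of a fixed-width vector column can be shown with plain NumPy: because every row has the same byte width, row `i` lives at a known offset, so a single vector can be read without decoding anything else. This is an analogy for the Arrow fixed-size-list layout, not Lance's actual on-disk format.

```python
import os
import tempfile

import numpy as np

# Fixed-width vector storage: row i sits at byte offset i * DIM * 4,
# so random access needs no row-group decode.
DIM = 1536
N = 10_000
vectors = np.random.default_rng(0).random((N, DIM), dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
vectors.tofile(path)  # contiguous float32, N * DIM * 4 bytes total

# Memory-map and pull one row by offset, without touching the rest
mm = np.memmap(path, dtype=np.float32, mode="r", shape=(N, DIM))
row = np.array(mm[4242])
```

Variable-width blobs (images, video) can't use this trick directly, which is why a format serving both needs a more careful layout than Parquet's row groups provide.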
So Lance was the first thing you built. It has solid traction on GitHub. Who uses a file format—users or vendors?
Both. Frontier labs use Lance to store training data—e.g., for image/video generation—replacing stacks like TFRecords, WebDataset, Parquet, and BigQuery. Large tech companies and vendors also build on Lance: Databricks, Tencent, Alibaba, Netflix, NVIDIA, Uber, among others.
Databricks uses Lance?
For parts of their AI-specific offerings.
You’ve raised several rounds—the format is Apache-2 licensed. How do you commercialize?
Our commercial offering is a data platform for large-scale AI production: vector search, data preprocessing, training/serving cache, and an analytics engine for curation and exploration. It supports ML training workflows and AI application development, solving the hard distributed-systems problems along the path. We partner closely with big vendors; we’re generally not competitive because goals and customer bases differ. Cloud providers seek platform consumption; we focus on an AI-optimized data platform for specific workloads and users.
The commercial product is called LanceDB, but you prefer to position it not just as a database.
Right—we’re an AI-native data platform/lakehouse for multimodal data, with Lance as the common format.
How does this space play out over the next two to three years?
Two big predictions. First, multimodal will be 100× bigger—more usage and more data. Audio is exploding; video generation is resurging; robotics is next. Second, our data infrastructure isn’t ready for agents driving search and retrieval.
Let’s unpack both. On multimodal: unlike structured analytics, where every company needs it, multimodal workloads seem concentrated. Do all enterprises really need this?
I think every enterprise becomes multimodal. Take insurance: tons of documents to digitize, extract, search, and analyze; drones capturing images/video to assess risk and improvements over time. Existing businesses become more efficient; AI-native entrants gain structural advantages. Multimodal data underpins both.
It’s a heavy lift. Will every Fortune 500 insurer build these capabilities in-house, or will vendors package them?
Likely both—just like analytics engineering emerged as a role, with adjacent talent re-skilling. We see the same with AI engineering.
What titles are hands-on with your product?
AI researchers and AI engineers. Many app developers building AI features now carry the “AI engineer” title.
On agents: how do their access patterns change platform requirements?
RAG was one-shot: ask, retrieve, answer. Agents iterate: they decompose problems into sub-questions, refine queries and results, and run many steps in parallel. Load skyrockets—humans type slowly; agents can issue hundreds of queries simultaneously. Queries are more varied and selective, and agents are creative in combining modalities and sources: schemas, SQL over structured data, prior analyses and charts, document stores, image/video metadata, etc.
Traditional vector databases aren’t designed for this breadth and scale. If you bolt together multiple specialized systems, your “agent stack” balloons into a maintenance nightmare. Our approach: put all data in one place with a single system that supports vector search, keyword search, filters, key-value lookups, re-ranking, analytics, and efficient random access—on top of an AI-native file format (Lance).
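The "one table, many access paths" idea can be sketched with an in-memory stand-in. Everything here is hypothetical (toy rows, two-dimensional vectors, a hand-rolled query function) and is only meant to show an agent mixing a filter, a keyword predicate, and a vector ranking in a single call.

```python
import math

# One table, three access paths: filter, keyword search, vector search.
table = [
    {"id": 1, "text": "invoice for customer renewal", "year": 2024, "vec": [0.9, 0.1]},
    {"id": 2, "text": "drone survey of roof damage",  "year": 2023, "vec": [0.1, 0.9]},
    {"id": 3, "text": "customer support transcript",  "year": 2024, "vec": [0.8, 0.2]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def query(rows, *, where=None, keyword=None, vector=None, limit=3):
    hits = [r for r in rows if (where is None or where(r))]   # structured filter
    if keyword is not None:
        hits = [r for r in hits if keyword in r["text"]]      # keyword predicate
    if vector is not None:                                    # semantic ranking
        hits = sorted(hits, key=lambda r: cosine(vector, r["vec"]), reverse=True)
    return hits[:limit]

# An agent combines all three modalities in one query
out = query(table, where=lambda r: r["year"] == 2024,
            keyword="customer", vector=[1.0, 0.0])
print([r["id"] for r in out])  # [1, 3]
```

In a bolted-together stack, each of those three clauses would hit a different specialized system; the argument here is that they belong in one engine over one copy of the data.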
For listeners whose curiosity is piqued, any resources you recommend?
Chang She: Yes—our blog series by Weston Pace, the tech lead for the Lance format. It dives into encodings, I/O, and has great reads for analytics engineers: lancedb.com/blog.
Chapters
00:00 – Intro: Analytics meets AI
03:20 – Chang’s background and how pandas began
06:40 – Lessons from Cloudera and metadata
08:30 – Multimodal data and LanceDB’s origin story
10:00 – Why vector search matters (beyond RAG)
12:00 – What are vectors and why do we use them?
15:00 – Full-text vs vector search
18:00 – Feature engineering in AI use cases
21:15 – Lance format
28:00 – Storage, scale, and the problem with Parquet
35:30 – Building a business on open source
41:00 – Two big bets: multimodal data and agents
46:00 – Every company will become multimodal
50:00 – Agent access patterns will redefine data
54:00 – Why dbt-style workflows matter now more than ever
This newsletter is sponsored by dbt Labs. Discover why more than 60,000 companies use dbt to accelerate their data development.