Ep 30: The Personal Data Warehouse (w/ Jordan Tigani of MotherDuck)
Flipping the vision for "data apps" on its head: what if, instead of having data apps make round trips to a cloud data warehouse, we just bring the user's data to their machine?
Jordan Tigani is an expert in large-scale data processing, having spent a decade+ in the development and growth of BigQuery, and later SingleStore.
Today, Jordan and his team at MotherDuck are in the early days of working on commercial applications for the open source DuckDB OLAP database.
In this conversation with Tristan and Julia, Jordan dives into the origin story of BigQuery, why he thinks we should do away with the concept of working in files, and how truly performant “data apps” will require bringing data to an end user’s machine (rather than requiring them to query a warehouse directly).
Show Notes
Key points from Jordan in this episode:
BigQuery is easy to set up for a side project and is also able to process enormous amounts of data quickly. What's the origin story there? Which of those were you better at first, and did you have to get really good at the other muscle over time?
It was during the era when big data was the big buzzword of the day. So we wanted to make sure that BigQuery worked really well for big data; that's what we saw as the goal, and that's why it was called BigQuery.
It had separated storage and compute, it was serverless, and it had a lot of things that became popular later. But to the question about what size we were targeting: we knew we wanted it to work on very large datasets, and that was what we mostly advertised, but we also wanted it to work on smaller datasets as well. I used to call it "any size data" because how well it works is really independent of data size.
And, for a long time, your queries would return in a couple of seconds, whether it was a hundred terabytes or a hundred megabytes; everything took the same amount of time. But all large datasets start out as small datasets, and pretty much all large datasets are generated over time. We wanted to get people in when their datasets were small, because some of those datasets were going to grow much larger.
And I think it was surprising to us how many people had small data and continued to have small data. We loved having those users, and some of the most outspoken proponents of BigQuery were those people. There was one guy who'd brag, "I spend only $4 on BigQuery a month and I get all of this value out of it."
That's actually super exciting to me, and it has remained one of the things I've been fascinated by. You can think of BigQuery as two products: it was BigQuery for crunching giant datasets, but it was also "easy query". You just walk up to it, load your CSV file, run your queries, and you don't have to pay a whole lot of money.
And as the industry evolved, the former, the big-data side of BigQuery, was emulated, or at least has shown up other times; there are other people doing large-scale, parallel data processing. But I don't really think anybody has done easy query, and that's surprising to me.
Athena is trying to do something similar. But other than that, I don't think anybody is really building this sort of "just walk up and run your query" experience: something that's always running and always available, but where you only pay for what you run.
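To make that "walk up, load a CSV, run your queries" flow concrete, here's a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and CSV columns are illustrative placeholders rather than anything discussed in the episode:

```python
from google.cloud import bigquery

# Assumes Google Cloud credentials are already configured (e.g. via `gcloud auth`).
client = bigquery.Client()

table_id = "my-project.my_dataset.events"  # hypothetical destination table

# Load a local CSV, letting BigQuery autodetect the schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
with open("events.csv", "rb") as f:
    load_job = client.load_table_from_file(f, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete

# Run an ad-hoc query; on-demand pricing bills only for the bytes scanned.
sql = f"""
    SELECT user_id, COUNT(*) AS event_count
    FROM `{table_id}`
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
"""
for row in client.query(sql).result():
    print(row.user_id, row.event_count)
```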
But BigQuery was also built with native integration to other services in the Google suite. What do you think the boundaries are of what a warehouse should or should not do? And are there places where you'd say BigQuery is not good for that use case?
When BigQuery was started, I don't think any of the people working on it, or almost any of them, had any idea what a data warehouse was or why it was interesting. If you'd said we were going to build a data warehouse, they would have said, "Alright, I'm going to change teams. I'm going to go work on something else, because I don't want to build a data warehouse. That sounds really dull."
So we built what we thought was useful and what we thought was going to be good for users. We saw there was a need to process data, query data, and have something you could do with your data while it's in the cloud, without having to move it locally. It was only several years later that somebody said, "If we just added a couple of features, it would be a data warehouse, and we'd be able to compete for data warehouse workloads." We were like, "Oh yeah, that's a great idea. Data warehouse: there's a lot of opportunity here."
But I think we always thought of ourselves as something more than just a data warehouse. What's interesting is that if you draw a Venn diagram of needs, data needs are much broader and more encompassing than what is traditionally a data warehouse. The box you draw around data warehousing is really the box you draw around what the technological capabilities were 20 years ago. So when you break down some of the barriers that constrained what you could do in a data warehouse, like the separation of storage and compute and sharing across users and organizations, it turns out the need goes further and is much, much broader than a data warehouse.
I gave a talk at Strava a few years ago called "Data warehousing is not a use case". The idea is that you should start from the use case, from what people care about, and what people care about is actually a lot broader than just data warehousing. But data warehousing is also a nice shorthand for, "Hey, I have some data, I need to ask questions of it, and I want to display the answers in a chart or a report." Data warehousing is a nice way of understanding the system that's going to enable you to do that.
What are your thoughts on data apps? What needs to happen in order for data apps to really become more mainstream?
So I think it's a great question. Just last night I was at a Data Council dinner and I was talking to somebody who said, "I have a startup and we're trying to handle data apps." As he elaborated, it was all about machine learning and bringing machine learning to users, and I'm like, "Wow, that's not what I thought of as a data app." I thought a data app was something else. And it's not that this person was wrong; it's an overloaded term, and everybody applies their own lens to it.
My particular lens, which comes from some of the patterns I've seen, is best summed up the way somebody once described it to me: more and more services these days are collecting more and more data on behalf of their users, and they want to share that data back with their users. Sharing data with users basically means some sort of analytics. If you have more than a handful of fields, you want to be able to graph it, chart it, slice it, and let people do drill-downs. I think this is becoming more and more common, and it shows up in a bunch of different ways.
One problem around data apps is that when somebody asks, "What's the prototypical data app?", it's hard to say what it is, because it shows up in so many different places. An obvious one might be a financial app where everybody wants to chart their portfolio or see how various stocks have done over time.
But I was also talking to somebody doing web-based CI/CD, and they want to show their users: how long do your builds take? How many pull requests? How many bugs are found? How many unit tests? All these things people can slice and dice, that's analytics.
And so the next part of your question was: do we have the tools we need to handle that? I think in general the answer is no, or at least the tools we might reach for first are probably not the right set of tools. I don't want to pick on Snowflake, but in BigQuery, for example, concurrency can be limited, and I think concurrency is limited because the hardware involved is limited. In Snowflake's case, when you hit high concurrency it lets you expand the hardware you use, but of course that costs you a lot of money; if you end up having to spin up multiple warehouses, you're setting money on fire at an ever-increasing rate.
So I think in general, the kind of data warehousing systems are not really designed for the kind of concurrency that you would expect for a user-facing application.
Latency is the other one. In an end-user-facing application, users are used to virtually instantaneous experiences, and various studies have shown that for every hundred milliseconds of latency, some of your users are just going to get bored and go do something else, or at least have a measurably worse experience.
So I do think you need different tools. The question is: should the analytics tool be the same as your operational, transactional database? That's certainly what we were trying to say: "Hey, why have multiple databases when you could just have one database?" And I think this is actually a use case where HTAP (hybrid transactional/analytical processing) does make sense.
For database people, HTAP is often seen as the holy grail, but in general, trying to combine your transactions and analytics is not a burning problem. There are basically two different buyers: the application developer is generally in charge of selecting the application database, and the analytics team, the data team, the CIO or the CDO, is in charge of selecting the analytics stack.
If you try to sell one database to both of those people, it's going to be difficult. But I do think that in the data app case, having one low-latency database you can do transactions on and analytics on certainly makes it easier, especially if you can do the analytics on the latest data. The other tricky part is what users expect: they make a change, they do something different, and they expect that to show up in the analytics. In Strava, if you go for a run and you want to see your fastest run times, or the fastest people on a certain segment, you want that run to show up there, and if it doesn't, you're going to be annoyed.
One thing that's often useful, especially in a data app, is that each one of the end users of the data app has a separate, segmented dataset, and all of their data, all of their queries, and all of their operations are over that subset of the dataset.
We talked to somebody who was running something like 10,000 RDS instances so that each one of their users could get a separate RDS instance. In general, that's prohibitively expensive. So what people will often do instead is run one big RDS instance, or one big database, for their largest customers, and then pack the long tail of small customers into another subset of instances. That's a lot of work, and a lot of manual work.
So I think there's an opportunity to give people something that can scale to the number of users they have and give each one of them an isolated environment that scales to the size of that user.
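As a rough sketch of that per-tenant segmentation idea, here's what it might look like with DuckDB in Python; the one-database-file-per-tenant layout, paths, and table names are assumptions for illustration, not a description of any particular product:

```python
import os
import duckdb

def tenant_connection(tenant_id: str) -> duckdb.DuckDBPyConnection:
    """Open (or create) an isolated DuckDB database file for one tenant."""
    os.makedirs("tenants", exist_ok=True)
    return duckdb.connect(f"tenants/{tenant_id}.duckdb")

con = tenant_connection("acme")

# Materialize only this tenant's slice of a shared export.
con.execute("""
    CREATE TABLE IF NOT EXISTS events AS
    SELECT * FROM read_parquet('exports/events/*.parquet')
    WHERE tenant_id = 'acme'
""")

# Every subsequent query, by construction, runs over that subset only.
top_pages = con.execute("""
    SELECT page, COUNT(*) AS views
    FROM events
    GROUP BY page
    ORDER BY views DESC
    LIMIT 5
""").fetchall()
print(top_pages)
```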
Do you want to spend a little bit of time describing what WebAssembly is?
WebAssembly was, I guess, originally designed to let you run arbitrary code in the browser by compiling it down and running it in a virtual machine. It's been built well enough that you can get near bare-metal performance out of it, or at least it's trending in that direction. You can really run almost anything in the browser. And then it turned out to be useful for other cases as well.
SingleStore, for example, is using it server-side in the database, so you can run arbitrary code in the database. Some databases have JavaScript (BigQuery had JavaScript), and other people are doing Python; it's easier to do some of these interpreted languages. But it's quite hard to do it in a way where you can run virtually any machine code, C or C++ code, and do it in a way that is both performant and secure. Security is obviously quite tricky.
There are some things in WebAssembly that are still being figured out, like what's the right way to access data and pipe data through it. But there are a lot of really smart people working on it, and I think it's really compelling as a way to sandbox computations. I expect to see it showing up in all sorts of interesting places in the future.
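For a flavor of that sandboxing property, here's a minimal sketch that runs a WebAssembly module server-side using the wasmtime Python bindings; wasmtime is my choice of runtime here, not one named in the episode, and the module itself is a toy:

```python
from wasmtime import Store, Module, Instance

store = Store()

# A tiny WebAssembly module (written as WAT text) that exports an `add` function.
module = Module(store.engine, """
  (module
    (func (export "add") (param i32 i32) (result i32)
      local.get 0
      local.get 1
      i32.add))
""")

# Instantiate with no imports: the guest code can only compute over what
# the host explicitly passes in, which is the sandboxing property described above.
instance = Instance(store, module, [])
add = instance.exports(store)["add"]
print(add(store, 2, 3))  # prints 5
```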
We have SQLite as the transactional database that runs really well browser-side, but there's not a good complementary analytical database that can do that browser-side. And this is kind of the promise of DuckDB, which runs on a single node. Can you give us a better sense of this approach?
Absolutely. In BigQuery, one of the mantras we had was: when data is big, you want to move the compute to the data rather than the data to the compute. And that drove a lot of our design decisions.
But it turns out, and I think as you were mentioning, you use BigQuery but have relatively small data sizes, that most people don't actually have big data. Most people's dataset sizes would actually fit on a single machine. By default in BigQuery, you get 2,000 slots, roughly 2,000 cores' worth of compute, and for something like 99.9% of BigQuery users, basically everything fits within those 2,000 slots. But that's about a half-million dollars' worth of BigQuery a year.
So all that computing power could fit on a single machine. Everybody looks at scale-up databases as so old school; all the cool kids are doing scale-out these days. But it turns out you really don't need scale-out. Scale-out adds so much more complexity, latency, and all sorts of things you have to deal with that make your life so much harder. Whereas if you scale up, you can make something much, much simpler.
And I think DuckDB is a great example of how you can have a really fast, really robust analytics system on a single node.
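For a sense of what single-node analytics looks like in practice, here's a minimal DuckDB sketch in Python; the Parquet file and its columns are placeholders:

```python
import duckdb

# Query a local Parquet file directly: no cluster, no load step.
# DuckDB parallelizes the scan and aggregation across local cores.
duckdb.sql("""
    SELECT date_trunc('day', created_at) AS day,
           COUNT(*)                      AS orders,
           SUM(amount)                   AS revenue
    FROM 'orders.parquet'
    GROUP BY day
    ORDER BY day
""").show()
```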
More from Jordan:
You can also find him on LinkedIn and on Twitter at @jrdntgn.