Ep 49: The State of Databases Today (w/ Andy Pavlo - CEO of OtterTune)
The evolution and future of databases
Andy Pavlo is a professor of databaseology (he says it's a made-up word) at Carnegie Mellon and currently on leave to build his own company—OtterTune, which uses AI to figure out the settings to get the best performance out of databases.
He is one of the preeminent minds on databases and a die-hard relational database maximalist, and he joins Tristan and Julia to talk about the state of databases today, why there are so many specialized databases (and if we need so many), why tuning databases is so hard but important, and how the database landscape will evolve.
And check out the new podcast look!
Listen & subscribe from:
Key takeaways from this episode:
There are a lot more databases now that focus on specialized workloads. Is that an accurate understanding of the last 20 years? Or have there always been a ton of specialized databases and I was just never aware of them?
That's a good question. I have this hypothesis in my head. The evidence supports the statement you're making. We are in what one could call the golden era of databases, where there are so many different offerings, and so many different choices that are specialized or tailored to different application domains, operating environments, and problems.
I know a lot of the older database academics. I worked with Mike Stonebraker during my PhD, David DeWitt, and others. It seems like there are more systems now, but back in the seventies and eighties, it was just as diverse as it is now. Those systems simply died out, and we don't know about them.
The sense I got from them was that there seems to be way more now, but there have always been a bunch of different databases beyond the big three. In the 80s, it was Oracle and Ingres. In the early 80s, you had Sybase and Informix. A bunch of systems popped up that listeners have never heard of. Again, these things fizzled out or died over time.
The 90s were a bit of a, I don't want to say the doldrums; there was activity, but to your point, there was calcification around Oracle and DB2, and then Microsoft was getting SQL Server up and running, while Informix and Sybase were still very big. The big difference now versus back then is that obviously we have the internet. We have things like GitHub where you can put out open-source projects and people start using them.
That allows systems to get traction by reaching as many people as possible. In the old days, how did you discover a new database system without a marketing team pushing it out there? Now you just download these off GitHub.
The internet has helped accelerate this proliferation. Now, I don't think it's going to last forever. For someone who builds database systems or wants to teach people to build database systems, it's kind of ironic when you say, "Hey, I think there are too many database systems." But there are a lot. I don't know whether the market can sustain so many.
I might be wrong, but I feel like there's been a lot of money being thrown around and there are a lot of companies being propped up that may or may not survive. There has always been churn.
It's a huge market, so there are always going to be people trying to build new systems. Databases are an interesting category because they're critical pieces of infrastructure and critical pieces of software; there's a lot of money and there are people trying out new things. If you were a new company today and said, "I'm going to build something to replace Linux," that would be a huge uphill battle. It would hardly be possible to get traction.
I'm wondering if over time one of these systems becomes the Linux of databases; it'd be hard to replace that. Then with these new upstarts, new database startups, new projects, we won't see as many come out as we see today. But building a database system, at least getting something up and running very quickly, is a much easier task than building a new operating system. It doesn't mean you did a good job on that first version you wrote, but there are enough tools now that you could build something fairly quickly.
It seems like there's a long continuous line of evolution of the relational model stretching back to the seventies. There are offshoots where these databases handle something that the relational model at the time did not handle. Then SQL and the relational model evolve to handle that thing, and it gets gobbled back up by the relational database world. Is that true?
Andy Pavlo: Yes. The big change was in the 1980s. One of Stonebraker’s innovations was the object-relational model—the idea that you could have more complex data types in your relational model database. That's the core of what Postgres is today.
UDTs, or user-defined types, are part of that. At a logical level, the relational model is the right way to go. The next question is, "Is a single implementation of an object-relational database system the best way to do it?" That's certainly borne out by the update of the SQL standard in 2023: it now has native support for property graph queries, and it has had support for multidimensional arrays. SQL just keeps doing more and more. The only case where I would think you would not want to use a relational system, or take something like Postgres and contort it to store this kind of data, is multidimensional arrays.
That goes back to row stores versus column stores. The execution engine in a row store is designed to scan horizontally across records, whereas a column store is meant to scan vertically down columns. But with a multidimensional array, access goes in this weird multidimensional direction, and it's hard to take a relational system and build that on top of it.
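The row-versus-column distinction above can be sketched in a few lines. This is a toy illustration in plain Python (the table, values, and names are all made up, not from any real system): the same records stored as tuples (row-wise) and as per-attribute lists (column-wise), showing how each layout favors one scan direction.

```python
# The same toy table in two physical layouts.
rows = [
    (1, "sensor_a", 21.5),
    (2, "sensor_b", 19.8),
    (3, "sensor_a", 22.1),
]

# Row store: a scan walks horizontally, touching each whole tuple
# even when only one attribute is needed.
readings_row = [r[2] for r in rows]

# Column store: each attribute lives in its own contiguous sequence,
# so scanning one column never touches the others.
ids, sensors, readings = zip(*rows)
readings_col = list(readings)

assert readings_row == readings_col == [21.5, 19.8, 22.1]
```

A multidimensional array (say, readings indexed by latitude, longitude, and time) has no single natural scan direction, which is why neither layout is an obvious fit.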
Julia: What kind of real-world data is best represented in multidimensional arrays?
Andy Pavlo: A common category would be geospatial. A lot of the scientific data looks like this, anything that's gridded.
You have a satellite image, the time the image was taken, the sensor reading, and the longitude and latitude coordinates. Again, you can store this using PostGIS, but maybe a specialized array database would be better for it. And again, you can put SQL on top of that.
Why are databases so complex? Do you think that they're going to simplify in time?
I would say that for any existing system today that has a bunch of knobs, those knobs aren't going away, because it would be a major engineering effort to get rid of them. Just to be clear, when we refer to a knob, we mean a configuration parameter that changes the behavior of the system, one you tune for how you think the application is going to use the database.
Snowflake has done a really good job of hiding a lot of complexity. Internally, they've told me they have hundreds of knobs. So the problem doesn't go away; it's just a question of who has to deal with it. The reason these knobs exist is that when the database developers were building the system, at some point, whatever problem they were working on, they had to make a decision about how to do something.
How much memory do you allocate for a hash table? How long do we wait before writing out to disk? Rather than putting a #define in the source code that's there forever, they expose it as a knob, because they assume someone else will come along at a later point and know how to set it correctly.
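A concrete example of such a knob, using Python's built-in SQLite driver (chosen here just for illustration, not because it's discussed in the episode): SQLite exposes knobs as PRAGMAs, and `cache_size` is exactly the "how much memory do we allocate?" decision the developer deferred to the user.

```python
import sqlite3

# SQLite exposes configuration knobs as PRAGMAs.
conn = sqlite3.connect(":memory:")

# Read the shipped default, then override it for this connection.
default_pages = conn.execute("PRAGMA cache_size").fetchone()[0]
conn.execute("PRAGMA cache_size = -8192")  # negative value = size in KiB
tuned_pages = conn.execute("PRAGMA cache_size").fetchone()[0]

# The developer picked a default; the user is expected to know better.
print("default:", default_pages, "tuned:", tuned_pages)
conn.close()
```

Multiply this by hundreds of parameters across memory, I/O, and concurrency, and you get the tuning problem described here.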
Of course, that doesn't always happen. If I were building a new system today, I would try very hard not to expose any knobs. You're better off spending the engineering effort to make the system adaptive: start with a default value, and then over time have machine learning adjust it as queries run or something happens. But again, that's an engineering effort that takes time away from adding new features.
The knobs are usually the cop-out. When I started at Carnegie Mellon, I was trying to think of a new project to work on, and auto-tuning for databases is an old problem that goes back to the 1970s.
I was looking for a different way to approach this problem by applying machine learning; machine learning is obviously very big at Carnegie Mellon. We focused on knobs because there had not been a lot of work in this area. It really did not become a problem until the late 2000s or so.
The way the major database vendors have tried to solve this is with ad hoc, rule-based tools that, if you give them some basic information, spit out some parameters. We thought, "Can we use machine learning to refine this process and develop more customized configurations for different databases?" The bigger thing I wanted to do was have the tool learn from the experience of tuning other databases and apply that knowledge to new databases. With all the tools the vendors had before, you tune one database, and then you tune the next one starting from scratch all over again.

As part of this, we were also trying to figure out a way to run these experiments on real-world databases without having to access the data or the queries, which, as a grad student, was impossible to get because no company would share that.
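The "tune from scratch" baseline Andy contrasts with can be sketched in a few lines. This is a deliberately minimal sketch with an invented cost function, not OtterTune's actual algorithm: each database gets its own search over the knob, and nothing learned here carries over to the next database, which is exactly the waste the learned approach targets.

```python
# Hypothetical cost model standing in for "run the workload and
# measure latency"; in this toy setup, 512 MB is the sweet spot.
def measure_latency(buffer_mb: int) -> float:
    return abs(buffer_mb - 512) / 100.0 + 1.0

def tune_from_scratch(candidates):
    """Baseline tuning: try each candidate setting against THIS
    database's workload and keep the best. Nothing reusable is
    learned for the next database."""
    return min(candidates, key=measure_latency)

best = tune_from_scratch([64, 128, 256, 512, 1024, 2048])
print(best)  # 512 under this toy cost model
```

The learned approach replaces the exhaustive per-database search with a model trained on prior tuning sessions, so a new database starts from informed guesses rather than from zero.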
Looking 10 years out, what do you hope will be true for the data industry?
Andy Pavlo: I hope I'm wrong that there's a major market consolidation for database companies, but we'll see about that.
Tristan Handy: Wait, you don't want there to be consolidation or you do?
Andy Pavlo: The more databases, the better, because they can hire my students. A lot of interesting ideas come out of these different companies, so it's better for everyone. But there's a report from Gartner that says the major consolidation is coming in 2025 because the VC market is tighter these days. We'll have to see.
Everyone is sort of chasing after that Snowflake IPO. Databricks is going to IPO; it's a matter of when, not if. Everyone's looking to be the next one. I don't know whether that'll be the case. I don't think the market can sustain this many companies.
Going back to my maximalist position, I don't see SQL being dethroned or the relational model being replaced by anything. I have an outstanding bet: someone told me on Hacker News that the graph database market was going to overtake the relational database market in 10 years.
That's not going to happen. I promise, if I'm wrong, I'll make a shirt that says graph databases are the best, or something, and I'll use that as my official university photo until I die. One thing we didn't touch on, but that I think is going to get really interesting for databases, is hardware.
I don't necessarily mean GPUs or FPGAs, but I think there has to be something else, something I don't know yet, that could come along and require us to rethink how we build database systems.