Ep38: A romp through database history (w/ Postgres co-creator Mike Stonebraker + Andy Palmer)
How did datetime as a type come to be? And other factoids to use at your next data meetup.
Mike Stonebraker is a veritable database pioneer and a Turing Award recipient. In addition to teaching at MIT, he is a serial entrepreneur and co-creator of Postgres.
Andy Palmer is a veteran business leader who serves as the CEO of Tamr, a company he co-founded with Mike. Through his seed fund Koa Labs, Andy has helped found and/or fund numerous innovative companies in diverse sectors, including health care, technology, and the life sciences.
In this conversation with Tristan and Julia, Mike and Andy take us through the evolution of database technology over 5+ decades.
They share unique insights into relational databases, the switch from row-based to columnar databases, and some of the patterns of database adoption they see repeated over time.
Key points from Mike and Andy in this episode:
It was the really early 1970s when you started working on one of the first relational databases, called Ingres. And this was before the time of Oracle, when there were only a couple of universities doing research on relational databases. Tell us a little bit about what Ingres was, why it was exciting, and whether you had a sense that you were onto something big at the time.
A bit of context. I got a Ph.D. from the University of Michigan in 1971 and got hired by Berkeley as an assistant professor. And the ground rules for being an assistant professor are that you're given five years to prove that you're a big shit. And they either fire you or give you a lifetime appointment.
And so all assistant professors I know of, including me, are on a treadmill to do something great. That was right after Ted Codd wrote his pioneering paper in CACM that basically proposed a relational database system. So we read everything that Ted Codd had written, and the competition was a thing called the CODASYL Report, and I could not understand it.
It was this complicated network mess. And my colleague Gene Wong and I said the really obvious thing to do is to build a relational database system and prove out Ted Codd's ideas. Now, neither of us had any experience building big software projects or database systems. So you just start doing it.
And so we just started building Ingres, and I think the whole goal was to get tenure. That was the overwhelming need here. And I guess the other thing that happened was that, as you mentioned, other groups had the same idea, and a lot of people put in the first 90% of the effort to get something that they could run, for God knows what reason.
Gene Wong and I put in the other 90% of the effort to get something that actually worked. And Ingres was the only publicly distributed relational database system in the mid-seventies. And it was at the right point, at the right time. So we were just very lucky. But I think there wasn't any big design.
And I got tenure!
After Ingres, you had another even bigger hit, Postgres. I actually never realized the name Postgres came from Ingres; I just took it for granted as something that we all use today. It's currently one of the top five most-used databases. I'd love to hear some of the early-days stories, or some of the things that few people know about Postgres from when it was getting started.
What actually happened was by 1982, commercial Ingres was wildly better than academic Ingres, and they had 20 programmers developing it.
And it was pretty obvious that there was no possible way that the academic project could compete. So we had to do something. And that was about the same time Ingres actually stood for something besides a French painter. It stood for the Interactive Graphics and Retrieval System.
And graphics was in the name because one of my colleagues down the hall was interested in building a GIS system on top of Ingres. We tried to do that, and it failed miserably. And it failed miserably because one of the things you want to do is point-in-polygon, and writing point-in-polygon in SQL is really hard and really slow. So it was pretty clear that once you got outside business data processing, people wanted geographic types, medical types, all kinds of types.
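For context on why this mattered: the point-in-polygon test Mike mentions is a few lines in a general-purpose language but painful to express in a query language. Here is a minimal ray-casting sketch in Python (an illustration, not from the episode or from Ingres):

```python
def point_in_polygon(px, py, polygon):
    """Ray-casting test: count how many polygon edges a rightward ray
    from (px, py) crosses. An odd count means the point is inside.
    `polygon` is a list of (x, y) vertex tuples."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

# A unit square: (0.5, 0.5) is inside, (1.5, 0.5) is outside
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```

Expressing that loop and its running tally in a declarative query language is the "really hard and really slow" part he's describing, which is what motivated extensible types and functions.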
And this was all brought home to me when one of the early commercial customers of Ingres called me up one day. This was right after Ingres had put in date and time as a data type, at the request of all kinds of people. And he said, you did date and time wrong.
And I said, huh? We implemented it the way ANSI had in mind. What do you mean we did it wrong? And so, after the conversation, it turned out that he was in charge of a bond application on Wall Street. And, for whatever reason, in his world his kind of bond paid the same amount of interest during each month, regardless of how long the month was.
So when he subtracted February 15th from March 15th, that's one month, not 28 days. April 15th minus March 15th? That's one month too. He was running on a calendar with equal-length months, each of 30 days. And so he needed to subtract the time you bought the bond from the time you sold it. He wanted to do that subtraction in the database, but it got the wrong answer. It got Julian calendar time, not his kind of bond time. So he said, why can't I overload subtraction to work the way I want? And, of course, you couldn't do that with Ingres.
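The overloaded subtraction the customer wanted is essentially the 30/360 bond day-count convention. A toy Python sketch (a hypothetical illustration; `BondDate` is not anything from Ingres or Postgres) shows the difference between calendar subtraction and bond subtraction:

```python
from datetime import date

class BondDate:
    """Wraps a calendar date but subtracts using the 30/360 bond
    convention: every month counts as exactly 30 days, so Mar 15 minus
    Feb 15 is 30 days (one month), not 28. This is the overloaded '-'
    the Wall Street customer was asking for."""
    def __init__(self, y, m, d):
        self.y, self.m, self.d = y, m, d

    def __sub__(self, other):
        # Difference in "bond days" on a 360-day year of twelve 30-day months
        return ((self.y - other.y) * 360
                + (self.m - other.m) * 30
                + (self.d - other.d))

# Calendar subtraction says 28 days; bond subtraction says 30 (one month)
calendar_days = (date(1985, 3, 15) - date(1985, 2, 15)).days  # 28
bond_days = BondDate(1985, 3, 15) - BondDate(1985, 2, 15)     # 30
```

Postgres's answer was to let users define their own types and overload operators on them in the database itself, so this arithmetic could live next to the data.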
Ingres wasn't built to have a flexible type system. And that got us to prototype the type system that's in Postgres. The fact that we had to do something different than Ingres was why we started building Postgres. So we threw the public-domain version of Ingres over the cliff and started working on Postgres. We'd spent years building Ingres in C, and the last thing anybody wanted to do was build another database system in C. C++ wasn't ready. You couldn't imagine building a database system in COBOL. So there weren't any reasonable options. And this was in the early eighties, right, when the Japanese fifth-generation project was getting going and saying you should write everything in Lisp.
So we said, why not? We'll try it. And of course, the problem is Lisp was the slowest thing on the planet. We got a very early version of Postgres to do something, and response times were two orders of magnitude longer than they had to be. So we threw Lisp over the cliff and transliterated the code back to C.
And Postgres became the C project. But we would've loved to have done it in C++, but we were just a little too early.
So Postgres is a really fantastic transactional database. It was row-based, and Vertica was one of the first databases that kind of flipped that on its head and was more columnar in nature. Why is that such an important switch in how you think about databases, and tell us what Vertica solved in the market.
Mike really educated me on this; the amazing thing about column-oriented databases is that the first actually good one was written back in the sixties, I think.
Isn't that right, Mike?
Sybase IQ was arguably the first commercial one, and it was in the early eighties. Yeah, so we did not invent columnar databases.
But Mike recognized it, right? Mike and his academic colleagues wrote a great paper, “One Size Fits All”: An Idea Whose Time Has Come and Gone, about purpose-built database systems.
And it was clear to me, after reading the draft of that paper, that a lot of us had been using row-oriented databases, which were really designed to optimize for writes as data came into the database, and we were trying to use them to do reads. And Oracle didn't really want to rewrite their whole system to change their storage method.
And so Larry Ellison and Mike's and my good friend Jerry, who was Mike's very first postdoc at Berkeley, invented this thing called materialized views in Oracle, which kind of enabled Oracle to do read-oriented stuff but didn't really make it cost-efficient. And so by the time the early 2000s rolled around, people were using Oracle and DB2 and SQL Server to do all these unnatural acts of large-scale read-oriented workloads. And Mike really inspired a whole generation of people to recognize that you should be using built-for-purpose data systems.
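The materialized views Andy mentions trade write cost for read speed: the database stores a precomputed query result and keeps it fresh as the base data changes. A toy Python sketch of the idea (illustrative only, not Oracle's implementation):

```python
class MaterializedSum:
    """Toy materialized view: maintains a precomputed per-key SUM as rows
    are inserted, so reads are O(1) lookups instead of full-table scans.
    The cost moves to write time, which is the trade-off involved."""
    def __init__(self):
        self.rows = []   # the base table
        self.sums = {}   # the "materialized" aggregate

    def insert(self, key, amount):
        self.rows.append((key, amount))
        # Incremental refresh: update the view on every write
        self.sums[key] = self.sums.get(key, 0) + amount

    def total(self, key):
        # No scan of self.rows needed at read time
        return self.sums.get(key, 0)

mv = MaterializedSum()
mv.insert("sears", 100)
mv.insert("sears", 250)
mv.insert("walmart", 80)
```

This is why materialized views helped read-heavy workloads on a row store but, as Andy says, never made them truly cost-efficient: every write pays the maintenance cost, and you need a view per query shape.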
And Vertica, for us, was the first one of those. And then, we did a spin-out from Vertica that was called VoltDB. That was an OLTP system. Mike, what am I missing?
I think that at least for me, the data warehouse market really began in the mid-1990s. And it was retail guys like Sears and Walmart who began putting transactional sales data into a repository and these repositories paid for themselves within six months with better buying decisions and better stock rotation.
So the warehouse market was catapulted into existence by the retail guys. And everybody followed suit, because historical customer-facing data has a lot of value. And it shocked me when I realized what the average data warehouse looks like: the best way to design data warehouses is to have a fact table in the middle and some dimension tables surrounding it.
Fact tables are 50 or 100 fields wide, and the average data warehouse query fetches three or four of those 50 to 100 fields. And if you organize your world as a row store, you read all the rest of the data too, because it's stored in line. So if you want to go fast, you've got to read only the four columns you need and not all the rest of them.
So you've got to have a column store, and a column store is an order of magnitude faster than a row store on warehouse-style data. And that was the market Vertica went after. And I think they had a fabulous product, and they were led by world-class entrepreneurs.
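The arithmetic behind Mike's point can be sketched with a toy I/O-cost model in Python (the row counts and field sizes here are assumptions for illustration, not figures from the episode):

```python
# Toy model of scan cost for a warehouse query needing 4 of 100 columns.
num_rows = 1_000_000
num_cols = 100       # width of the fact table
bytes_per_field = 8
cols_needed = 4      # columns the query actually touches

# Row store: the fields of each row are stored contiguously ("in line"),
# so a scan reads every field of every row.
row_store_bytes = num_rows * num_cols * bytes_per_field

# Column store: each column is stored contiguously, so the scan reads
# only the columns the query needs.
col_store_bytes = num_rows * cols_needed * bytes_per_field

speedup = row_store_bytes / col_store_bytes  # 100 / 4 = 25x less data read
```

With these numbers the column store reads 25x less data, which is the rough origin of the "order of magnitude faster" claim for warehouse-style scans.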
One of the things that was the most fun was that Mike and I were really lucky to work with the likes of Shilpa Lawande, who ran engineering, Chuck Bear, who was our lead architect, and Colin Mahoney, who ran Vertica for most of its life.
And one of the most satisfying things was when we'd go into a big Oracle customer. They'd have a huge HP Superdome running Oracle with all these materialized views and all; it was like a multimillion-dollar instance. And we'd roll in these little three-node clusters, because Mike and I believed in shared-nothing architecture and very cheap commodity hardware. We'd load the same data onto the little cluster that they had on this big, massive configuration. And the queries on their big Oracle box would take 24 hours. We would run the same queries subsecond.
It was disbelief. They're like, how is it possible? What's the catch? How can this little thing over here run things faster than this big-ass thing over there? And we're like, you've just been sold a bill of goods by Larry Ellison for the last 30 years. If you represent the data in the way it's going to be queried, all you need is this little cluster of $3,000 computers.
A year after you launched Vertica was when Hadoop was first released, and there are some connections between this team here and the founders of Cloudera. So I'd love to hear a little bit about some of your opinions about the NoSQL movement, the rise of Hadoop, the eventual fall, and what connections you had to the project.
Let's start in 2004 when Google wrote MapReduce. At the time, everybody assumed that Google knew what they were doing, which in many fields they do, but in databases, they were just naive babes in the woods. And Yahoo basically wrote Hadoop, which is a clone of MapReduce, a perfect clone. And the problem with MapReduce is it isn't good for anything.
And Google discarded MapReduce in 2011, I think. MapReduce was purpose-built to support Google's crawl of the Internet, and on the very application for which it was purpose-built, Google decided to abandon it. Basically, because MapReduce was a batch system, and Google, by this point, needed their crawl to be interactive, so they moved their crawl over to Bigtable.
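For readers who haven't used it, the programming model being debated here boils down to a user-supplied map function and reduce function with a batch barrier between them. A toy single-machine sketch in Python (not Google's or Hadoop's actual API):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy single-machine MapReduce: map each record to (key, value)
    pairs, group by key, then reduce each group. Real systems shard
    both phases across a cluster, with a batch barrier in between,
    which is why the model struggled with interactive workloads."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The classic word-count example
docs = ["the crawl", "the batch crawl"]
counts = map_reduce(
    docs,
    map_fn=lambda doc: [(word, 1) for word in doc.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
```

Every job, however small, runs the whole map-then-shuffle-then-reduce pipeline end to end; that batch structure is the property Mike is criticizing for decision-support and interactive use.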
Is your assertion that even back in 2004, there were superior design patterns that, ideally, they would've used?
Oh, absolutely. So what happened was MapReduce got all this PR from “Google must know what they're doing, so this must be a good idea.”
In 2009, we wrote a paper that benchmarked MapReduce, comparing the open-source version of Hadoop against parallel data warehouse systems, which were an order of magnitude faster. So on decision-support queries, MapReduce is no good at all, for all kinds of technical reasons, and that was widely known in 2009.
There were dueling papers: CACM invited dueling papers on MapReduce and on parallel decision-support databases. Those papers appeared in 2010 or '11; I can't remember exactly. And the Google guys were arguing that MapReduce worked well if you were very careful to tune it, and very careful about the problems that you tried to address with it, and that it had redeeming social value.
And the criticism from the parallel database guys was withering: this crap is no good, it's not flexible enough, nobody wants it. So it turns out two things happened that pretty much rendered this whole discussion moot. The first was that Google abandoned MapReduce in 2012. The main proponent of MapReduce said, this is not what we want anymore.
And the second thing was that all kinds of enterprises had been sold a bill of goods saying MapReduce is a great thing: you should buy a cluster, put Linux on it, put Hadoop and MapReduce on it, and your enterprise programmers and customers will love it. So lots of enterprises bought in and then found out that nobody wanted MapReduce in the enterprise.
It just wasn't flexible enough to do any interesting decision-support queries. So here was a system that nobody wanted, that Google had abandoned, but enterprises had spent large amounts of money building out clusters for a MapReduce market that didn't exist. So now Cloudera and Hortonworks and others had a big problem.
From the Vertica world, it must have been like you've got a better thing, and there's all this buzz around Hadoop and MapReduce and everything was so gigantic in this time period. It just has to be a really interesting experience to have better technology but roll up into an enterprise and have to combat this wave of enthusiasm.
Yeah, Mike is right. And the big mistake we made was not open-sourcing Vertica. Mike Olson is a marketing genius, but the open-source nature of the product was a really powerful dynamic. And many people were sick of being charged too much money by IBM, Oracle, Microsoft, and everybody else.
And so they were looking for something that was open source. And when I say we should have open-sourced Vertica: if we had done that, I think it would've become the default for a lot of folks. There was another system called Greenplum that was similar in some ways, but arguably we were a lot better technically.
But we held on to this idea that if you've got great software, you should charge a lot of money for it. And I think that cost us, relative to the overall adoption of Hadoop.

And the real shame in that whole thing is that a lot of people I know, and Colin Mahoney felt this way when he was running Vertica, would see these customers who bought Hadoop as if it were a database, spent three or four years and millions of dollars, and then realized it was never going to do what they wanted, as Mike was describing. And then they were like, what do I do now? And along came Colin and Shilpa and the team and provided a system that actually worked. At one point Vertica actually enabled running on HDFS, which made those customers feel warm and fuzzy about the money they had paid Cloudera, but gave them a real query-oriented database on top of it.
And then the other thing that happened was that the cloud data platforms came along just in time: AWS with Redshift, which was not so good at the beginning but then got better after they built their compiled version, and then Google with BigQuery.
When Mike and I met with the Bigtable team back in 2004, we were pretty committed at Vertica to having a system that had SQL on top of it and that was ACID, with transactional integrity. Two classic things in database systems, right? And the Bigtable team was like, yeah, we don't really care about transactional integrity, eventually consistent is good enough, and we don't care about SQL because we have a lot of smart programmers at Google who can write their own queries. And Mike and I were like, oh, okay. But eventually, you'll probably want transactional integrity and SQL. And sure enough, things caught up with them.
One of Mike's ex-students was at Google, and we had lunch with him one day. He was really depressed, and we're like, what's wrong? He's like, I spend all of my time writing queries for these bozo business people who can't write queries. And we're like, we told you; if you didn't put SQL on top of it, you were going to have to write the queries. And he was like, I know.
So the evolution of these database systems is, in some ways, maybe for Mike and me, very predictable. And, you could say, we keep doing the same stuff for 30, 40 years. One of the downsides of doing that is you watch people make the same mistakes over and over again.
I don't know, Mike, it'd be great to hear what you think, but we see the same kind of thing going on now with federated data systems. So data mesh and data fabric are all the rage. And we've seen the industry go back and forth between aggregated and federated over the last 50 years.
And federated design patterns are useful. But eventually, you've got really challenging performance issues unless you have some global query optimizer, which nobody has really built yet. So I'm skeptical that this next round of federated systems is going to deliver on all the promises.
I'm just waiting for it to, maybe not go the way of Hadoop, and I would never want the Starburst team to hear that, but I think it's a bit overhyped right now. I don't know, Mike, what do you think about all this data mesh and data fabric stuff?
I think, back to the previous question: Google now believes in transactions, believes in SQL. They got religion, and they got it when they hired some really good database people. And so I think the cloud guys were just very arrogant. They said, we're really smart, we can build a database system, and we don't need any database experts; we can do it ourselves. And so they built stuff that was not what they needed, and they eventually fixed it. So I think a certain amount of not-invented-here and arrogance on the part of the major cloud providers slowed them down by a decade.
And maybe the opposite is true of Snowflake, right? Because Mike and I are good friends with Thierry, Benoit, and Marcin, who started Snowflake, and they are great database systems people. They totally know what they're doing.
So being resource-elastic in the cloud is important. But now it's table stakes; that's no longer your advantage. Andy, any final thoughts from you on modern database technologies? What are we seeing that's exciting? What's maybe overhyped?
I think that Mike and I have been through so many different rounds of people moving their data around and putting it in lots of different places.
And our real mission at Tamr is to help people make use of their data: get the data clean, curated, and continually updated so that lots of consumers can use it. And so we're ready for people to stop moving their data around. You moved it to Snowflake, great. Now make the data high quality and continuously updated. We have a vested interest in that, not only because it's what Tamr's interested in, but because we just think it's time, right? Data warehouses never met expectations. Data lakes were a failure.
These cloud data platforms, to make them worth the effort of moving your data in there, you have to make the data great so that lots of people and other machines can use it.
There are lots of purpose-built databases now, and they're generally pretty good, but the problems are still there, and you have to move up the stack to solve them.
That's exactly right. Yeah.
I'll close with one statistic, which is about data scientists. I've talked to a lot of them, and I ask them the following question: what percentage of your time do you actually get to spend doing data science? And no one claims more than 20%. The other 80% is spent on data preparation, data integration, data cleaning; basically, data munging.
And so, data scientists spend most of their time doing something other than data science. And so we have to get wildly better at data preparation because it's pretty much unacceptable for data scientists to spend four days a week on what they consider grunge work. And so Tamr and others are trying hard to help out.
And dbt, this is why we love you guys so much.