Discover more from The Analytics Engineering Roundup
Ep 22: One Database to Rule All Workloads?
Jon "Natty" Natkins of dbt Labs helps untangle the explosion of new database options that have emerged in the past decade.
Will the dream of a mythical database to handle all workloads (transactional + analytical) ever become a reality, or does it violate the laws of physics?
This question sparked a hearty debate internally at dbt Labs, and Jon "Natty" Natkins joins Julia here to continue the conversation.
Natty knows databases. He began his career in 2009 as a software engineer working to build database systems at Vertica (and then Cloudera).
Natty’s career has traced the rise and fall of Hadoop, the transition from Hadoop to cloud data warehouses - and from his post on the Solutions Architecture team at dbt Labs, Natty has studied closely the new database architectures that have emerged over the past few years.
Listen & Subscribe
Key points from Natty in this episode:
What do you do at dbt Labs? Tell us a little bit more about your profession and how you help out.
Sure. So I'm a little bit of a data janitor. I lead the solutions architecture team for enterprise west here at dbt Labs. Roughly what that means is that my team works directly with our customers and prospects to help them learn about dbt, evaluate our cloud product, and really just get business value out of what we're building. In my particular territory I get to work with some of the largest organizations in the world, so it's just a really exciting opportunity to see all the crazy ways that people use data and how they use dbt.
What was so exciting about Hadoop in the early days? Was it just that it could process massive amounts of data at a time? Tell us a little bit more about the promise.
It's a good question. What happened? The internet happened. I think that's the root cause of what was going on.
Because the columnar databases all charged per terabyte, which is a hilarious thing to think about now that everybody is operating at terabyte scale. But because they were these on-premise MPP systems, you couldn't handle a ton of data in them, and it was very expensive to purchase and operate these licenses. And so when the internet happened (I guess I'm being facetious, the internet happened a lot earlier), when people really started processing data that was being generated on the internet, it was creating all this data exhaust. This is where the three V's came from: volume, velocity, and variety. In the early days you just had the large volume, and to some extent the velocity, and that was the columnar database days. But when people started looking at what Google was doing out in the world and how they were using data, they were like, "Oh, I want to do that too".
And Google conveniently was publishing all these really interesting research papers. MapReduce was the original, and basically everything that became the ecosystem was a reimplementation of some sort of Google system. And so it allowed people to start to process a wide variety of data. You mentioned my Substack, which is called Semi-Structured, and one of the things that really came about with Hadoop was the ability to deal with semi-structured data. Initially, that looked like logs, right? Logs have some sort of structure, in that there's a log4j template or something like that. But JSON is also semi-structured: it's got a general structure to it, but it varies from record to record. And traditional databases just weren't good at processing that, though that's a little bit less true nowadays with the cloud warehouses. Hadoop was really a file system with a distributed processing system on top of it. MapReduce was kind of assembly code for data processing, right? SQL is a very high-level language from that perspective; MapReduce was writing Java classes to process this stuff.
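To make the "assembly code" comparison concrete, here is a minimal sketch of the MapReduce programming model as a word count, the classic example. This is a hypothetical single-process illustration (real MapReduce jobs were Java classes running across a Hadoop cluster); it just shows how you spell out the map, shuffle, and reduce phases by hand where SQL would be a one-line `GROUP BY`.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (word, 1) pairs, the job a Mapper class would do.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group values by key; in real MapReduce the framework's
    # shuffle/sort step does this between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Sum the counts per word, the job a Reducer class would do.
    return {word: sum(counts) for word, counts in grouped.items()}

logs = ["error disk full", "error network down", "ok"]
counts = reduce_phase(shuffle(map_phase(logs)))
# counts == {"error": 2, "disk": 1, "full": 1, "network": 1, "down": 1, "ok": 1}
```

The SQL equivalent is roughly `SELECT word, COUNT(*) FROM words GROUP BY word`, which is the sense in which SQL is the "high-level language" here.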
So it enabled all of these really interesting use cases that people just couldn't tackle before, and I think some of it was just driven by the art of the possible, right? People were excited about data and what some of these big internet companies were starting to do with it.
And they were like, there's value there, we're going to go extract that value. And in order to extract that value, we need to capture more data and we need to process more data, and Hadoop could do it a lot more cheaply, right? HDFS was way cheap compared to a Vertica or a Greenplum or something like that. The economics were dramatic.
Cloud data warehouses really hit their stride in 2012/2013. The big shift away from Hadoop and onto cloud data warehouses came very quickly after. What did that feel like in the market?
Yeah, a really great question. And it's funny thinking back to all these stops that I made. When I was leaving Vertica and going to Cloudera, people were like, "Oh, Hadoop is a fad". And when I was at Cloudera and these cloud data warehouses were starting up, I was looking at it like, "Oh, half these workloads are never going to make it to the cloud, banks will never adopt the cloud". It was just a progression of these strongly held assumptions that fairly rapidly proved wrong. And honestly, that's what's been the most interesting thing for me: watching what happened with Hadoop, where it was this really new technology that got caught up in a larger secular trend.
And so when I think about why Hadoop rose and fell so quickly, I think it's because there was this much bigger cloud trend. It's like surfing: if you miss the exactly right part of the wave, you get a much shorter ride. That's kind of what happened to Hadoop. They hit the wave and it broke on them in a weird way.
So it was really fascinating seeing these cloud data warehouses starting to pop up. Redshift came around in 2012, and Snowflake was around the same time. The interesting thing is that Redshift, for example, actually came out of the MPP databases, right? I don't know how widely it's known, and hopefully no one will get angry at me for opening the closet and showing where the skeletons are, but Redshift was ParAccel. ParAccel sold their codebase to Amazon before they kind of went down in a fire sale. Maybe this could be my hot take of the episode.
But it was fascinating. Redshift came out of that, and Snowflake is really interesting because Redshift had all the funding of AWS behind it and is still doing really well. So it's fascinating to see how that ended up landing.
There's a lot of variety of cloud warehouse platforms still in the market. What is the need for all of them? In your opinion, can the market still support another half a dozen winners? Or do you think that there's going to be more convergence around the platforms that we have?
Yeah, I think that's a really good question. And just coming back to a point you made in passing: I agree with you that the cloud data warehousing market here is easily $300 billion today, probably even higher than that. You look at the net retention numbers that Databricks and Snowflake are both putting up, and it's just pretty incredible to see how fast it's moving. Databricks just posted 150%. I think Snowflake's last earnings were like 173%. That's insane. It is just growing so fast.
So to that point, I think that's a large enough market that it can definitely support multiple winners. The other question you're asking, about these other, different entrants, is certainly interesting. The ones you mentioned, like Trino and Starburst, all have interesting, different strategies, right? Trino is really sort of next-gen federation. It's not a virtualization system, but it sort of feels like that, in the sense that you can query data from a single interface even though it lives in different systems. And there's a bet on the market there. If you talk to them, they're talking a lot about data mesh, right? They're really on board with that data mesh idea, because their theory is that every major enterprise is going to have one of everything, which probably isn't inaccurate. A lot of these organizations have a bunch of things, and internally, sometimes that's strategic: their procurement teams are thinking, well, if we've got one of everything, we can play them off of each other and get better terms. That can be really valuable, as long as there are use cases for all these different things.
And so Trino is super interesting, but it's also a very upmarket solution, right? You're not going to find SMBs adopting Trino, because they don't have three different data warehouses; they buy one. So Starburst definitely has to figure out: what's our downmarket solution? Because you can build a strong business on a high-end enterprise offering, but in the long term, Snowflake and Databricks make tons and tons of money off of the SMB and mid-market, and that's something I just don't think is accessible to Trino.
We ended up in this world where you have to buy multiple warehouses. So it becomes just a battle, not on the customer, but more on like workloads. What do you think about that?
Yeah, it's definitely interesting. I feel like Snowflake comes at it more from that workload perspective, where you've now got the SQL interfaces to Snowflake, and you've also got Snowpark, which is Java, Scala, and Python interfaces to the same storage. And I agree it's really hard to have one system that handles all of these things, because they're different querying patterns, different insert and update and delete patterns, and that stuff matters. Many systems can handle some of those things, but not others.
And once you get into the land of the NoSQL systems, or the geo-distributed databases like CockroachDB, you get into those questions of the CAP theorem: consistency, availability, and partition tolerance. You start to make sacrifices on some of these things in order to satisfy certain workloads, right? If you're geo-distributed, you'd better be partition tolerant, and you'd better be highly available, but you're probably not consistent, because if you write in one part of the world, it's going to take a little while before it gets to the other part of the world. You have to choose two of those three, and I think that's where the idea of having a single system to handle all these different workloads gets really hard: you're sacrificing some of these properties of distributed systems. But for OLTP and OLAP, there are definitely common workloads where I think a Snowflake could probably say, "We're going to do this workload and this workload and this workload, and we're just going to put it all behind a SQL interface". Snowflake has enough scale that they could probably solve for some sort of query routing to different systems and shift that storage around. It's hard to know.
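The consistency sacrifice described above can be sketched in a few lines: two geo-distributed replicas with asynchronous replication, where a write to one side isn't visible on the other until replication catches up. This is a hypothetical toy model, not any real database's replication protocol.

```python
class Replica:
    """A toy key-value replica in one region."""
    def __init__(self):
        self.data = {}
        self.log = []  # writes not yet shipped to peers

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def read(self, key):
        # Always answers (highly available), even if stale.
        return self.data.get(key)

def replicate(source, target):
    # Apply pending writes to the other region: the "little while"
    # before a write in one part of the world reaches the other.
    for key, value in source.log:
        target.data[key] = value
    source.log.clear()

us, eu = Replica(), Replica()
us.write("user:42", "upgraded")
stale = eu.read("user:42")   # None: replication hasn't run yet
replicate(us, eu)
fresh = eu.read("user:42")   # "upgraded" once the write propagates
```

The `eu` replica stays available throughout, but its reads are stale until `replicate` runs: availability and partition tolerance at the cost of consistency, exactly the trade-off Natty describes.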
Can only one database rule all data systems?
So I think you have all these different data systems that are coming out of these different use cases. And I think that kind of gave rise to what Tristan was talking about in his Analytics Engineering Roundup the other week. He was talking about HTAP: can you have one database that handles multiple workloads? HTAP, which is hybrid transactional/analytical processing, showed up around the same time as the NewSQL systems, when NoSQL was just starting to become a term.
And it was basically systems that were genuinely oriented towards traditional OLTP workloads, like the stuff you would do with a run-of-the-mill Oracle system, but that could also do some element of analytical processing as well. So it was trying to be the best of both worlds. One of the earliest versions was VoltDB, where you basically had compiled stored procedures that you would write, which could run transactions really wicked fast. But they could also run these analytical types of processing. Not as well; it was definitely much more of a high-velocity transactional system.
And then MemSQL, which is now SingleStore, was one of the more widely adopted versions. Interestingly, they actually started as this in-memory row-store database, but I think later on they realized that a lot of people really wanted analytical processing on the side of that, so they built kind of a Vertica clone. They might not call it that, but they were modeling it after Vertica, just to sit alongside their row store. So by having those different systems, you could sidle them up alongside each other and have them each perform really well. And that kind of pattern shows up all over database land.
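The row-store-plus-column-store pattern can be illustrated with the same records laid out both ways. This is a hypothetical in-memory sketch, not SingleStore's actual storage format: point lookups favor the row layout, while analytical aggregates scan one compact column instead of touching every field of every row.

```python
orders = [
    {"id": 1, "customer": "acme", "amount": 120.0},
    {"id": 2, "customer": "acme", "amount": 75.5},
    {"id": 3, "customer": "initech", "amount": 300.0},
]

# Row store: one contiguous record per order,
# ideal for OLTP-style point lookups and updates.
row_store = {o["id"]: o for o in orders}

# Column store: one array per field, ideal for
# OLAP-style scans and aggregates over a single column.
column_store = {
    "id": [o["id"] for o in orders],
    "customer": [o["customer"] for o in orders],
    "amount": [o["amount"] for o in orders],
}

lookup = row_store[2]                  # OLTP: fetch one full record
revenue = sum(column_store["amount"])  # OLAP: scan just the amount column
```

An HTAP system like the one described above effectively maintains both layouts side by side and routes each query to the one it suits.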
It's always very hard to predict big waves like the cloud. Do you think this is the final wave in data? Is there going to be another enormous wave that we can't yet imagine that will replace the big systems that we have today?
I'm sure this is not the final state. There will never be some final state unless the sun gives out. But I am certainly bullish on the Icebergs and the videos of the world. As we've been saying, having a single system is hard.
And while I think a single system appeals to smaller organizations who want to buy one thing, I do think there's going to be that world of disparate systems. So there's definitely a world where Starburst's data mesh concepts are very prominent. But it also might be that you've got Iceberg as this storage layer that can seamlessly move from system to system, and in some ways it might end up transparent to a lot of the end users, where they don't know they're all querying the same data. Vendors are moving this way: Snowflake and Cloudera just announced support for Iceberg, and AWS now supports Iceberg. If they're all just connecting to some Iceberg service, then it could theoretically just be another pass-through.
So it's interesting that you're pulling apart the storage and the really low-level layer, while the processing layer still has the query bit built into it for most systems. Although things like Trino are sitting on top of that.
Maybe what's going on here is just that Hadoop screwed it up, right? Hadoop was kind of the decomposition of the entire database: everything was just a different component of a warehouse, and we just had the wrong interfaces to connect them all together. So in some ways, the stack that includes Databricks and Snowflake, with Tabular underneath it and Starburst above it, could just be the next mulligan, right?
We got a first run with Hadoop, and this is actually what it should have looked like, now that we've made the initial cloud mistakes and understand a lot more about what as-a-service systems should look like.
More from Natty: