Ep 27: "To Move, or Not to Move" (Data). That is the Question.
Diving into the possibilities of distributed query engines + the data mesh architecture with Justin Borgman of Trino / Starburst.
Justin Borgman is the co-founder, Chairman, and CEO of Starburst, and has spent almost a decade in senior executive roles building new businesses in the data warehousing and analytics space.
In this conversation with Tristan and Julia, Justin dives into the nuts and bolts of open source Trino and explores how to build a data mesh without making it a mess.
Listen & subscribe from:
Show Notes
Key points from Justin in this episode:
The creation of Presto starts to edge up against the Starburst story. So can you tell the Presto-into-Trino-into-Starburst story?
So in 2014, Teradata buys Hadapt. My team and I become part of Teradata. I became a VP and GM focused on emerging technologies. I like to say that I had the miscellaneous bucket: the other folks had the core Teradata data warehousing appliance, and I had everything else that didn't quite fit into the core business.
The gentleman who was the sponsor for the acquisition, Scott Gnau, who was the president of Teradata; I reported to him and he gave me a very long leash to explore the future of data warehousing and analytics. And it was in that context that I met the guys at Facebook who had created Presto, and Presto was still very early. It was open-sourced in 2013, and we got to Teradata in 2014. So it was really only used by a handful of Silicon Valley internet giants; Airbnb and Netflix were early to use it. But a very limited, high-end audience of internet hyperscalers.
And the theory that we had was: could we actually reinvent Teradata around the future of data warehousing and analytics being data source agnostic? To me, what was most exciting about Presto was not just that it was a better SQL engine for Hadoop, which I think is what a lot of people thought when it first appeared: "okay, here's another SQL engine, and it's a little better." It was really a SQL engine for anything.
And that to me hit a nerve, because we saw a lot of customers who were in transition and had a variety of different data silos. They were maybe on-prem moving to the cloud, moving from Teradata to Redshift; Snowflake was in its early days at that point. And this represented a way to potentially access everything and create what I like to call optionality, really giving customers the freedom to decide what all those different data sources are going to be at the end of the day.
That kind of began a collaboration, and I think a lot of people don't even know this part of Presto's history: Teradata engineers were actually making Presto great at the time. Facebook was as well, of course, but the Teradata contribution never fully got the credit it deserved, I feel. Along the way, our team at Teradata became some of the leading contributors. If you look at the top 10 contributors of all time, half of them are basically Teradata people because of the contributions that we made during that period.
And so by 2017, we had started to see some real momentum behind the open-source project. More companies were able to use it. You didn't have to be Netflix or Airbnb to deploy it. You could be maybe a more mainstream company, and that represented an opportunity for us to go out and try to build a business around it.
What was a bit unusual, although you have some interesting beginnings as well as a company, is that we started as a bootstrapped company. We didn't raise any venture capital for the first couple of years; we just basically served the open-source community, supported them, started to build enterprise features, and created an enterprise edition. The first 10 months were strictly support contracts.
What are some of the trade-offs that you have with Trino? Because you don't have the ability to change the format on the storage layer, you have to work across a wide variety of different types of databases.
So that's a great question. First of all, I would say with respect to Databricks, we're probably a bit more philosophically aligned with Databricks than we are with Snowflake, in the sense that Delta is just another format for us that we can query, collect statistics on, and run the same types of analytical queries against.
And so the real answer to that question is that we rely on the underlying formats that we're querying to organize the data in a performant way. So, to Tristan's earlier question about the evolution: from the flat files, CSV files, and JSON files that people were querying in the very early days of Hadoop, to Parquet, Avro, and ORC, which are columnar representations that give you faster read performance, to now the newest generation, which is Delta, Iceberg, and Hudi. Those are to me maybe the final frontier of this open data format evolution, because they now allow for updates and deletes in addition to performant querying.
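To make the "updates and deletes" point concrete, here is a rough sketch of what that looks like through a Trino-style engine on an Iceberg table. The catalog, schema, and table names are invented for illustration, and this assumes an Iceberg catalog (called "lake" here) is already configured; the point is simply that the newer table formats support row-level changes that plain CSV, Parquet, or ORC directories could not.

```sql
-- Hypothetical Iceberg catalog named "lake"; the underlying data files are
-- still stored in an open format (Parquet), but the table format adds the
-- transactional metadata that makes row-level changes possible.
CREATE TABLE lake.sales.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    status      VARCHAR,
    order_date  DATE
)
WITH (format = 'PARQUET');

-- Row-level updates and deletes, which earlier file-only layouts couldn't do:
UPDATE lake.sales.orders
SET status = 'shipped'
WHERE order_id = 1001;

DELETE FROM lake.sales.orders
WHERE status = 'cancelled';
```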
So I think there's been a convergence over this decade-plus period of time. The distance between the performance you can get out of a data warehouse and the performance you can get out of a data lake started off being way, way apart. I personally think Cloudera got themselves into a little bit of trouble by overselling that in the early days, when it wasn't quite there yet. But it has caught up in a significant way. You can get very, very similar performance even with these open data formats, as opposed to controlling a proprietary format yourself.
Have you seen any patterns that have worked really well with your customers when they're thinking about where to keep data where it is versus moving it to a centralized place?
When you give people a lot more choices, you're giving them more opportunities to do the wrong thing sometimes. So we want to write a blog post about this, sort of "when to leave it and when to move it." We haven't written it yet, but maybe it will land around the time this podcast comes out.
And it varies a bit. I mean, this is actually something I'm curious to get our solution architects' take on, since at Starburst we are working with customers every day. But I can make a couple of high-level conclusions. For highly curated data where you have real performance SLAs, those are definitely cases where you probably want to pull it into, we would argue, a data lake, just because we think that gives our customers more freedom.
You could put it into a data warehouse, but we would have a preference for a data lake, storing it in one of these data formats that we were talking about, whether it's Parquet, ORC, Delta, Iceberg, etc. I guess that's my short answer.
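As a small illustration of "pulling it into the lake," here is what that move might look like as a single federated statement, assuming a hypothetical PostgreSQL catalog ("postgres") holding the source data and a lake catalog ("lake") backed by one of the open formats above. All names are made up for the example.

```sql
-- Land a curated, SLA-sensitive dataset in the lake as an open-format table,
-- reading straight from the operational system through the same engine.
CREATE TABLE lake.curated.daily_revenue
WITH (format = 'PARQUET')
AS
SELECT order_date,
       SUM(amount) AS revenue
FROM postgres.public.orders
GROUP BY order_date;
```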
What are some of the things that you've seen that must happen, both from a technological side and from more of an operational side, to get a data mesh without it becoming a mess?
Great question. And you characterized it exactly the right way: it is people, process, and technology. It's not strictly technology, and we remind our customers of that all the time. As much as we would love to say "just buy Starburst and you've got a data mesh," it's not quite that simple.
So on the technical side, we do think having a query engine, or a query fabric as some might call it, is a valuable piece of the equation because it opens up the ability to access decentralized data. But there's also this notion of putting more autonomy and control in the hands of the data producers themselves, and allowing them to play a much more meaningful role in the analytical ecosystem than they have historically, because traditionally a data producer produces the data and then just throws it over to a central data warehousing team that probably doesn't understand it.
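A minimal sketch of the "query fabric" idea: one query that joins data in place across two independently owned systems, without copying either side first. The catalogs and tables here ("postgres" for a relational source, "hive" for lake data) are hypothetical.

```sql
-- Federated join: each side stays where its owning team keeps it.
SELECT c.region,
       COUNT(*) AS click_count
FROM postgres.crm.customers AS c        -- operational database owned by one team
JOIN hive.weblogs.click_events AS e     -- lake data owned by another team
  ON c.customer_id = e.customer_id
WHERE e.event_date >= DATE '2021-01-01'
GROUP BY c.region;
```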
I mean, in my 10 or 12 years of doing this, I can't tell you how many times the central data warehousing teams at our customers don't actually know what queries are being run or what the use cases are, because they're focused on the technical aspects and don't necessarily understand the data itself.
So data mesh helps to solve that problem as well, by allowing the domain owners, the people who really understand the data, to play a role. We think part of making that happen is starting to think about data as a product. We've been evolving our own product in this direction to allow data producers to curate their own datasets, and sometimes these are materialized views across multiple database systems as well. They can then be shared with the rest of the organization and consumed. So that's one step in how we're trying to further that, but part of it is just a new way of thinking, and you have to get more and more people on board.
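As a sketch of the "data as a product" idea, a domain team might publish a curated, cross-system dataset as a materialized view that the rest of the organization simply queries by name. This assumes the engine and catalog support materialized views (in Trino that depends on the connector), and all names here are hypothetical.

```sql
-- A domain-owned data product spanning an operational system and the lake.
CREATE MATERIALIZED VIEW lake.marketing.customer_360 AS
SELECT c.customer_id,
       c.segment,
       SUM(o.amount) AS lifetime_value
FROM postgres.crm.customers AS c
JOIN lake.sales.orders      AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment;
```

Consumers then just run SELECT * FROM lake.marketing.customer_360 without needing to know where the underlying data lives.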
Sometimes central IT can feel threatened by a data mesh, because they might say, "Well, am I giving up too much control to the edge?" I think central IT still has a really important role to play by managing the infrastructure of this data mesh. But you're putting more power in the hands of the data producers to curate the right datasets, which is ultimately going to drive more consumption across the board and further this notion of self-service and data democratization that everybody wants.
Where do you think we are actually in the cloud adoption for the data ecosystem? And what do modern data stack zealots just get wrong about the realities of really large companies?
So I'll answer the first one first. It'll be a guess, maybe an educated guess, but a guess.
I'm going to say we're somewhere in the 50/50 range in terms of the move to the cloud. As an aside, I think we've made a tremendous amount of progress in the last 10 years.
When I was doing that first company, people were talking about the cloud; it was actually a very popular buzzword, but probably 1% of the data was in the cloud in 2010 or 2011. So I think we've moved a major way, but I do think something like half of the data is still sitting on-prem. And I think the thing that people don't realize, or sometimes underestimate, is the tendency toward heterogeneity. It's almost like the natural evolution toward chaos in the universe, in the sense that different departments are going to do their own thing, new application developers are going to create their own databases somewhere, and you're going to buy companies and inherit their data stacks.
This notion of everything living in one clean, central place is something that has never been possible in history, even if you look at what Teradata tried to do over 40 years, and I don't think it will be possible in the next 40 years either. Part of that data will likely continue to live on-prem, particularly for large enterprises.
There's also the counter-trend. I'm not saying the world goes this way, but there are some very large companies that realize they're spending way too much money and then move back, right? There are even a few examples of that.
So I think we're just going to live in a very heterogeneous world, maybe forever. And even in the cloud context, even for companies that are fully in the cloud: we see them say, "We're just going to be an AWS shop," and then they buy a company and now they have data in Azure and Google too. So you're going to have multi-cloud as well.