The Analytics Engineering Roundup
Ep 23: The Bundling vs Unbundling Debate
Can Tristan, Benn and David make any sense of this very open question?
A debate has erupted on data Twitter and data Substack: should the modern data stack remain unbundled, or should it consolidate?
In this conversation, Benn Stancil (Mode), David Jayatillake (Avora) and our host Tristan Handy try to make some sense of this debate, and play with various future scenarios for the modern data stack.
Key points from Benn Stancil, David Jayatillake and Tristan Handy in this episode:
Sarah Krasnik asks: "How do you define the modern data stack? And are you using one?" And the author of the Data Engineering Weekly newsletter writes: "The modern data stack is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks' lives more complex." What do you see as the core of this conversation?
So, I think it's partly historical. People remember a time when data stacks were more consolidated, whether that was Teradata, Oracle, or Microsoft, and they think, "oh, that was better in a way, because even though it couldn't cope with everything we needed it to do, it was all in one place and there were far fewer tools".
But having said that, it didn't work very well. It wasn't scalable. If you think about how things like web development and software engineering have gone, they have a very complicated stack of many different tools that join together to deliver the best solution for any given organization. And I could see us going that way.
My view is that originally, as David said, there was basically a breaking apart of a few big tools into their component parts; the major pieces got pulled out. So there's clearly this kind of unbundling, I suppose, of the monolithic BI tools and monolithic warehouses.
It blew up, and now there are tons of companies looking for places to insert themselves. So I agree with the idea that things being broken apart may be better than one giant thing. But I think we've probably gone way too far, mostly because there's just a bunch of money chasing problems. And I think that's going to create a very difficult experience.
I don't think that's a bad thing. This is what happens; this chaos is part of the natural process of technology improving. But for now, we have to figure out how to deal with that mess. Just because the individual pieces of technology are good doesn't mean the overall experience is good.
The way I would put the bundling and unbundling bit is that it's a little bit of a semantic argument, kind of for the sake of having something to say. In reality, it's just that we do data differently now. We're kind of reconstituting: the thing we had before is being rethought, and some pieces are going to be unbundled and some pieces are going to be bundled.
I don't think that in software engineering land in the two-thousands, when you were seeing the rise of web frameworks, there were arguments about bundling and unbundling.
I frequently find myself in a lot of these debates thinking that it's not actually one or the other; it's how do you get both. I think about this from a composability standpoint, about creating experiences for users, as opposed to trying to win on technology.
We're not trying to build technology. Yes, technology is part of it, but at the end of the day this is a big problem, and to me it's more akin to: why isn't there a CRM stack? There isn't; Salesforce just built a better CRM than everybody else, tacked on a bunch of stuff, and everybody bought Salesforce.
This is a bigger thing than Salesforce, and Salesforce is a bigger thing than the CRM, but I think it's closer to that experience than it is to web development frameworks, where it was just technology.
Time and time again, we see that infrastructure is best built on open source and open standards. You tend to find a winner-take-all solution at each layer of the stack: everybody uses it, it's open, and everyone just builds on top of it from that point forward. So I think we're trying to build commercial solutions in infrastructure in a way that, historically, infrastructure doesn't really get built. What do you think?
I think it's really interesting, and it links back to how things are composable, and to Benn's point: can everyone just pay 50K for every little piece of the stack? I think no. But many of these things are open source, and many are adopting a product-led growth model where they're not costing 50K, they're costing 10K or less. Suddenly, if they also compose well, with data flowing through something like Apache Arrow, could it actually be workable, much like how you can pick and choose from the Rails ecosystem?
So this made me think of Rails too, actually, but in a different way. Rails, as I understand it, is a model-view-controller framework.
The view is a commercial product that you buy: you put in your credit card, you log in to whatever product, and you get to look at it. That's just commercial products.
The model, to me, feels like the actual infrastructure: it's warehousing, it's things like that. In those cases it's sort of open source, but you don't care, because you're really just paying for compute and storage; you're paying for infrastructure in a commoditized way. The controller is the interesting layer, and dbt very squarely sits in that. Tristan, you have a much better perspective on this than I do.
That one, to me, feels like where the open-source frameworks actually make the most sense. The model part of it, okay, yeah, you have some open source, it's all Postgres or something under the hood, but not really. The view is commercial. The controller feels like the piece where the open-source frameworks really make a huge difference, because it's translating: it's how we basically make the data layer into the modeling part of the stack.
As Databricks developed their more Pythonic, ML-type functionality, and BigQuery also rolled out an integration with serverless Spark, is it possible to imagine a way that all these things start to look more similar than different?
Yeah, I think so. And this goes back to the point Benn was making about the model in that model-view-controller relationship. The model of the data is more complicated because, inside a web backend, you're just talking about a relational database that's abstracted away, or a NoSQL database. Whereas for what we have to do in data, we have to build these really complex layers of entities and metrics and semantics that are then usable. That's where the model is deeper in data than it is in software engineering, in many instances.
But yeah, can more happen inside that model? Yes, I think so. Even if it's just to the left and to the right of the DAG, where things could happen before and things could happen afterward but still fit within it, that alone is hugely powerful compared to what we have today. And you've got tools like fal entering and enabling that space that are quite exciting.
I guess the first thing is Databricks. To me, it feels like the biggest $40 billion mistake. They're obviously great. But it feels like they tried to market this thing as a complicated piece of technology that nobody quite gets. The way I always think of Databricks is as a tool that people smarter than me use. Everybody I know seems to have that same feeling: "I don't know how to use Databricks, but people who are smarter than me seem to, and it seems powerful". And Snowflake basically seems to have come along and said, "Well, we built Databricks, but it looks like a SQL database". Everybody can use that.
The question I have on this compute thing is: is there anything stopping these companies from essentially just saying, "Hey, we store all of your data in one place, basically in S3, and we put a bunch of different engines on top of it"? Snowflake's traditional engine is just a SQL thing that looks like Postgres, and I pay for that engine.
Rather than this kind of "oh, we've got a Spark integration, it does all these sorts of things" confusing, monolithic, enterprise piece of software, what I really want is just: all my data lives in one place, and I can connect to it with different compute engines that speak different languages.
Right now that's kind of what Databricks does, it's kind of what Spark does, but in a way that feels very hard to get your head around. And so I think we could actually just solve this by saying, "Hey, actually, here's a way to connect to it with Python", and I suspect Snowflake will do this.
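The "one storage layer, many engines" idea Benn describes can be sketched in a few lines. This is a purely hypothetical illustration (the in-memory "object store", file names, and engine functions are all invented for the sketch): two independent "engines" compute different things over the same shared files, and neither owns the data.

```python
import csv
import io

# A shared "object store": one set of files, written once.
# In practice this would be S3/GCS; here it's an in-memory dict of CSV blobs.
object_store = {
    "events/2022-01.csv": "user,amount\nalice,10\nbob,25\n",
    "events/2022-02.csv": "user,amount\nalice,5\n",
}

def read_rows(path):
    """Any engine can read the shared files; none of them owns the data."""
    return list(csv.DictReader(io.StringIO(object_store[path])))

def sql_like_engine_total(paths):
    """Engine #1: a 'SQL warehouse' computing SUM(amount)."""
    return sum(int(r["amount"]) for p in paths for r in read_rows(p))

def python_engine_by_user(paths):
    """Engine #2: a 'dataframe-style' engine grouping amounts by user."""
    totals = {}
    for p in paths:
        for r in read_rows(p):
            totals[r["user"]] = totals.get(r["user"], 0) + int(r["amount"])
    return totals

paths = ["events/2022-01.csv", "events/2022-02.csv"]
print(sql_like_engine_total(paths))   # 40
print(python_engine_by_user(paths))   # {'alice': 15, 'bob': 25}
```

The point of the sketch is that the two engines only agree on a file format and a location; swapping one engine out never touches the stored data, which is the property being asked of Snowflake and Databricks here.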
How far apart can dbt pull compute and storage? Can those be different companies? And is there anything that actually, long term, stops dbt from being a compute layer that sits on top of S3 and cuts Snowflake out of this entirely?
So, on the question of whether dbt can do X: I think our preference is to do as little as possible when it comes to compute and storage. As little as possible; in the limit, zero.
I spoke at the Subsurface conference recently, on a panel with Ryan Blue from Iceberg. This is towards the edges of my deeply technical knowledge, but Iceberg is a table format, not a file format; Parquet is a file format. Iceberg is a way to organize a series of Parquet or other files in cloud storage and have a unified, metadata-driven way to figure out which file to go to when you run a certain query.
That kind of feels like a big part of what a database does. But the interesting thing is that it actually doesn't have a SQL or other endpoint to connect to, to do the processing. It doesn't have the compute layer; it's just the table layer. And so Dremio, the host of Subsurface, loves this, because they are a compute layer that doesn't natively have storage; they don't have a strong opinion about how you should be storing data. We've historically talked about this stuff as data lakes.
But the interesting thing about the historical data lake paradigm is that it's just been a compute engine paired with a shit-ton of Parquet files. I think we're actually getting better at this table layer, which I think is the right abstraction at which to pair storage and compute. And Snowflake is supporting Iceberg, and I think others are supporting Iceberg too. So you could start to imagine what you were just describing: I've got all my data in this one platform, and I can access it via multiple different engines.
I think that is certainly one approach, and my guess is that all the data platforms will want to move towards that world. But there is this other approach: all my data is in Iceberg tables, and I can have two contracts, a Snowflake contract and a Databricks contract. Each one of them is a little bit better at the thing it's best at, and they're both reading and writing the same set of files.
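To make the table-format idea above a bit more concrete: the heart of what's being described is a metadata layer that maps a table name (and a point in time) to a concrete set of files. Below is a toy sketch of that resolution step. The table name, snapshot IDs, and file names are invented, and real Iceberg metadata is far richer (schema evolution, partitioning, column statistics); this only illustrates the "which files do I scan?" question.

```python
# Toy model of a table format: metadata that answers
# "which files make up this table, as of which snapshot?"
table_metadata = {
    "name": "analytics.orders",
    "snapshots": {
        1: ["orders/data-000.parquet"],
        2: ["orders/data-000.parquet", "orders/data-001.parquet"],
    },
    "current_snapshot": 2,
}

def files_for_query(metadata, snapshot=None):
    """Resolve a (possibly time-travel) query to the files an engine should scan.

    Any compute engine that understands the metadata can plan against
    the same files; the metadata, not the engine, owns the table.
    """
    sid = metadata["current_snapshot"] if snapshot is None else snapshot
    return metadata["snapshots"][sid]

print(files_for_query(table_metadata))              # latest snapshot: both files
print(files_for_query(table_metadata, snapshot=1))  # snapshot 1: the first file only
```

This is why "two contracts on the same tables" is plausible: each vendor's engine resolves files through the same shared metadata rather than through its own private catalog.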
Do we sort of change the whole dynamic of what these software products look like, where enterprise products start to look more like app development: they can be built by small teams, they don't all have to be venture funded, and people can just build them with a handful of people and make pretty good money doing it?
I think so, because when you've gone for that product-led growth approach, you don't have to have a huge amount of money to fund building those apps. A product like Avora is already a collection of data apps, two or three data apps already.
So if that's plugged into one of those app ecosystems, then rather than having tens of customers, you've got thousands of customers, even if you charge maybe one tenth of what you might charge an enterprise customer.
The economics can still make sense, and the friction for the customer drops to just: "Oh, I want to connect this Avora anomaly detection app from the app store, and then my data flows to it and I get to use it." That lack of friction is amazing. There's some huge value in that.
I'm excited about this future and the different ways you'll be able to build companies. Folks may have followed our journey over the years, but at the outset, once we started taking dbt seriously as a product, we were hoping to build it towards a model that looked a lot like Basecamp, the company.
So: a couple of dozen people and a lot of credit card swipes. It turns out that's just not how data works, because the enterprise is really where a lot of the total dollars are in aggregate, and because the data you have to plug into to do this stuff is so highly sensitive, you have to go through legal and sales and all of these things. So we were naive, and we updated our priors.
But in a world where these types of guarantees in the enterprise come from the platform or the app store instead of from the individual vendors, we could have built a totally different company. There's a non-trivial number of conversations that happen at dbt Labs today about things that are not actually about the experience of an analytics engineer. They're things like: when are we going to launch a multi-tenant control plane in the EU? It's incredibly important, and there's a huge number of human beings who really want that from us, but it has literally nothing to do with the fundamental innovation. And there are many, many companies solving those same problems over and over again.
I have a little bit of experience in this, and it seems like we've seen that change a lot in some verticals; there are tools you can now sell to the enterprise without going through that process. And I kind of wonder if data can get there. That's basically, to me, what this is: can data get to the point of the consumerization of IT? Is that a thing that can actually happen? I don't know. Maybe. But it certainly seems like we're inching closer to it.
Do we have concluding thoughts around bundling and unbundling?
I do think there will be some bundling, right? There are some things that are very similar disciplines. If you think about ETL and reverse ETL, it's possible for those to become bundled, because the discipline is very similar; it's just directionally different. Observability and discovery could be bundled, those sorts of things, because again the discipline is so similar: it's about metadata management and lineage and things like that.
And I think that's where maybe the use cases have become so fine-grained that you ask: did they need to be that split? Sure, there could be some bundling. There's that almost horizontal bundling that could happen, and then the vertical bundling, which I think could be more of that app-marketplace thing. But I think anything and everything will happen at the same time. There will be fully bundled solutions, like GCP, for example, which has a fully bundled solution, and there will be completely unbundled solutions as well, as there are today.
Yeah. Like, I dunno, it's as if we just detonated a bomb in the middle of the whole thing.
My view is that it doesn't make sense to say: oh, what's going to happen is we're going to bundle Airflow with compute, or we're going to bundle data dictionaries with observability, or we're going to unbundle ETL. To me, it doesn't make any sense for us to build a roadmap on top of something that is in such flux.
There are some places where you can kind of point and say these things look like they may be coalescing, or whatever. But the places where I think there is enough mass that they don't split apart are data pipelines, ELT, and warehouses and storage. Though apparently maybe that splitting is coming down the road, I think for a long time that'll stay, along with a transformation and governance layer.
Beyond that, how those things split apart, I don't know. But certainly at Mode we're not going to make big bets on this shaking out in one particular way or another, because the dynamic is too chaotic for us to make a whole lot of sense of it or to predict one way, which is maybe a boring answer.
Even though dbt is sitting at the center of a lot of this, I don't want to come down on one side or the other of this question either. Later today, as we record this, I'm going to attend a product demo where some folks internally have hacked together early, experimental Snowpark support, and we've got dbt Python models participating in the DAG and all of this stuff.
And so, okay, great. I'm not saying this is about to ship tomorrow, but I'm very curious about that experience and how it will feel, and whether it will be as magical an experience as dbt on SQL has been for so many people. I really think that a lot of what we're talking about in the bundling and unbundling conversation is stuff you need a little less of if your graph is all contained in a single structured format. I think you don't need a separate lineage tool, and your observability challenges are a little different than they would have been otherwise.
So I'm not strongly opinionated that that's where everything is going, but it's something I'm very curious about. And it could be a capital-V, capital-G Very Good thing for a lot of people if that's the way things end up evolving.
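As a footnote on the dbt Python models mentioned above: a dbt Python model follows the same ref-based dependency pattern as a SQL model, just expressed as a function. The sketch below is illustrative only: the `FakeDbt` harness, the model name `stg_orders`, and the list-of-dicts data are all invented so it runs standalone; in a real project dbt supplies the context object at runtime, and `dbt.ref` returns a Snowpark or Spark DataFrame rather than a plain list.

```python
# Shape of a dbt Python model: a function named `model` that takes the
# dbt context and a session, refs upstream nodes, and returns a dataset.
def model(dbt, session):
    orders = dbt.ref("stg_orders")  # hypothetical upstream node in the DAG
    # Trivial "transformation": keep completed orders only.
    return [row for row in orders if row["status"] == "completed"]

# --- Standalone harness (invented; real dbt provides this context itself) ---
class FakeDbt:
    def ref(self, name):
        # Stand-in for the DataFrame a real dbt.ref() would return.
        return [
            {"order_id": 1, "status": "completed"},
            {"order_id": 2, "status": "returned"},
            {"order_id": 3, "status": "completed"},
        ]

result = model(FakeDbt(), session=None)
print([r["order_id"] for r in result])  # [1, 3]
```

Because the Python model declares its inputs through `ref`, it slots into the same dependency graph as SQL models, which is exactly why lineage stays inside one structured format.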