HTAP Databases
Do we actually need so many different databases? Or can we shove them all into a single cloud infrastructure and behind the same SQL API?
Before diving in, an acknowledgement: last week the Roundup linked to a post that was a blatant plagiarism of a brilliant a16z post from 2020. Turns out the content is still highly relevant ~1.5 years later! Matt, Jennifer, and Martin: sorry about that, very much our fault.
A Database that can do Everything.
Working at dbt Labs is super-fun. I posted this message in #roundup in our internal Slack:
Is anyone aware of a word that describes a data storage system that optimizes for multiple different query access patterns? For example: imagine a database that does both columnar and row-based storage and chooses between the two for a given query based on the query planner’s understanding of the query.
What followed was a 19-post-long thread of pure joy, which then spawned a 45-minute-long live conversation during my office hours where Natty gave us a totally impromptu deep dive into the history of databases. This was almost too much fun, and we’re going to try to get him to give the same talk at Coalesce 2022.
I started the conversation because I thought it was highly relevant for the moment in time we’re at in the industry, but I quickly realized that I wasn’t the right person to write about this. So I invited my much-more-qualified cofounder, Drew, to give the topic his own personal treatment. The result is a pleasure—enjoy.
- Tristan
Computers are rocks we tricked into thinking.
That’s the thing that I remember while reading whitepapers about HTAP: Hybrid Transaction/Analytics Processing databases. Databases are just specialized computers after all, and a rock will tell you anything if you torture it for long enough.
We all know the story of the two different kinds of databases. On the one hand, you have the transactional database: good at reading and writing small numbers of rows very quickly, but bad at table scans. On the other hand, you have the analytic database: good at aggregating and filtering data over an entire table quickly, but slow for reads and writes in an application context.
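To make that contrast concrete, here’s a rough sketch of the two query shapes (the orders table and its columns are invented for illustration):

-- Transactional pattern: fetch a single order by primary key.
-- A row store serves this with one index lookup, touching one row.
SELECT order_id, customer_id, status, total_amount
FROM orders
WHERE order_id = 48291;

-- Analytic pattern: aggregate across the whole table.
-- A column store reads only the two columns it needs and scans them in bulk.
SELECT date_trunc('month', ordered_at) AS order_month,
       sum(total_amount) AS revenue
FROM orders
GROUP BY 1
ORDER BY 1;

Each storage layout makes one of these queries fast and the other comparatively painful, which is exactly the gap an HTAP system is trying to paper over.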
There are - of course - more than two types of databases. You’ve got your key/value stores (Redis), your search-optimized databases (Elasticsearch), your timeseries databases (Influx), your streaming databases (Materialize), your graph databases (Neo4j) and a whole host of other databases that are built to service specific use cases and query patterns. The workloads that these databases can service are the results of very intentional trade-offs around how data is stored, indexed, and processed inside of each database.
While we tend to think of databases (and computers in general) as being “logical” things, they really are “physical” machines. The amount of time it takes a computer to load a byte from disk or memory is very real and very perceptible when you need to do it a billion times. The optimizations at play in a database are primarily intended to minimize the amount of data that needs to be moved around because moving data (especially out of disk or over the network) is very slow. For databases — and the companies that build them — time is money.
The hundred billion dollar question is: do we actually need so many different databases? Can we not just take all of these different databases, stack them on each other’s shoulders, cover them with a trench coat, and sneak them into an R-rated movie? On the one hand, Michael Stonebraker argues no: we cannot have a one-size-fits-all database. The technical and commercial considerations for these different databases are too divergent for an all-in-one database to be viable. On the other hand, Reynold Xin argued to me that actually yes, we can combine these disparate systems and unify them under a common interface for consumers. That would sure be nice, but if it were possible, wouldn’t it have happened already?
Historically, there were good reasons to keep these kinds of databases separate. For one, the database market matured in the era of on-premises computing. In the on-prem world, resource contention was a real thing — it would be undesirable if BI workloads negatively impacted transactional workloads (i.e., serving the application), for example. Second, different types of databases might all speak SQL, but the query engine required to service each workload might be radically different. From a complexity standpoint, it’s cheaper and easier for a vendor to sell two different databases than it is to try to combine them into one.
It doesn’t take very long to realize that two key parts of this calculus have changed in the past decade:
The ubiquity of cloud computing
The scale of $100+ billion companies
Cloud computing won’t magically erase the complexity inherent in a unified transactional and analytic database, but it’s certainly easier to build and maintain this kind of system and deliver it as SaaS than it would be to run it on-premises. With practically infinite and very cheap storage and compute, you can imagine radically new approaches to database storage and query execution that wouldn’t have been feasible a decade ago. The cloud data titans certainly have the resources to try, and I think the market opportunity is there to justify the investment.
I think this is exciting! But I also think it has the potential to be concerning for practitioners. Databases are a winner-take-all market, and while the promise of The Database That Can Do Everything Built for the Cloud™ feels compelling, there’s something to be said for best-of-breed solutions and the avoidance of lock-in.
My personal guess is that the major cloud data platform vendors have gone deep enough into the adoption curves of their existing workloads that they’re starting to think about growth in adjacent markets. I think we’ll see rapid convergence first around batch + streaming capabilities (here and here) or SQL + imperative programmatic workloads (here, here, and here) as a precursor to the eventual leap towards unified analytics and transactional databases.
Ultimately, this hybrid database architecture might be just what is needed to make every product a data product.
Like rocks, I have also been tricked into thinking this weekend. This was my first-ever Roundup, so hit that reply button and tell me what you think! This was fun, I might have to do it again. And if you want to hear more Drew hot-takes, hit me up on Twitter (@drewbanin) or dbt Slack (@drew.banin).
From around the internet…
🤓 Benn’s most recent post is about the path to teaching a billion people to code. It draws significant inspiration from Excel and is optimistic (a rarity for a self-declared curmudgeon!). I’m in strong agreement with the substance of the article, but want to take the argument in another direction.
There are three main types of Excel formulas:
single-cell inputs mapped to single-cell outputs
range inputs mapped to single-cell outputs
range inputs mapped to range outputs
These three types map directly to levels of cognitive load. It’s extremely easy to write and debug formulas of type #1 (=A1+B2), somewhat harder to do so with #2 (=sumifs(…)), and quite tricky with #3 (=arrayformula(…)). The reason for this is intuitive: each successive type requires you to do more processing in your head. Remember how hard it was to reason about summation notation when you first learned it in high school math? There’s a reason this doesn’t get taught in second grade.
My guess is that 90% of people never get past type #1, 9% of people stop at #2, and only 1% of people ever get to #3. Just a guess. I personally find arrayformula() tedious and try to avoid reasoning about it whenever possible.
Some types of computation are hard for human brains in a way that most people just don’t enjoy, in the same way that I don’t really like running very much: it’s work! Our brains have evolved to be lazy.
None of this is to disagree that interfaces to computation are important and should be as accessible as possible. But if we want a billion people to learn to code it may have at least as much to do with instilling a love of math in second graders, with getting kids hooked on solving puzzles and releasing little bits of dopamine, with rethinking education from the ground up to be more human.
When was the moment you realized you enjoyed working through complex problems in your head? What positive reinforcement did you get from your environment that you should keep doing this?
🤔 Amit Prakash’s first-ever Substack was great; his second one maintains the high bar. What’s so fantastic about Amit’s writing is that, as the CTO of ThoughtSpot, he has a unique perspective on the bleeding edge of data interactivity. This post focuses on curiosity, which is a framing that I rarely hear used but find so incredibly helpful. Typically we think about creating higher levels of abstraction as a way to enable our less technical colleagues to answer their own questions (“self-service”). But really, we should want higher-leverage interfaces too.
Forget about the analytics engineering process for a second; when you are in the flow of trying to explore the space of a business problem, the thing you want is to minimize the time to insight. If you could write the SQL to express a question in 90 seconds but had a different way to ask that question that would only take you 10 seconds, that’s simply better. You will likely need to ask 10, 20, 100 questions before getting where you need to go—minimizing the friction for each one is critical.
Besides which, the “being curious about a business problem” headspace is extremely different from the “write a complex query” headspace. When you’re in the curiosity flow state, you want to stay there at all costs.
What if, instead of thinking about some tools as “enabling self-service” we thought about “empowering curiosity”?
🔥 Simon Späti argues that we need an analytics API built on GraphQL. I could not agree more, and he makes the case better than I ever could. At some point will net new data products stop connecting to databases and instead connect to GraphQL endpoints? Probably…
🧱 Furcy Pin asks whether there is a place for Spark in the modern data stack. Excellent post; my only point of disagreement is that much if not all of what Furcy says about BigQuery could as easily be said about Databricks’ SQL endpoints and Photon runtime paired with Delta. So Spark is very relevant.
The bigger point, though, is one that has been bouncing around the ecosystem over the past ~year: there is real utility in the dataframe API, and there are real challenges around just how far you can push the Jinja + SQL model. This post kicked off a thread internally about testing—one of the ways in which we experience this pain the most is that Jinja + SQL is just not that easy to unit test. (Unit testing code is very different from testing underlying data.)
Over some long enough time horizon I would be surprised if dbt didn’t support multiple different ways of expressing transformations, very much in line with this thinking.
💲 Census raised a massive Series B.
🗄️ An Introduction to Modern Data Lake Storage Layers. Data storage matters, and this layer is going to start mattering more. Great explainer on Hudi, Iceberg, and Delta.
> we can combine these disparate systems and unify them under a common interface for consumers
I strongly agree with this perspective. Any query provided by a consumer first starts as a question in their mind, expressed in the "language of the business" and not in the language of databases. Certainly they don't start by thinking "ah, I think I'd like to query our OLAP database X and write some SQL for table T1 and view V2". It would be more like "I wonder which weekday we produce the largest number of widgets in the US". Then this idea is translated into some kind of high-level query plan, i.e., which db and tables to query. This translation from business semantics to databases and tables is done manually - but the high-level query is never recorded!
I think the key idea that's kind of floating around is to express the query as two distinct parts:
1. the topology-independent query - it does not refer to dbs, tables, or views, just to some high-level semantic model
2. an elaboration expression that maps the query in #1 above to an actual query on topological objects
If the system maintains some semantic metadata, #2 doesn't have to be fully manually specified but can be somewhat guided. For example, given a query of type 1, the system can show various db/table/view options to execute the query and the person can pick the one they think makes sense. The main advantage here I think is de-coupling expression #1 from your implementation details. If you add a new db later, or even normalize/denormalize your existing database, you just modify expression 2, while expression 1 stays intact. It seems better to separate the business-language expression from the topological-implementation expression?
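As a rough sketch of the idea (every table name here is invented), take the widget question from above. Expression 1 is the business question itself; expression 2 is whichever physical query you choose to bind it to:

-- Expression 1 (topology-independent):
-- "Which weekday do we produce the largest number of widgets in the US?"

-- One possible expression 2: bind it to a normalized production schema.
SELECT to_char(p.produced_at, 'Day') AS weekday,
       count(*) AS widgets_produced
FROM production_events p
JOIN facilities f ON f.facility_id = p.facility_id
WHERE f.country_code = 'US'
GROUP BY 1
ORDER BY widgets_produced DESC
LIMIT 1;

-- A different expression 2: bind the same question to a pre-aggregated table.
-- Only the mapping changed; the business question did not.
SELECT weekday,
       sum(widget_count) AS widgets_produced
FROM daily_widget_production_us
GROUP BY 1
ORDER BY widgets_produced DESC
LIMIT 1;

If the system records expression 1 and owns the binding, swapping databases or reshaping tables only touches expression 2.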
I'm not sure this means that there is only one database. However, I think it implies there is one high-level, topology-independent data model and query language, and multiple implementation databases that it can be mapped to.
Anyway, this is all "in theory", but quite interesting IMO. I often feel most of our logic is coupled to too many implementation details, while it should only be coupled to a higher level semantic model.