Ep 28: Building an Open Source Company (w/ Aaron Katz of ClickHouse)
The trials, tribulations and joys of building a company around a wildly popular open source software project.
ClickHouse, the lightning-fast open source OLAP database, was initially released in 2016 as an open source project out of Yandex, the Russian search giant.
In 2021, Aaron Katz helped form a group to spin it out of Yandex as an independent company, dedicated to the development + commercialization of the open source project.
In this conversation with Tristan and Julia, Aaron gets into why he believes open source, independent software companies are the future. And of course, this conversation wouldn't be complete without a riff on the classic "one database to rule all workloads" thread.
Key points from Aaron in this episode:
What is ClickHouse? Help us understand when ClickHouse is a good choice or what kind of problems ClickHouse should be great at.
Well, let me start by saying it's essentially a database technology, so the use cases are very diverse. And I think the two primary benefits that we were hearing from the community are largely around speed and performance, and then storage efficiency for real-time queries across a number of attributes in high-volume workloads.
This can be a challenge in practice for large-scale use cases. If you think of web and mobile analytics, BI, observability, and IoT, the data set is massive to begin with, but continues to grow over time, and all of that data, the historical data and the latest data that's streaming in, needs to be queried and analyzed simultaneously.
To get the result set from a SQL query across a dataset measured in petabytes, and to get those results in the same amount of time it takes for a webpage to load, let's say two hundred milliseconds, is really where people have found ClickHouse is unique in its architecture and performance.
Do you think we should really think about ClickHouse doubling down as a data warehouse solution to compete with other players in this space, or do you think that's limiting in how you describe the technology?
That's a great question. I'll start by granting you that, yes, when ClickHouse was originally open sourced in 2016, it indeed had only partial, limited support for SQL joins and mutable data. Since then, the team has made considerable strides in improving both of those areas. So it would now be a bit of a myth to claim that, though obviously there's no finish line when it comes to these specific feature sets.
In 2018, ClickHouse introduced batch updates and deletes in preparation for GDPR. Currently, our join syntax follows the SQL standard, and it has a number of other useful extensions.
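The batch updates and deletes mentioned here use ClickHouse's `ALTER TABLE` mutation syntax rather than standard row-level `UPDATE`/`DELETE` statements. A minimal sketch of a GDPR-style erasure, using a hypothetical `events` table invented for illustration:

```sql
-- Hypothetical table for illustration.
CREATE TABLE events
(
    user_id UInt64,
    ts      DateTime,
    payload String
)
ENGINE = MergeTree
ORDER BY (user_id, ts);

-- Batch delete (a "mutation"): affected data parts are
-- rewritten asynchronously in the background.
ALTER TABLE events DELETE WHERE user_id = 42;

-- Batch update works the same way.
ALTER TABLE events UPDATE payload = '' WHERE user_id = 42;
```

Mutations are asynchronous and relatively heavyweight; they are intended for occasional bulk changes like compliance erasure, not for OLTP-style frequent row updates.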
We still think, for example, that some vendors have done a great job in terms of the separation of compute and storage, and in making their cloud-based data warehouse service more like an API endpoint. So I think there's some pattern recognition there we could consider. But in the use cases we're seeing now, companies will migrate from traditional data warehouse technologies, from what many would consider legacy technologies like Teradata, and from cloud-based technologies like BigQuery and Redshift.
And then we see a lot of coexistence, where there are very efficient data warehouse storage engines that perhaps don't offer the same responsiveness in terms of query execution. So people will migrate their data out of those data warehouse technologies into ClickHouse for aggregations and reporting.
I polled the dbt community to hear how people were using dbt + ClickHouse, and the feedback was that they use ClickHouse very much like you would any other warehouse. It's really stunning to me how different that use case is from many of the other ways people use ClickHouse. Does that create challenges or opportunities for the company, to build for such a diverse set of use cases?
It creates both. They're typically related as you know.
Look at observability, for example. I would tend to agree with the assessment that mining observability data for business analytics is an opportunity that is not well addressed in the market today. One reason for this is that it's simply technically difficult, because detailed observability data, so infrastructure and application logs, metrics, and traces, is typically high volume but is only needed for troubleshooting for a relatively short period of time. Let's measure that in days or weeks.
Meanwhile, BI analytics typically requires a much longer look-back period, right? Months or even years. So there are a couple of different solutions. First, you can aggressively summarize the data and archive or delete the original. But that implies you know your BI use case extremely well. Alternatively, you can move the original data to a data warehouse, but as you mentioned, that can be expensive and it's a fundamentally different technology.
So I think ClickHouse and related technologies offer a slightly different answer, where you can get the analytics and you can get the cost efficiency and scale, because ClickHouse is cost-efficient enough to make it reasonable to store that raw data indefinitely, even if its uncompressed size is measured in petabytes. But if you decide to cut your costs even further, let's say by rolling up or trimming that historical data, ClickHouse also provides the reporting features that you need for those real-time analytics use cases.
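The trimming of historical data described above maps naturally to ClickHouse's table TTL feature, which expires old rows automatically at the storage layer. A minimal sketch, with hypothetical table and column names:

```sql
-- Raw observability data; rows older than 90 days are
-- dropped automatically by background merges.
CREATE TABLE logs_raw
(
    ts      DateTime,
    service String,
    message String
)
ENGINE = MergeTree
ORDER BY (service, ts)
TTL ts + INTERVAL 90 DAY DELETE;
```

The retention window is declared once on the table, so no external cleanup jobs are needed; a longer-lived rollup table can sit alongside it for the BI look-back period.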
So I do think there's going to be a convergence here, and we're seeing it in the market today: a company that wants to use ClickHouse for observability for the reasons we described, but also wants to do BI reporting that could span a very long time horizon, which with competing technologies can be extremely expensive.
Generally in databases, you have a series of trade-offs. If you want to be better along a certain dimension, you have to be worse along another dimension. Can you talk about the trade-offs involved, and what is the secret sauce that has allowed you to advance beyond what previously existed?
Let me start by again prefacing this with the fact that I'm not an engineer by training, just someone who has spent a long time in the industry. I am a contributor to ClickHouse, but my range is limited. And it's the exact question, Tristan, that I asked when I thought about creating this company: what is the competitive advantage? What's the secret sauce? What makes ClickHouse so different from traditional OLAP databases, which have been around for decades? And I've found a few different things.
The first is obvious, which is the column-oriented storage, right? And the performance benefits and efficiency gains you get with that approach. How Alexey and his team have thought about data compression over the years, both for low and high cardinality use cases, we've found to be very unique, and we hear that frequently from the community. Then there are things like vectorized query execution, and we're not the only ones that offer this.
Vectorized query execution means you're not simply processing one row or one value at a time; the engine operates on batches of column values, so a single operation is applied across what could be thousands or millions of rows at once. Then there are things like materialized views, where you've got an origination table, a source table, with results being cached. So for queries that you're running more than once, you benefit from the efficiency and speed of the fact that you've already run that query in the past, and you've got a new table to run it against.
So those are a few examples that we've found routinely come up in user conversations about what makes ClickHouse unique.
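The materialized-view pattern described above can be sketched in ClickHouse SQL. The table and column names here are hypothetical, chosen just to illustrate the source-table-plus-cached-rollup idea:

```sql
-- Source table receiving raw page views.
CREATE TABLE page_views
(
    ts   DateTime,
    page String
)
ENGINE = MergeTree
ORDER BY ts;

-- Materialized view: rows inserted into page_views are
-- pre-aggregated into daily counts as they arrive, so the
-- repeated reporting query reads the small rollup table
-- instead of rescanning the raw data.
CREATE MATERIALIZED VIEW daily_page_views
ENGINE = SummingMergeTree
ORDER BY (day, page)
AS
SELECT
    toDate(ts) AS day,
    page,
    count()    AS views
FROM page_views
GROUP BY day, page;
```

Unlike materialized views in many other databases, ClickHouse updates the view incrementally on insert rather than on a refresh schedule, which is what makes it work as a real-time query cache.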
Now with ClickHouse you have one technology that can do lots of different things, and you have a language that's universal to a lot of different people. Do you think we will have a convergence, where we can get real-time data and analytical data in one place? Is that part of the journey at ClickHouse? Does that problem excite you?
Well, it does, because using SQL as a standard enables integration with a lot of complementary technology that also leverages and relies upon SQL. And you mentioned Grafana as a great example.
I mean, that's the beauty of open source: the ability to integrate with a variety of other technologies and rely upon the community to help guide where you prioritize your work. And Grafana is not the only front end that we're thinking about writing an integration to. You've got some emerging technologies, like Superset and Metabase, that we're excited about. You think about data ingestion, and you think about not just Kafka, but Kinesis and Redpanda. And even SQL query engines like Presto and Trino. There's again a bit of overlap here, but fundamentally the use cases can be quite distinct.
And following, whenever it was, around 2000, there was this emergence of NoSQL databases that really rose in popularity. What I observed was, invariably, these technologies needed to write their own query language, and then ultimately many fell back on using SQL. And I think you may have mentioned ClickHouse is written in C++; obviously, it leverages SQL as the query language.
So I do think the industry is standardizing on SQL as the primary interface, and it's really aiding in writing integrations between what I believe to be very complementary technologies. And again, open source enabling that collaboration between software products that reside in different companies is really accelerating the development pace.