Ep 2: Venkat Venkataramani of Rockset on the Future of Real-time Analytics

Step with Venkat into a world where data is always fresh, queries run in 1ms, and analytics engineers build web-scale, real-time data apps.

As Engineering Director at Facebook, Venkat helped build the RocksDB real-time database that powered growth to 5 billion queries per second(!)—and now with his colleagues at Rockset, he's bringing that real-time database infrastructure to the rest of us.

In this conversation, Tristan, Julia and Venkat explore the fundamental technological advances that are empowering analytics engineers to enter the real-time future.

Listen & Subscribe

Listen to the full episode from the player below, or find your player of choice from the links beneath it.

Show Notes

Key points from Venkat on the history and future trajectory of real-time analytics:

How Google and Facebook transitioned from batch to real-time

Even Google initially had to rebuild their whole search index every so often, every hour or every day or whatever. That's what Google looked like in the early 2000s. People don't remember that Google went from batch to real-time, and the first version of Facebook's newsfeed was batch.

They would do ETL jobs on everybody's activity, accumulate all the data, build a newsfeed for everybody and then say, "Hey, here is your newsfeed." And it wouldn't refresh. Now, can you imagine a world where Google search or Facebook newsfeed is not real-time? So this is a one-way street.

The world is going to go there, and you won't even remember. "Wait, people did that?" And this was not even that long ago. People have already forgotten because they take it for granted.

Is real-time analytics really necessary for the average company?

If you just think about it, you know, people will always want faster horses and not automobiles. I think that's my best answer, but let's debug what you just said. So to me, if it is extremely complex to do real-time analytics, then even if the data comes to me earlier, there isn't really anything I can do about it.

Like I, as a human, I'm not going to be checking this every minute. I'm gonna look at this once a week. I don't need this to be fresher than that. This is where I think the world is, this is batch.

This is why, the minute you talk about real-time analytics, you will see they all become applications. They all get automated. At that point, once real-time analytics is possible without cost and complexity, your productivity goes up because you don't even have to look at it once a week and say, holy shit, since Tuesday, everything has been going haywire.

All my ad spending has yielded zero. What am I doing? What would happen if you actually go into the real-time future is that on Tuesday, five minutes into that bad period, you'll get an alert saying, "Hey, something is off. You need to pay attention." Now, imagine the DevOps world living in batch.

Where you can't look at how your servers are doing. You can't get alerts, you don't get real-time monitoring, you can't do anything. You can't do the job. And so they are already in that era. They're already living in real-time. No DevOps person will say, "Give me batch." It's a one-way street. And this is what I mean by acquisition: it's not enough to just acquire data in real-time, you have to unlock analytics, sub-second analytics, interactive analytics. This is why, at Rockset, we spent a lot of time figuring out how to index data in real-time, because what's the point of accumulating data in real-time if every query takes 20 minutes? You might as well just batch it at that point.

But if my queries come back in sub-second time, my data is coming in real-time, and everything is fresh, then I can automate so many of the things that I do, and it yields so much more of a productivity boost. And it's not like all these analysts will not have anything to do.

No, actually they'll be working on the findings. They will be reacting to what's actually happening in the business and spending more time on growing their business, growing their revenue, growing their user base, and what actually matters for the company, as opposed to waiting for the warehouse to return a report 20 minutes, 40 minutes later.

So to me, there will always be a place for strategic kinds of reports over a long period of time. There'll always be a need for batch analytics, but for every use case that is done in batch, there are 10 new use cases in real-time analytics, which will probably be a real-time application where you don't even think about it as real-time analytics.
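The ad-spend alert described above can be illustrated with a small sliding-window check. This is only a sketch under stated assumptions: the AdEvent fields, the one-event-per-minute stream, the five-minute window, and the "spend but zero conversions" rule are all hypothetical choices made for illustration, not anything Rockset or the episode specifies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AdEvent:
    # Hypothetical event shape, invented for this sketch.
    minute: int        # minutes since the start of the stream
    spend: float       # ad spend recorded in that minute
    conversions: int   # conversions attributed to that minute


class SpendAlert:
    """Flags a full window in which money was spent but nothing converted."""

    def __init__(self, window_minutes: int = 5):
        self.window_minutes = window_minutes
        self.window = deque()

    def observe(self, event: AdEvent) -> bool:
        self.window.append(event)
        # Evict events that have fallen out of the sliding window.
        while self.window[0].minute <= event.minute - self.window_minutes:
            self.window.popleft()
        # Only alert once the window spans the full period.
        covers = event.minute - self.window[0].minute >= self.window_minutes - 1
        spend = sum(e.spend for e in self.window)
        conversions = sum(e.conversions for e in self.window)
        return covers and spend > 0 and conversions == 0


def first_alert(events, window_minutes: int = 5):
    """Return the minute at which the first alert fires, or None."""
    alert = SpendAlert(window_minutes)
    for event in events:
        if alert.observe(event):
            return event.minute
    return None
```

Fed a stream in which conversions stop at some minute, first_alert fires once a full window of spend with zero conversions has been observed, rather than a week later when someone opens a report.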

The fundamental advances in indexing technology that power real-time analytics

So this goes back to our realization: for real-time analytics to really become a thing, it's not enough to be write-optimized, as in, oh, you can really quickly accumulate data in real-time and do a million writes per second, but all queries still run like a warehouse. A warehouse was already good enough to do that. You don't need another one.

And we really looked at what Facebook did to make its newsfeed real-time, and what Google built to make their search indexes real-time. And in all these kinds of real-time systems, if you look at their storage architecture, real-time indexing is key, right? Like, they are able to continuously get new data, index it before they store it, and basically have a massively scalable, real-time indexing system.
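To make "index it as it arrives" concrete, here is a toy, single-process sketch of indexing on write. It is not Rockset's storage engine: the RealTimeIndex class, its write and find methods, and the choice of a plain in-memory inverted index are assumptions made purely for illustration.

```python
from collections import defaultdict


class RealTimeIndex:
    """Toy inverted index that is updated on every write (illustrative only)."""

    def __init__(self):
        self.docs = {}                     # doc_id -> document
        self.postings = defaultdict(set)   # (field, value) -> set of doc ids

    def write(self, doc_id, doc):
        # If the document is being overwritten, drop its old postings first.
        old = self.docs.get(doc_id)
        if old is not None:
            for item in old.items():
                self.postings[item].discard(doc_id)
        # Index every (field, value) pair as part of the write itself,
        # so the document is searchable as soon as write() returns.
        # Field values must be hashable in this sketch.
        self.docs[doc_id] = dict(doc)
        for item in doc.items():
            self.postings[item].add(doc_id)

    def find(self, **conditions):
        # Answer equality filters by intersecting posting sets
        # instead of scanning every stored document.
        id_sets = [self.postings.get(item, set()) for item in conditions.items()]
        if not id_sets:
            return []
        return sorted(set.intersection(*id_sets))
```

Because indexing happens inside write, a query issued right after a write already sees the new document, and lookups touch only the matching posting sets rather than all of the data.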

So, if you look at Rockset's storage architecture, it resembles a distributed search engine a lot more closely than a distributed database. And we call it converged indexing. But we couldn't just build real-time indexing; we also needed to build a query engine on top of that, a SQL query engine.

And so if you look at our compute engine, that will look like a distributed SQL database, right? Like, it basically knows how to do joins scalably. We have a distributed SQL engine: you submit a query, and it breaks it down into fragments, hundreds of fragments, sometimes several thousand fragments.

Within one millisecond, they get scheduled across your entire distributed system, which is storing all of your data. In batch-based systems, they can take their time; usually it takes multiple seconds. In Rockset, it takes 1.2 milliseconds, and instantly your query is starting to run as fast as it can on all of the hardware.

Again, if you look behind the scenes at what's happening when you type a Google search, that's exactly what's happening. Your search goes through and gets very quickly and efficiently distributed to a massive distributed system. And then the query's results come back as quickly as possible.
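The fragment-and-fan-out pattern described above is often called scatter-gather. The sketch below shows the general idea on one machine, using threads to stand in for nodes. The function names, the row shape with an "amount" field, and the count-and-sum aggregation are hypothetical, and nothing here reflects Rockset's actual scheduler or its millisecond-level timings.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fragment(shard, predicate):
    """One query fragment: filter and partially aggregate a single shard."""
    matched = [row["amount"] for row in shard if predicate(row)]
    return len(matched), sum(matched)


def scatter_gather(shards, predicate):
    """Fan one aggregation out to every shard, then merge the partial results.

    Roughly: SELECT COUNT(*), SUM(amount) FROM t WHERE <predicate>.
    Threads stand in for machines; a real engine schedules fragments
    across the nodes that hold the data.
    """
    workers = max(1, len(shards))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda s: run_fragment(s, predicate), shards))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total
```

Each fragment returns a small partial result (a count and a sum), so the final merge is cheap no matter how much data each shard scanned.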

So it's really this combination of both the storage system looking like a search engine and the compute system looking like a SQL engine coming together. Both of them have developed a lot, right? Like, search engines and the whole understanding of these systems have matured really well.

Looking 10 years out, what do you hope to be true for the data industry?

For entire organizations, end-to-end, the analytics stack becomes real-time.

Just like how people can't imagine Google search or Facebook newsfeed working on batch, that should be how all analytics inside the company happens. Everything should be in real-time, including complex ETL. It's going to come. Whether Rockset exists or not, that is the future that I think is coming. I hope we can accelerate the timeline just a little by reducing barriers to entry.

We will be at a point where everybody takes real-time for granted, not only for their own internal company analytics, but also on every SaaS product.

If you go to a SaaS product that provides any automation within your company, whether it's marketing automation, sales automation, what have you, if it doesn't have embedded real-time analytics, people wouldn't even use it. People wouldn't even buy it.

Somebody said, "I want to build an internal app." And they said, "We have a bunch of Tableau dashboards and all that, and I want to make a web app, because if it's slower than Instagram, nobody in my company uses it."

All of that is what I think is in the future.

Links from the episode

If you’re unfamiliar with RocksDB, the original Facebook Engineering blog post from 2013, when it went open source, is worth a read.

The Analytics Engineering Podcast features conversations with practitioners inventing the future of analytics engineering.

New episodes are published every 2 weeks, along with the companion Analytics Engineering Roundup newsletter.
