Ep 12: Bringing Streaming Data to Analysts w/ DeVaris Brown
Can streaming analytics allow analysts to facilitate better customer (like real, end customer) experiences?
As a product leader at companies like Heroku and Zendesk, DeVaris specialized in building infrastructure-grade products. Currently, as the CEO of Meroxa, he helps enable data teams to build real-time data infrastructure with the same ease as we now take for granted in batch.
In this romp of an episode, Tristan, Julia and DeVaris flow from his experience in tech mentorship, into the nuts and bolts of Change Data Capture (CDC), and how streaming data infrastructure can help data teams provide better end user experiences.
Listen & Subscribe
Key points from DeVaris in this episode:
What does Change Data Capture (CDC) mean and why is it getting a lot of attention?
I'm old enough to have experienced the first dot-com crash, right? I've been in this thing for probably 25 years now, and I just remember doing database backups. That's where we started off. Like, back up your database to a tape or a zip drive or whatever it is, right? And if disaster strikes, you pull out the tape and you replay all of those transactions. 90% of the time, that stuff fails, right?
Funny enough, even with all the technology that we have now, a lot of people are just one mis-configured field away from losing their data for a very long time. So, one of the strategies you can use, instead of running a job every hour on the hour to back up my database from a particular ID, is to literally keep track of every transaction that hits the database on a real-time stream. This is what's called Change Data Capture (CDC). Every time a transaction hits the database, it gets recorded into a log file. You can basically read that log file, transaction by transaction, and then replay each transaction into another destination: a backup destination, a data warehouse, or any of that type of stuff. You're literally just capturing every transaction as it happens versus waiting every hour to "SELECT * FROM orders WHERE id = 'this'".
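Replaying a change log, as described above, can be sketched in a few lines. The event shape below loosely follows the convention popularized by open-source CDC tools like Debezium (an `op` code plus `before`/`after` row images); the field names are illustrative, not Meroxa's actual format.

```python
# Hypothetical sketch: replaying CDC change events into a destination table,
# here just an in-memory dict keyed by primary key.

def apply_change(table, event):
    """Apply one change event: create/update take the 'after' image,
    delete drops the 'before' image's key."""
    op = event["op"]
    if op in ("c", "u"):
        row = event["after"]
        table[row["id"]] = row
    elif op == "d":
        del table[event["before"]["id"]]
    return table

orders = {}
change_log = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "cart"}},
    {"op": "u", "before": {"id": 1, "status": "cart"},
     "after": {"id": 1, "status": "purchased"}},
]
for event in change_log:
    apply_change(orders, event)

print(orders)  # {1: {'id': 1, 'status': 'purchased'}}
```

The point DeVaris makes is that the destination could just as easily be a backup, a warehouse, or another stream consumer: the log is the source of truth, and every consumer replays it.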
The benefit of CDC is that it gives you more granularity. When I write a SQL query, I'm getting the end result; with CDC, I'm getting all of the little things that happened in between. So, if I'm an e-commerce store, I can "SELECT * FROM orders" and JOIN on users to see which users ordered things. And that gives me a good history of, "Okay, well, what are my transactions for the day?" I can write reports on that.
What if I want to do cart retention or reclamation? I want to know who has added things to the cart but hasn't actually completed a purchase. What would I do at that point? I'd go get a vendor. Or I can actually just use the database that I have, because every transaction is getting captured: user added something to the cart — there's a timestamp there — user checked out, user updated inventory, user changed address. You can capture all these things in between and then get the purchase. And once you do that, you can just write a SQL query that says, "Hey, I want to know all the people who have added something to a cart but haven't completed a purchase with this particular session ID".
And you don't have to go spend six figures on a vendor to tell you that; you can just do it yourself. Those are the types of things you get with CDC, versus writing these big batch jobs that move data from one place to the next and are very prone to fail.
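The cart-abandonment query described above is a one-liner once the event stream lands somewhere queryable. Here is a minimal sketch against a toy event log in SQLite; the table and column names (`events`, `session_id`, `event_type`) are made up for illustration.

```python
# Toy version of "who added to cart but never purchased, per session".
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE events (session_id TEXT, user_id TEXT,
                         event_type TEXT, ts TEXT);
    INSERT INTO events VALUES
        ('s1', 'alice', 'add_to_cart', '2021-06-01T10:00'),
        ('s1', 'alice', 'purchase',    '2021-06-01T10:05'),
        ('s2', 'bob',   'add_to_cart', '2021-06-01T11:00');
""")

# Sessions with an add_to_cart event but no purchase event.
abandoned = db.execute("""
    SELECT DISTINCT session_id, user_id
    FROM events e
    WHERE event_type = 'add_to_cart'
      AND NOT EXISTS (
          SELECT 1 FROM events p
          WHERE p.session_id = e.session_id
            AND p.event_type = 'purchase')
""").fetchall()

print(abandoned)  # [('s2', 'bob')]
```

The same query works against a warehouse table fed by a CDC stream; the difference is that the intermediate events exist at all, instead of only the final order rows.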
Do you resonate with the low-code SQL idea? And who are the Meroxa power users?
For us, it has been the analyst or data scientist, that type of person because it is super easy for them to do the data engineering job: where's it coming from? Where's it going? What format does it need to be when it gets there? And we built our platform around that.
So, if I am an analyst, as long as I have credentials, I can just log in, drag and drop, and boom, I'm done. Or I can go to my CLI and just tell Meroxa to connect Postgres to Snowflake, and I'm done.
But the interesting thing is — and I don't know when this is going to air, but let's just say it's very soon — we've realized that a lot of these commercial tools are focused on that audience, but there's really nothing that speaks to engineers from a low-code perspective. Engineers are still trying to figure it out: "Yeah, I might've written all these scripts that are all kind of the same, to help reason about this very high-code environment." So, one of the things Meroxa is working on is "when they go low, we go high" code. We are developing a Meroxa SDK that will give you the ability to build and interact with data pipelines via code, because everybody's doing the low-code thing, right?
I can name 50 different tools right now that are drag-and-drop-and-connect pipelines and things like that. But I think this is kind of table stakes, and from a business perspective, everybody kind of says the same thing — "Oh, we help you build pipelines in minutes, not months" — and it gets kind of boring, right? You don't do the same gag, you gotta do the next thing, right? It has to be an improvement. So, the Meroxa SDK is kinda like that, where we can help analysts and data scientists build real-time pipelines to help them do their jobs better for data analysis, building models, and things like that.
So, you gotta be flexible and serve those audiences. I mean, our whole thing is that the people towards the edge, the people who are actually consuming this data, need to be empowered. The pipes should just be table stakes. So let's build experiences that enable this decentralized execution, because those people are closer to the problem and they know how to use the data. The problem is that the data hasn't always been available to them in a consistent format, right? And that's the foundation that we've built.
Data stores like Rockset and Materialize have different properties. Do you feel your users have an affinity for these different types of data stores?
Absolutely. But it just depends on the team and the use case. So, shout out to Arjun over at Materialize — they're doing great.
If I want to build a real-time data application, just think about the end state, alright? Say I'm a hospitality company, and every time somebody checks into a hotel, I want to give them some personalized offers. I could wait for my ML models to recompute and do all this other stuff, right? Or I can build an active learning pipeline on top of Materialize and kick off these workflows where I'm able to query data in the moment, because now I have this real-time data and it gives me much broader context. The issue is, yes, people want to do both things, but most of the commercial tools are focused on analytics. And that's usually a batch-based thing, right? Like, I can look back 24 hours to figure out what happened, versus, no, I want to know when this thing happens so I can react to it exactly in the moment.
There's actually a mismatch between the technology, what people are offering, and customer expectations. And I think real-time data store and analytics companies haven't really leaned into the use cases, because the tooling has been kind of bad — it's hard to get data in, it's hard to run these automated processes and provide experiences — but customers want to know: "What's relevant to me right now?"
When I was doing the seed round, we were talking to a large hospitality company. They really wanted this: as soon as you check in, we're going to give you an offer. But their processes around this batch thing sometimes would take a couple of minutes, and sometimes a couple of days.
So, think about a weekend in Vegas, right? You get there on the Friday and, depending on how well you do on Saturday, most of the time you leave on a Sunday. Like, 80% of the time their batch job failed, so they couldn't recompute their models. So, by the time the person left on Sunday, it was their neighbor getting the offers.
So just by using us with the real-time data store, they were able to increase conversion by over 20% — just by switching technologies — which resulted in eight figures worth of revenue, right? That's the type of stuff technical folks have to think about, because customers want these experiences.
In data streaming, can you do only simple light transformations? Or are you able to do more tightly coupled to business logic transformations?
We can do both. Because we use Kafka under the hood, the simple thing is a single message transform, right? Basically, the transform gets applied to every single event that comes across a particular Kafka topic. That's table stakes for us.
The thing that we realized was, "Oh, there are some other types of transforms that we can allow people to do." So, now you can write arbitrary bits of code — well, you will be able to once we release the SDK. You can write arbitrary code at that point. So, yeah, I can write business logic, I can bring in other libraries, I can do data augmentation and enrichment, and all these things, because it's literally just code. And we do that on the fly.
For every single event that comes across a particular pipeline, you just write a function that, say, calls Clearbit. You can have a user enrichment function; if I'm doing something that has location data, I can call Iggy and use that stuff. Or, if I just want to dump it into a Materialize database, I can say, "Hey, I want to just write regular SQL." I can take this real-time stream, have Meroxa connect to the Materialize database, and write dbt on those things at any point in time, right? That's easy to do as well, right?
So you can take the code first approach, or you can take a more infrastructure, traditional SQL-based approach.
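The code-first side of this can be sketched as a plain function applied to every record in a stream. This is a hedged illustration, not the real Meroxa SDK API (which wasn't released at recording time); the Clearbit-style enrichment here is a stubbed, hypothetical call.

```python
# Sketch of a per-event ("single message") transform: one function,
# applied to every record flowing through a pipeline.

def enrich_user(email):
    """Stand-in for an external enrichment API such as Clearbit.
    Here we just derive the company domain from the email address."""
    domain = email.split("@")[-1]
    return {"company_domain": domain}

def transform(record):
    """Business logic applied to every event: merge in enrichment fields."""
    record.update(enrich_user(record["email"]))
    return record

stream = [{"email": "ada@example.com"}, {"email": "bob@shop.io"}]
out = [transform(dict(r)) for r in stream]
print(out)
# [{'email': 'ada@example.com', 'company_domain': 'example.com'},
#  {'email': 'bob@shop.io', 'company_domain': 'shop.io'}]
```

The SQL-based alternative DeVaris describes is the mirror image: skip the function, land the raw stream in Materialize, and express the same enrichment as a view or a dbt model.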
Supposing you have a very fast-changing data set, is there a point where something like CDC or Rockset can't keep up with the answers you need?
No. We do hundreds of millions of requests a minute on a pretty beefy box — basically a single box — and if we get beyond that, we can automatically scale. I mean, this is one of the benefits of having run a platform like Heroku and building on that experience. We would do billions of requests a second at Heroku, and my co-founder and a couple of members of my team were the architects of that. So, that's the beautiful part: having that previous experience helped us build this platform today.
So, yes, there are limits — the laws of physics that we have to abide by; we're not violating anything like that — but we built a pretty good system that can automatically handle the traffic wherever it comes from and maintain performance. For the people who are building these apps and doing these analyses, that's not stuff you need to worry about. All you care about is: where's the data coming from? Where's it going? And what format does it need to be in when it gets there? That's the higher-level part that you as a human provide. You need to focus on that part, not how the sausage is made.
I think one of the best marketing jobs in the history of technology is Kubernetes. Why? Because people end up having to care about how the sausage is made if they choose Kubernetes. It's not a great developer experience.
I've invested in startups, and my pre-seed startups are worried about hiring a DevOps engineer before they even have product-market fit, because Google has said "Kubernetes is a thing." Like, you don't need to know all that stuff. And as a data engineer, why do you need to know how to set up a Kubernetes cluster to run all these things, and Docker Compose, and blah, blah, blah?
Again: where's the data coming from? Where's it going? What format does it need to be in when it gets there? If we can get to that, then we can start seeing better experiences, better apps, better analyses, all this type of stuff. As a culture in the organization, that's the north star we need to aspire to: not being more technical, not putting another tool on the stack that somebody has to learn in order to be marketable and productive.
Looking 10 years out, what do you hope will be true in the data industry?
I hope that we are at a point of consolidation — consolidation that's led to higher levels of productivity — where people don't have to flame-war around batch versus real-time. It's just data.
Like, I can literally hit a button and just know that the data I want is going to be in the format that I need, at the granularity that I need, with the provisions that I have. And I'm just able to multiplex that, or distribute it out wherever else I need it.
The thing that I want the data industry to evolve towards is more of a customer-centric mindset, where I am able to deliver what my customers need versus scratching my own intellectual curiosity.
I just want us to get back to being more customer-centered. We have the technology that allows people to deliver more value for their customers versus worrying about infrastructure. I feel the best way to get there is standardization around how data is stored and how data is transported, and then we just need better integrations and things like that. Regardless of my persona, I should have a pretty well-established pathway to help me provide value for my customers.
And right now, we're still kind of trying to figure out who's going to win, right? In 10 years, I just hope that's no longer the question. I just hope people are delivering apps and experiences for their customers that actually resonate.
More from DeVaris: