Ep 6: Caitlin Colgrove (CTO @ Hex) on the Magic of Building Data Apps Instead of Reports

Give your brain a stretch, and imagine the possibilities of a Google Docs-style real-time collaborative workflow for data analysis projects.

Caitlin Colgrove is Co-founder & CTO at Hex, a data workspace that allows teams to collaborate in both SQL and Python to publish interactive data apps.

In this conversation, Tristan, Julia and Caitlin dive into the possibilities that real-time collaborative notebooks unlock for data teams — what if our collaboration style looked more like Google Docs than a Git workflow?

Show Notes

Key points from Caitlin:

What's Hex? Walk us through the problem that it is solving.

So, in some ways we have all this background building analytics tools in general, but in some ways the genesis of Hex is actually a very acute, specific problem, and that is trying to marry the accessibility and shareability of traditional BI (as you know, no one ever has a problem building a Tableau dashboard and sending it to someone, or building an Excel spreadsheet and sending it to someone) with the analytical power of something like a code notebook.

That was basically Barry's previous job, quite frankly (Barry's our CEO). He was literally looking to try to buy something that did this. All of the data teams were working in Python code notebooks, Jupyter notebooks (it was actually a pretty sophisticated data team that had pretty heavy operational work that required a lot of this backing analytics), but the executive team just had no ability to actually consume these things. What are you going to do? Email them a Jupyter notebook?

So they were falling back on things like copying and pasting charts into decks. And we saw this over and over again in the different teams that we talked to. The more sophisticated teams would take their notebook, push it up to whatever their cloud system was, run it on a schedule and dump the results into Snowflake. Probably that was flowing into dbt or something, then downstream was Looker, and that's how they would actually get all of the analytics done. And it just seemed like there had to be a better way to do this.

So, that's how Hex was born. It was really the combination of the realization that these analytics workflows were deeply collaborative at the very heart of them, both between the different analysts and also between analysts and the rest of the company, and that the tooling to do that just really hadn't caught up with today's modern code-based analytical workflows. And so really that's the problem that we're trying to solve. 

What is it about the notebook framework that you think has so much staying power? Why has it been so popular over the years?

I think code notebooks are probably the best tool out there today for a bunch of workflows. Basically, for this rapid, interactive, iterative analysis, you have Jupyter, you have things like RStudio, you have Mathematica; you have a bunch of these tools that were basically developed around this, some of the pioneers in this space, that have these huge user communities around them. And, quite frankly, they're the best things out there, particularly for the exploratory part of that. I think that's going to remain true for a while, but I do think that there's a class of workflows that has been emerging, especially over the last couple of years, and we see them growing into the future.

While notebooks are good for a certain set of things, the existing technologies today don't necessarily map exactly onto what people want to do. I think a good example would be something like cloud data warehouses. If you're using a Jupyter kernel on cloud data warehouse scale data, I think you're going to have a lot of problems doing that. So, trying to think about like "okay, well, how can we take this notebook paradigm, which is so powerful, so fast and so flexible, and move that a little bit into these sort of technical analyst workflows that we're seeing a lot today?"

Notebooks get a bad reputation sometimes because they're not really collaborative. So what exactly are you doing in Hex to overcome this problem?

Yeah. I mean, there are a lot of things that we've built for this. I'll just highlight a few of them and I'm happy to talk about any other ones that might be interesting to you or that you've sort of seen in the product. One thing I do want to highlight is versioning and version control, which is something that is notoriously difficult for notebooks.

We have basically a first implementation of that by default in the product. It's fairly simple and straightforward. It doesn't have all the bells and whistles of Git, but it basically gives you things like the ability to roll back and the ability to do what we call "publishing", which is basically merging to master; that's effectively how it's implemented. So you have some of these basic controls around what is actually live, some contract around not losing all of your work, and things like that.
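To make the roll back and publish controls described above concrete, here is a minimal sketch of a linear version history. It is purely illustrative: the `NotebookHistory` class and its method names are hypothetical, and the interview does not describe how Hex actually stores versions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NotebookHistory:
    """Hypothetical linear version history for a single notebook."""

    versions: List[str] = field(default_factory=list)  # every saved snapshot, oldest first
    published: Optional[int] = None  # index of the snapshot that readers see

    def _check(self, version: int) -> None:
        if not 0 <= version < len(self.versions):
            raise IndexError(f"unknown version {version}")

    def save(self, content: str) -> int:
        """Store a new snapshot and return its version number."""
        self.versions.append(content)
        return len(self.versions) - 1

    def publish(self, version: int) -> None:
        """Mark a saved snapshot as the live one."""
        self._check(version)
        self.published = version

    def rollback(self, version: int) -> int:
        """Restore an old snapshot by saving it again, so no history is lost."""
        self._check(version)
        return self.save(self.versions[version])

    def live(self) -> Optional[str]:
        """Return the published content, or None if nothing is published yet."""
        return None if self.published is None else self.versions[self.published]


history = NotebookHistory()
first = history.save("select count(*) from orders")
history.save("select count(*) from orders where status = 'paid'")
history.publish(first)           # readers still see the first version
restored = history.rollback(first)  # the draft goes back to the first version
```

Rolling back by appending a copy, rather than deleting newer snapshots, is one simple way to keep the "you're not losing all of your work" property.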

I think there's a couple different models of collaboration, and that's more on the asynchronous side of collaboration. We also have commenting and things for review and pull request type workflows. And then, on the other side, one of the other major features that we have, which I think gets more to the liveness and sharing, is real-time collaboration.

We can talk a lot about the details of real-time collaboration, what it's useful for and what it's not useful for. But the fact of the matter is in 2021, people just expect as a baseline that anything that's hosted in the cloud is going to be syncing between different computers and editable by multiple different people at the same time — Google Docs has been around long enough so that's just sort of a baseline standard. So we actually built that in from the get-go in terms of our architecture. It's all based around this kind of real-time collaboration framework that we have.

Our real-time collaboration is more about editing other parts of the document at the same time. So let's say that you're working in one cell and then I come in and I add a cell at the bottom. That works just fine. Or even if we're in a meeting and we're talking about this notebook and I change something, it will show up on my screen as well as your screen.

It turns out that the tech there is exactly the same as the tech that you would use for multiplayer editing. A lot of the syncing stuff can just be reused there. So, that's the level of multiplayer editing that we have and support. And we also have things like being able to see that someone's editing a cell right now, and a few things like that.

But yeah, I agree that some of the really simultaneous stuff like the type of stuff that you see in Google Docs, where you have two cursors in the same paragraph, we actually don't support that, partly for technical reasons and partly for user experience reasons because we don't actually think people should be really doing that. 

There are multiple people working on the same Hex app at the same time, and that's difficult to pull off technically. How hard was implementing that?

Yeah, that's right. It's quite challenging to build and it's even more challenging to build after the fact. 

So, pretty early on, we knew that if we ever wanted to do this, we had to build it from the get-go. And this actually wasn't even the very start of the code base, it was maybe like six months in. So we already had a fair amount of stuff that we'd written. 

And yeah, in designing that, the tricky thing about a lot of these algorithms is a couple things. So, there's a few different ways of doing this out there. 

There's one algorithm for doing real-time collaboration that's called Operational Transformation, or OT (Google actually sort of pioneered this model with a lot of their Google Docs work), and then there's a more recent model called Conflict-Free Replicated Data Types, or CRDTs. And in a lot of ways, you're starting to see a lot of things based off that (Figma does something around this, for example), and a lot of the newer stuff is using these.
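For readers who haven't met CRDTs, a grow-only counter is one of the smallest textbook examples. The sketch below is a generic illustration, not anything from Hex or Figma; the `GCounter` class and the replica names are made up for the example. Each replica only increments its own slot, and merging takes the per-replica maximum, so replicas converge no matter the order or number of merges.

```python
from typing import Dict


class GCounter:
    """Grow-only counter, a small textbook CRDT (illustrative only)."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}  # replica id -> increments seen from it

    def increment(self, amount: int = 1) -> None:
        """Record local increments in this replica's own slot."""
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state by taking the per-replica maximum."""
        for replica, count in other.counts.items():
            self.counts[replica] = max(self.counts.get(replica, 0), count)

    def value(self) -> int:
        """Total across all replicas."""
        return sum(self.counts.values())


laptop, phone = GCounter("laptop"), GCounter("phone")
laptop.increment(2)
phone.increment(3)
laptop.merge(phone)
phone.merge(laptop)
print(laptop.value(), phone.value())  # prints "5 5"
```

Because the merge is commutative, associative and idempotent, no central server has to decide an order of operations, which is the main contrast with OT.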

And part of the reason this was challenging was that we did a whole big set of technical research around "what are the technologies we're using?" (like GraphQL and React, and various other things), "what's the support for OT libraries out there, and what's out there for CRDTs?".

And having used OT in the past, it's very verbose. It sort of has a lot of baggage with it — at least the implementations that I've used. But a lot of the stuff that we were seeing with CRDT was just not quite ready for prime time is sort of how it felt. And plus, on top of that, nobody had ever really done anything with Apollo and GraphQL, which was our main state management framework. 

So all of that is to say that we had to do something a little bit novel there, looking at a bunch of different prior art from Figma and Observable, and a bunch of these other places. And we ended up with something that was a little in the middle. It kind of takes some of the aspects of OT and some of the aspects of CRDT, simplifies them a little bit, so we give up things like multi-cursor editing but we gain the ability to reason about the state of the system much more easily. And, yeah, we actually went and then implemented that and we've been building on it ever since.
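The interview doesn't spell out the design, so the following is only a hypothetical illustration of the trade-off being described, not Hex's algorithm. It assumes a central server that puts whole-cell operations into a single ordered log; the `Notebook` class, the operation names and the `replay` helper are all invented for the example. Edits replace a cell's full text, so two cursors in one cell are not supported, but any client that replays the same log reaches the same state, which makes the system easy to reason about.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Op = Tuple[str, str, str]  # (kind, cell_id, text); kind is "add", "set" or "delete"


@dataclass
class Notebook:
    """Hypothetical notebook state built from an ordered log of cell-level ops."""

    cells: Dict[str, str] = field(default_factory=dict)  # cell id -> source text
    order: List[str] = field(default_factory=list)       # cell ids, top to bottom
    log: List[Op] = field(default_factory=list)          # ops in the order applied

    def apply(self, kind: str, cell_id: str, text: str = "") -> None:
        if kind == "add":
            if cell_id not in self.cells:
                self.order.append(cell_id)
            self.cells[cell_id] = text
        elif kind == "set":
            # Whole-cell replace; an edit to a cell that was deleted is ignored.
            if cell_id in self.cells:
                self.cells[cell_id] = text
        elif kind == "delete":
            if cell_id in self.cells:
                del self.cells[cell_id]
                self.order.remove(cell_id)
        else:
            raise ValueError(f"unknown operation {kind!r}")
        self.log.append((kind, cell_id, text))


def replay(log: List[Op]) -> Notebook:
    """Rebuild a notebook by applying a log from the start, as a client might."""
    notebook = Notebook()
    for kind, cell_id, text in log:
        notebook.apply(kind, cell_id, text)
    return notebook


server = Notebook()
server.apply("add", "c1", "select * from orders")
server.apply("add", "c2", "df.head()")       # a second person adds a cell at the bottom
server.apply("set", "c1", "select * from orders limit 10")
client = replay(server.log)                  # the client ends up with the same cells and order
```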

I think in some glorious future, I would love to open source this library, because I think it could be so useful for anybody who's using GraphQL, which is not natively real-time by any stretch of the imagination. The idea is that it's kind of plug and play, but it's not there yet. We'd love to do that at some point.

What's the magic behind Hex? What does it unlock when you build one of these more interactive experiences?

So, I think that there's a couple places where I see magic when we show this to people for their first time or when they've been using it for a while. The first thing I would call out is in the building of the dashboard (or not dashboard but app, which I guess is a more accurate term for describing some of these things): it's just the ability to go hook up to your data, write a couple of lines of code, put it in a thing and then ship that out within 10 minutes. And people who have never seen anything like that are just like "oh my God, I spent five hours making this, like, a slide deck, and now I can do that in 30 seconds".

And then, on the other end, I think part of the problem with dashboards in some ways is that they don't actually integrate super well into the actual work to be done. Like, you look at the dashboard and you're like "okay, well, here are the numbers and now I need to still go and do a whole bunch of other things based on those numbers". And so they take some chunk of the workflow out, but they're not actually solving your problem. 

Whereas, if you build a data app, you can actually build in and automate a much larger swath of the workflow that comes out of the analytics that you're doing. And that's where you get some of these magic things.

Like, people who are a one-man show on their data team are able to build out a whole bunch of these complex data workflows because they can automate them in Hex in ways that they wouldn't have been able to if they'd just been emailing someone a dashboard, where that person is then like "okay, based on X, Y, and Z I need all of these other things and you have to go do them yourself".

It feels like we're at the beginning of something new here with Hex. Is this a new category that you're defining, like the collaborative data app?

Yeah, I do feel like it is a bit of a new category. I think it's similar to the way you all have been talking about analytics engineers and analytics engineering as a new discipline. It's not like any specific part of that doesn't have its roots in other things, but this category of work didn't exist 10 years ago. And it has rapidly developed into this broad discipline.

I do think that there's a similar change happening on the analytics side. So not just in the data engineering, the pipeline and the transformation side, but also in analytics, which is maybe just a year or two or three behind in how it's evolving. And that is basically, as I was saying earlier, using code-driven tooling to do the analytics workflows that before would have been point and click (they would have been in Tableau, Looker, Excel, all of these things) and moving those into more of an engineering discipline.

I certainly haven't seen a lot of things out in the space that are conceptualizing it like that. And I do think there are products that do great jobs at slices of this workflow. Like, there are great collaborative notebooks out there, and there are great SQL editors out there. But thinking about it as a discipline and thinking about it as a set of really connected workflows is something that we think a lot about at Hex, and I haven't heard a lot of other people conceptualizing it in the same way. 

And I do think that leads us to a lot of really interesting opportunities to innovate in that space. That means looking a little bit beyond what we're seeing in existing tooling, and beyond stapling together a bunch of existing projects, and thinking about "okay, well, if you were to reinvent from scratch what this population of users actually needs and wants, what would that actually look like?". And that's some of the things that I'm most excited about looking at in the coming months and years.

More from Caitlin:

You can find her on Twitter @crcolgrove, and check out Hex at hex.tech.