Discover more from The Analytics Engineering Roundup
Ep 8: Seth Rosen (Founder @ TopCoat + HashPath) on Becoming a Full-stack Data Analyst
Seth has broken data Twitter many times, and in his early-fatherhood sleep deprivation developed a wonderful Twitter persona as the battle-tested data analyst.
IRL though Seth is a serious data practitioner, and as Founder at the data consultancy HashPath has helped dozens of companies get into the modern data stack + build public-facing data apps.
Now, as the founder of TopCoat, he’s empowering analysts to build + publish those same public-facing data apps.
In this episode, Tristan, Julia & Seth graciously dive into spicy debates around data mesh + “dashboard factories”, and explore a future where data analysts become full-stack application developers.
Listen & Subscribe
Key points from Seth…
When you started HashPath, what was going on in your head at the time? Why did you do that?
I think that I was a little bit less deliberate than that, in the sense that I came from product management and data. I'd really split across the two throughout my whole career. When I was starting consulting, and honestly as a freelancer in the early days, I was really straddling product and data. It was somewhat of a unique area of focus for me.
So I started to really work with companies that were trying to use data externally. Maybe they were building something for customers, maybe they were trying to do data monetization, and it was really the intersection between data and product. And it just seemed like, "Hey, there's really something here. Companies are struggling with this. They want to be doing more." So I hired our first person, an analyst (she's just phenomenal), and then I convinced my brother to leave his job at Oracle. He'd been there his whole career and I said, "Hey, Josh, come on board, let's do this. There's a real opportunity here."
I joke that when our parents visit us, it's a board meeting because we can kind of complain to our parents about each other.
A lot of the work you did at HashPath was custom work. Did you find that every company was so unique that they ended up needing to go through a custom route?
Yeah, totally. I mean, a lot of companies were at a very early stage in kind of setting up any sort of data stack, right? Even with regards to internal analytics, they were kind of not there yet. So, they would come to us with a specific problem, and even for internal analysts they'd say, "Hey, my data's messy. People can't access it. We don't trust it". And we would go back to them and say, "Okay, well, I think this is what you have to do". And I think most people who listen to this podcast know that, in a kind of modern stack, there's a ton of tools that you need. And we would go back to them and say, "Hey, actually, you need a warehouse and actually you need a tool to put data into that warehouse, and you'd better have a tool to transform that data so you can make it analytics ready. And then we also think maybe you should change your BI tool, right?". And they'd be like, "Wow, that's a whole lot of tools that you're recommending for us".
So, one of the things we did is we kind of tried to convince people that actually the tooling that exists now is the easy part, right? The hard part is “what are the analytics you have to build?” What's the analysis you have to do?
We ended up doing this quick little blog post around how to set up a data stack in under an hour, just because we were getting a lot of folks saying, "Well, that sounds like a lot of work. Do we really need that much infrastructure to do what we want to do?" And we'd say, "No, actually, luckily now that's the easy part, and the hard part is figuring out exactly what you need to do with that data and how to present it. That's the real work. Luckily now, the tooling on the infrastructure side gets people a lot of the way there for kind of setting up their initial stack."
What is a data mesh? Why should people care about it?
Yeah. I mean, I'm kind of laughing because I'm an interesting person to ask this question. I think, like a lot of people, I'm trying to understand it. A lot of the concepts resonated, especially as someone coming from product management: the idea of having decentralized ownership related to some particular part of the business, and the idea that a team can own something, treat it as a product, and work independently, rather than having everything totally centralized in this monolith-type product or team where there's a huge bottleneck, right?
So, the organizational concepts related to data mesh I think are awesome and make a ton of sense at a certain size organization. What I was trying to do with my tweet, I was basically thinking, well, from an organizational design perspective, it makes a lot of sense. But knowing what I know about the existing modern data stack and the tools we all know and love, how could we actually get this to work using the tools we know, right?
If we kind of bought into those principles (and I don't pretend at all to be a data mesh expert, but based on my understanding), if we were trying to implement this decentralized approach where each team owns their own data end to end and treats their data as a product, how would you actually do it with the existing tooling? And so I was thinking, "Hey, maybe it would just be that every team would have their own dbt project, right? They would all work in their own repository, they would maybe be using dbt packages that communicate with each other, and they'd end up building essentially their own data marts in Snowflake. They would control them, they'd own the documentation, they'd treat it as a product."
So I really just put the tweet out there, and one of my questions was, "Hey, can you do data mesh with this existing data stack?" I learned a lot from people's responses. I think that's actually one of the most underrated parts of Twitter: just hearing people's responses. And I think, overall, it just blew up on Twitter because everyone's trying to understand it together, and there are a lot of concepts that are complex.
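To make the "one dbt project per team" idea a little more concrete, here is a minimal Python sketch of the dependency picture Seth describes. It is an illustration only: the project names are hypothetical, it does not call dbt at all, and it simply treats "project A imports project B as a package" as an edge in a graph so that a valid build order can be worked out.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical team-owned dbt projects, each mapped to the projects it
# imports as packages. These names are illustrative, not from the episode.
PROJECT_DEPENDENCIES = {
    "core_sources": set(),
    "finance_mart": {"core_sources"},
    "marketing_mart": {"core_sources"},
    "exec_reporting": {"finance_mart", "marketing_mart"},
}


def build_order(dependencies):
    """Return an order in which every project comes after its packages."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # err.args[1] holds the list of projects that form the cycle.
        raise ValueError(f"circular package dependency: {err.args[1]}") from err


order = build_order(PROJECT_DEPENDENCIES)
print(order)
```

The point of the sketch is the constraint rather than the tooling: decentralized ownership still needs the cross-team dependencies to form an acyclic graph, otherwise no team can build first.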
What does it take for a company to become a dashboard factory? What does that mean? And is there such a thing as too many dashboards in a company?
Yeah, I've used the term dashboard factory and, generally speaking, it's got a negative connotation. I think that data teams really kind of fall into three buckets, right? There's the data team that is just constantly turning out dashboards, continuously: new request, new dashboard, new request, new dashboard. That's kind of the dashboard factory, right? Their first gut reaction is "I'm going to spin up a new dashboard." Now, I actually think that if you're in that mentality, there are ways you can do that right.
You need to make sure that when things break on various dashboards, you can keep track of where everything is, and you need to find a way to share logic across dashboards so you're not duplicating things in the dashboard layer, right? So bucket one is the dashboard factory.
And then there are teams that say "We're not going to touch dashboarding. We are going to stop at the self-service layer. Like, our job is to build a Looker Explore, our job is to build out this self-serve product." And then, "Hey, business user or hey, stakeholder, go and do your own dashboarding." They've kind of taken themselves totally out of dashboarding.
And then there's the hybrid, which I really think is the best, though it's just my opinion. There are the company's most important dashboards, and for those, the data team just totally obsesses over them, right? Like they think about exactly what should be on them, they think about data presentation, they think about what the right goals are to display alongside this metric. They own those and they treat them like their product and they're putting them out there for their company. And, yeah, sure, there's a place for other dashboards about what's happening within the company.
But those are really the three camps. So when I talk about dashboard factories, if you're going to have one of those, you'd better really talk about quality assurance like you would in a normal factory. And if you're just turning out dashboards without that quality assurance, well, that's just not a good place to be.
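As a rough illustration of the "share logic across dashboards" and quality assurance points above, the following Python sketch defines each metric once and has dashboards reference metrics by name. Everything here is hypothetical (the metric names, the dashboard names, and the two helper functions); it is not how any particular BI tool works, only one way such a check could look.

```python
# Hypothetical shared metric definitions, written once and reused.
METRICS = {
    "revenue": "sum(order_total)",
    "active_users": "count(distinct user_id)",
}

# Hypothetical dashboards that reference metrics by name instead of
# copying the logic into each dashboard.
DASHBOARDS = {
    "exec_overview": ["revenue", "active_users"],
    "growth_weekly": ["active_users", "signup_rate"],
}


def undefined_metrics(dashboards, metrics):
    """Map each dashboard to the metric names it uses that are not defined."""
    problems = {}
    for name, used in dashboards.items():
        missing = [metric for metric in used if metric not in metrics]
        if missing:
            problems[name] = missing
    return problems


def dashboards_using(metric, dashboards):
    """List the dashboards to recheck when a shared metric changes or breaks."""
    return sorted(name for name, used in dashboards.items() if metric in used)


print(undefined_metrics(DASHBOARDS, METRICS))
print(dashboards_using("active_users", DASHBOARDS))
```

A check like `undefined_metrics` could run before a dashboard is published, and `dashboards_using` answers the "where is everything" question when a definition breaks.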
Tell us a little bit about what TopCoat is.
And one of the things that we found in our consultancy, and we talked a lot about this, is this notion of needing to choose between something that's fairly unreliable, kind of rigid, ugly, and then having a software engineer build you something totally custom. And so what TopCoat does is it really gives the analyst the ability to build these data applications.
And I think about what dbt did: it turned someone who knows SQL into this fully high-powered data engineer who could unlock themselves, who could start writing modular code, who could start using source code control, who could write their own tests, right? Really transforming that role and giving them these superpowers that they didn't have before.
So, when I think about TopCoat, it's whether we can channel some of that, right, and say, "You can extend that, you can go all the way to the front end, you can build this awesome data application just being a data professional." That's really what we're trying to do, and we're seeing all sorts of really awesome use cases as part of that.
How does TopCoat integrate with dbt?
Yeah. So, one of the main things we're doing right now is treating the Git repository that you use to build TopCoat as an extension of your core dbt repository, right? So you can do all of the dashboarding, you can let the user filter, and you can represent all of that in code.
But what that means is, because we're building on top of dbt, you actually extend your DAG all the way through to the dashboard, right? All the way from dbt sources, through to your dbt models, into the analysis that you're doing in TopCoat, into the dashboard that you're building in TopCoat. You can actually see that full DAG end to end, and you can see it in TopCoat, and then you can also backwards import it into dbt as a package, right?
So, what that really allows you to do is treat this whole system as one, right? You can make dbt changes in your data mart, you can make the associated changes in TopCoat, and you can deploy that whole system as one thing together, right? Rather than trying to deploy it ahead of time and then fix your dashboards or duplicate your dashboards, you're actually treating everything end to end, from data source to dashboard, as one thing, and you can put it through your CI/CD pipeline the same way you would your dbt repo. So there are a lot of benefits to having the two products really tightly integrated with each other.
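One practical payoff of a DAG that runs from source to dashboard is impact analysis: given a changed model, which dashboards need to ship with it? The Python sketch below shows the idea on a small, made-up lineage graph. The node names and both functions are hypothetical; this is not TopCoat's or dbt's API, just a plain breadth-first walk over lineage edges.

```python
from collections import deque

# Hypothetical lineage edges: each node maps to the nodes built from it.
LINEAGE = {
    "source.orders": ["model.stg_orders"],
    "model.stg_orders": ["model.fct_revenue"],
    "model.fct_revenue": ["analysis.revenue_by_month"],
    "analysis.revenue_by_month": ["dashboard.exec_overview"],
    "source.users": ["model.dim_users"],
    "model.dim_users": ["dashboard.exec_overview", "dashboard.user_growth"],
}


def downstream(node, lineage):
    """Return every node reachable from `node`, walking breadth-first."""
    seen = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def affected_dashboards(changed, lineage):
    """List the dashboards that sit downstream of a changed node."""
    return sorted(
        name for name in downstream(changed, lineage)
        if name.startswith("dashboard.")
    )


print(affected_dashboards("model.stg_orders", LINEAGE))
print(affected_dashboards("source.users", LINEAGE))
```

In a CI/CD setting, a lookup like this could decide which dashboards to rebuild or test alongside a given model change, which is the "deploy the whole system as one thing" idea in miniature.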