Ep 5: Erik Bernhardsson on the Missing Tool in the Data Team's Toolbox
Erik wrote that we’re in for a “reorganization of the factory,” a shift in the way we build data products. What should be reorganized, and where will it take us?
Erik Bernhardsson spent six years at Spotify, where he contributed to the first version of the music recommendation system. After a stint as CTO at Better.com, he’s now working on building new infrastructure tooling for data teams.
In this wide-ranging conversation with Tristan & Julia, Erik dives into the nuts and bolts of Spotify’s recommendation algorithm, (paradoxically) why you should rarely need to use ML, and the fundamental infrastructure challenges that drag down the productivity of data teams.
Key points from Erik on his time at Spotify and opportunities for improvement in data infrastructure.
Did you have any idea what Spotify would become when you joined, or did you just really like music?
I had no idea what Spotify was going to be. The only thing is that I actually loved using it, but a lot of people at Spotify didn't really love music. I felt like I was kind of an oddball there. When Spotify was around 500 people, I think I was number one among employees in terms of listening to music at some point.
But why did I join Spotify? The real reason I joined it was just that there were a bunch of people from my school that were all super smart and I just wanted to work with them. And I felt like "I don't know, maybe this is going to work out, maybe not, but either way, I'm going to work with smart people."
That seemed like a desirable thing in itself. And then it did work out, so I'm very grateful for that. But that wasn't really like part of the equation when I joined, so I guess it was lucky.
Do you think we're arriving at the ultimate solution for orchestration, or is there yet another wave coming?
I'm thinking more generally about infrastructure as a whole, not workflow scheduling specifically, but I feel like workflow scheduling is sort of symptomatic of this.
When I look at all of these tools, infrastructure is just an annoying thing you have to do in the end. You write the code, you have these scripts running locally on your computer, and you run them in sequence and build them without thinking about workflow scheduling at all. And then, in the end, you're like "oh, I gotta productionalize it. I gotta figure out Docker, containers, Terraform, Kubernetes, Airflow, or whatever."
This to me reflects poorly on all sorts of infrastructure - the fact that it's like an annoying thing in the end. Why couldn't it be something that helps us build things in the first place?
And maybe that's something that dbt does well. I actually get the feeling that people aren't first writing SQL queries in a text editor, and later copying and pasting them into dbt.
They write them as a part of dbt from scratch because they feel like they're getting more stuff done doing so. And to me that's a sign of a framework. It helps people get more productive.
I think that this is something that generally applies to infrastructure as it exists today, especially in the data world. Infrastructure still is a thing that happens at the end, and I think what infrastructure has to become is something that actually helps engineers get more productive.
To me, that's the next generation of data tools, whatever solves for that.
Could you share a bit of background on your contribution to the Spotify prediction model?
That's kind of a small thing. I built basically the first version of Spotify's music recommendation system, and from what I hear, large parts of my code are apparently still running in it. I guess the big idea with the Spotify music recommendation system was to do what's called matrix factorization.
The whole idea is you're going to embed everything in vectors. And when I started doing that many years ago, there weren't really good tools to work with those vectors. So I ended up writing this C++ library with Python bindings for high-dimensional nearest neighbor search.
That was something I open sourced many years ago, and I think Spotify still uses it and a bunch of other people still use it. That was kind of a fun little open source project, but ultimately it's a small component in a much larger machine.
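The idea described above, factorizing listening data into vectors and then searching for nearest neighbors among those vectors, can be sketched in a few lines. This is a toy illustration, not Spotify's code or the open source library Erik mentions: the data, function names, and hyperparameters are all made up for the example, and the neighbor search is brute force rather than the approximate index a real system would use.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def factorize(ratings, n_users, n_items, k=2, steps=500, lr=0.05, reg=0.01, seed=0):
    """Learn user and item vectors so dot(user, item) approximates each observed value."""
    rng = random.Random(seed)
    users = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    items = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            err = r - dot(users[u], items[i])
            for f in range(k):
                uf, vf = users[u][f], items[i][f]
                # Stochastic gradient step with a small L2 penalty.
                users[u][f] += lr * (err * vf - reg * uf)
                items[i][f] += lr * (err * uf - reg * vf)
    return users, items

def squared_error(ratings, users, items):
    return sum((r - dot(users[u], items[i])) ** 2 for u, i, r in ratings)

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def nearest_items(items, query, n=1):
    """Brute-force search: rank every other item by cosine similarity to the query item."""
    scored = [(cosine(items[query], v), j) for j, v in enumerate(items) if j != query]
    scored.sort(reverse=True)
    return [j for _, j in scored[:n]]

# Hypothetical data: users 0-1 play tracks 0-1, users 2-3 play tracks 2-3.
ratings = [(u, i, 1.0 if (u < 2) == (i < 2) else 0.0)
           for u in range(4) for i in range(4)]

# steps=0 returns the untrained starting vectors (same seed), for comparison.
before = squared_error(ratings, *factorize(ratings, 4, 4, steps=0))
users, items = factorize(ratings, 4, 4)
after = squared_error(ratings, users, items)
neighbors = nearest_items(items, query=0)

print(before, after, neighbors)
```

Tracks played by the same listeners end up with nearly parallel vectors, which is why a nearest neighbor lookup on the item vectors can serve as a "similar tracks" query.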
Was a feature store a part of that process?
No. I feel that feature stores are a later thing. I'm actually a big proponent of feature stores, but features, to me, tend to be more like "what's the user's age? How many items did they have in the shopping cart?" Those are what I think of as features, whereas what I was doing was more like vectors.
I guess they're sort of related in a way, and I think probably vector models can benefit from feature stores.
I don't think there are any prevalent tools these days, but it's clearly something a lot of companies want. And I think a lot of what holds us back, especially in real-time machine learning, is in many cases not training the models. Training models is relatively easy in many cases: you just take some data in a data warehouse and train on it. The hard part is how you productionalize with real-time predictions that are updated online. And that's where I think feature stores are super important.
At Spotify, we kind of cheated the whole thing. We just computed the music recommendations every night, based on whatever data we had in the data warehouse, which is kind of cheating. But as soon as you want to start doing anything in real time or near real time, feature stores are really critical.
We didn't do any of that at Spotify, but I think it's a very interesting space. I think it's going to be interesting to see Tecton and a couple of other people working on it. So I think it's an interesting area. I'm keeping an eye on it.
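The nightly batch approach described above can be sketched as a job that scores everything once and a serving path that is just a lookup. This is a minimal illustration under made-up assumptions: the vectors, names, and functions below are hypothetical and do not reflect Spotify's actual pipeline.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nightly_batch(user_vectors, item_vectors, n=2):
    """Score every item for every user once and keep the top n per user."""
    table = {}
    for user, uvec in user_vectors.items():
        ranked = sorted(item_vectors,
                        key=lambda item: dot(uvec, item_vectors[item]),
                        reverse=True)
        table[user] = ranked[:n]
    return table

def serve(table, user):
    """Serving is a plain lookup; it sees nothing newer than the last batch run."""
    return table.get(user, [])

# Hypothetical vectors, standing in for whatever the warehouse produced overnight.
user_vectors = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
item_vectors = {"track_a": [0.9, 0.1], "track_b": [0.2, 0.8], "track_c": [0.6, 0.5]}

recommendations = nightly_batch(user_vectors, item_vectors)
print(serve(recommendations, "alice"))
print(serve(recommendations, "carol"))  # a user who signed up after the batch ran
```

The tradeoff is visible in the last line: anything that happened after the batch ran, such as a new user or a new listening session, is invisible until the next run, which is the gap that real-time features are meant to close.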
You're a machine learning person. So why say that ML isn’t important for most startups?
I think maybe part of that is because I did all this machine learning stuff. So I've been saying this like semi jokingly that part of the benefit of knowing a lot of machine learning is that I know all the stuff that machine learning is not very good for. And I think you should use basic stuff.
Maybe the other reason is I kind of got ML out of my system. I did a lot, so I never felt the desire to make things more complicated than they have to be when there's a simple SQL query that does the same. But I don't know if it's a paradox necessarily - I think it's a whole toolbox.
I've had a long career. I've learned a lot of different tools in the toolbox, and I'm glad that occasionally there's some machine learning thing that comes up and I can train a little model that predicts something. But in a lot of cases that's not the right tool, and I think that's the most important thing. Don't get too obsessed with the tool, focus on the goals.
Looking out 10 years into the future, what do you hope to be true for the data industry?
I think the most important thing is just like engineers and data teams being a lot more productive. And what is it gonna take to get there? I don't know, but I think a lot of it has to do with infrastructure and tools.
I think engineers and data scientists are spending so much time configuring stuff, provisioning things and waiting for things to finish, or building things that 35 other companies have already built. And to me, that's like a massive waste of human potential.
So a lot of what I hope to see in the next 10 years is tools that people can just buy and use off the shelf that abstract away all that stuff. So people can focus on what is actually useful, which is the right business logic, right? Helping the business get closer to it.
More from Erik
We highly recommend reading Erik’s blog at erikbern.com.
You can also find him on Twitter @bernhardsson.