Airbnb's Metrics Store. Plotting in Code. Systems Design (and Company Strategy!) in ML. [DSR #251]
This week's best data science articles
Following up on the “metrics layer” / headless BI post from last issue, it just so happens that Airbnb has published the first major update on their progress towards this architecture in (I think?) four years! It’s a fantastic post, and if this is a topic you’re interested in, I highly recommend reading it.
The one thing that I think is not reckoned with in this post is that Airbnb has essentially built the entire end-to-end data stack on their own using internal engineering resources. I’m certainly not saying that this is a bad decision for Airbnb, but it is noteworthy insofar as not every company is Airbnb, and software maintenance is expensive. For most companies facing the build-vs-buy question, I think adopting best-of-breed modular solutions (open source or commercial) is the way to go.
Which gets to what is so hard here. You can build a data catalog that’s aware of your internally-built metric system and integrates with it tightly—same for an A/B testing tool, BI tool, etc. But you can’t really buy one off the shelf with this same property. And since the “metrics layer” sits right in the middle of everything (see the “Minerva Data API” box on the diagram above), integrations with the rest of the stack are critical. If we wanted to solve this in a more off-the-shelf way, how do we get an entire ecosystem to consolidate around a standard?
A simple framework for founders & investors to reason about the three types of defensible ML companies.
Such a useful framework, with lots of solid thinking hanging off of it. Highly recommended.
Very, very into this train of thought. I have long felt that the exploratory process should be more natural to do in code than in GUIs, but I’ve been really unhappy with my work in frameworks to date. They haven’t seemed to enable the rapid, natural iteration that I want. I need to spend some time and play around here…
This is very interesting. The problem: how do you create a system that makes predictions from both a) image data, and b) associated dimensional data contained in a data warehouse, all the while preserving data privacy of the raw images? The answer: pass the images through a neural network to create embeddings, store those embeddings in the warehouse to join them with the rest of the data, and then train another neural net for the prediction task.
This is in and of itself an interesting solution, but I was particularly taken by the meta-point the author makes in the conclusion:
Solving for this specific ML problem, given the constraints we were working with, involved splitting up what is generally viewed as an ML-only problem (fine-tune a model) into a system design problem. That’s the biggest lesson we’re taking forward, and we regularly ask this question today: how can we break up a big ML problem into smaller, more manageable components?
I’m really so interested in this thread. Even five years into curating this newsletter, most content written in the space is about “how to solve an ML problem” and not “how to build an ML system.” I think the latter is the bigger bottleneck to getting more ML deployed in production, and I would love to see it get more attention.
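To make the "embed, join, then train" decomposition from the post concrete, here is a minimal pure-Python sketch. Everything in it is an assumption for illustration: the fixed random projection stands in for a pretrained image network, the dictionaries stand in for warehouse tables, the `price` column and the labels are made up, and a logistic regression stands in for the second neural net. It is not the author's implementation, only the shape of the system.

```python
import math
import random

random.seed(0)

EMBED_DIM = 4  # hypothetical embedding size
PIXELS = 16    # hypothetical flattened image size

# Stand-in for a pretrained image network: a fixed random projection + tanh.
PROJECTION = [[random.gauss(0, 1) for _ in range(PIXELS)] for _ in range(EMBED_DIM)]

def embed(image):
    """Map raw pixels to a small embedding; raw pixels never reach the warehouse."""
    return [math.tanh(sum(w * p for w, p in zip(row, image))) for row in PROJECTION]

# Toy data: raw images live outside the warehouse, tabular rows live inside it.
images = {i: [random.random() for _ in range(PIXELS)] for i in range(200)}
warehouse_rows = [{"item_id": i, "price": random.random()} for i in range(200)]

# Step 1: store only the embeddings, keyed by id (a stand-in warehouse table).
embedding_table = {item_id: embed(img) for item_id, img in images.items()}

# Step 2: join embeddings with the dimensional data on item_id.
def features(row):
    return embedding_table[row["item_id"]] + [row["price"]]

X = [features(r) for r in warehouse_rows]
# Synthetic labels that depend on both the image and the tabular column (toy only).
y = [1 if f[0] + 2 * (f[-1] - 0.5) > 0 else 0 for f in X]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, bias, X, y):
    total = 0.0
    for x, t in zip(X, y):
        p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
        p = min(max(p, 1e-12), 1 - 1e-12)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)

# Step 3: train a downstream model on the joined features.
def train(X, y, lr=0.5, steps=200):
    """Full-batch gradient descent on logistic loss (stand-in for the second net)."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - t
            grad_w = [g + err * v for g, v in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

initial_loss = log_loss([0.0] * len(X[0]), 0.0, X, y)
weights, bias = train(X, y)
final_loss = log_loss(weights, bias, X, y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

The point of the split is that each stage can be owned, swapped, and debugged separately: the encoder only ever sees images, and the downstream model only ever sees warehouse columns.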
The whole post is excellent but this part made me LOL:
The IKEA effect refers to the phenomenon that people attribute more value to products they helped create. It turns out that this effect applies broadly to all kinds of products (furniture, cake mixes, toys, etc.). What I am conjecturing is that the same effect is predominant in companies with a strong engineering culture. An engineering team that built their own ML Platform from the ground up, flawed as it may be, will attribute more value to it than if they just bought something out-of-the-box from a vendor. They give it a fancy name, write blog posts about it, and everyone gets promoted.
Want one more post of a similar bent? Check out What’s Wrong with MLOps?—it’s even more curmudgeonly (but not wrong).
File this under “why didn’t it exist already!?” I remember way back in my grad school days we had the conversation “how do you know how many clusters to use?” and my stat professor said, roughly, “just do what seems to fit the data.” Since then I had never seen a simple explanation of a quantitative approach to this question, and loved reading this very straightforward post on it.
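For a feel of what a quantitative answer can look like, here is a small sketch using the silhouette score, which is one common way to compare cluster counts. Whether the linked post uses this exact method is an assumption on my part, and the toy data (three well-separated 2-D blobs) is made up. The k-means here is a bare-bones version with a deterministic farthest-point initialization, chosen to keep the example reproducible rather than to match any particular library.

```python
import math
import random

random.seed(0)

# Toy data: three well-separated 2-D blobs (a made-up example).
CENTERS = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
points = [(cx + random.gauss(0, 1), cy + random.gauss(0, 1))
          for cx, cy in CENTERS for _ in range(30)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def kmeans(data, k, iters=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in data]
        for j in range(k):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:  # keep the old center if a cluster goes empty
                centers[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return labels

def silhouette(data, labels):
    """Mean silhouette coefficient: higher means tighter, better-separated clusters."""
    clusters = {}
    for p, l in zip(data, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(data, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Score each candidate k and keep the one with the highest mean silhouette.
scores_by_k = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores_by_k, key=scores_by_k.get)
print("best k by silhouette:", best_k)
```

On real data the scores are rarely this clear-cut, so it is worth looking at the whole curve of scores across k rather than only the maximum.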
Thanks to our sponsor!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.