Evolving ML Models. Scaling Kubernetes. Markov Chains. Similar Item Search at Airbnb. [DSR #127]
The Week's Most Useful Posts
It’s good to see some awesome data content coming out of Squarespace. This post outlines how the data engineering team has been able to scale access control across their data warehouse in an environment with hundreds of users.
This problem, ensuring that the right people have access to the right data, can be surprisingly tricky. Often, organizations simply throw up their hands and give everyone superuser access, but that is not a good answer. As compliance and data governance grow ever more important, this is something you need to care about. Do all users at your company need to see customer email addresses? It only takes one UNLOAD command by one user to create a lot of pain.
Word on the street is that the Squarespace team that built this tool is close to releasing it as open source. I’ll be sure to include a link here if and when they do.
Updates from Google Research: they’ve successfully used an evolutionary algorithm to beat reinforcement learning-based approaches to AutoML. The most fascinating result (to me) is that the evolutionary approach does not simply perform better; it also requires far less computation to get there (graph on right, above).
This summary is fairly short and accessible.
We’ve been running Kubernetes for deep learning research for over two years. While our largest-scale workloads manage bare cloud VMs directly, Kubernetes provides a fast iteration cycle, reasonable scalability, and a lack of boilerplate which makes it ideal for most of our experiments. We now operate several Kubernetes clusters (some in the cloud and some on physical hardware), the largest of which we’ve pushed to over 2,500 nodes.
The data infrastructure at your org probably doesn’t come anywhere close to a 2,500-node Kubernetes cluster(!), but it’s fascinating to know how one of the most bleeding-edge AI research organizations in the world sets up its experimental environments. This stuff is hard.
We’re releasing a new batch of seven unsolved problems which have come up in the course of our research at OpenAI.
Sometimes the hardest thing in research is coming up with unsolved, but solvable, problems. Here are the problems that OpenAI is looking at right now; these are great indicators of what cutting edge looks like today. One of the hardest: regularization in reinforcement learning. 👍👍
Markov chains are a very useful, but less commonly understood, tool that you should have in your toolbox. This is a great intro.
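If you have never played with one, here is a minimal sketch of the idea in Python. It is not taken from the linked post: the two weather states and their transition probabilities are made-up values for illustration. It simulates a path through the chain and estimates the long-run share of time spent in each state by repeatedly applying the transition probabilities.

```python
import random

# Hypothetical two-state chain; the probabilities are illustrative only.
# TRANSITIONS[s][t] is the probability of moving from state s to state t.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def step(state, rng):
    """Sample the next state; it depends only on the current state."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]


def simulate(start, n_steps, seed=0):
    """Return a path of n_steps transitions beginning at `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path


def stationary_distribution(n_iter=200):
    """Approximate the long-run distribution by iterating the transitions."""
    dist = {s: 1.0 / len(TRANSITIONS) for s in TRANSITIONS}
    for _ in range(n_iter):
        new = {s: 0.0 for s in TRANSITIONS}
        for s, p in dist.items():
            for t, q in TRANSITIONS[s].items():
                new[t] += p * q
        dist = new
    return dist


path = simulate("sunny", 10)
long_run = stationary_distribution()
```

With these made-up numbers the chain settles at roughly two-thirds sunny and one-third rainy, regardless of where it starts.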
In this blog post we describe a Listing Embedding technique we developed and deployed at Airbnb for the purpose of improving Similar Listing Recommendations and Real-Time Personalization in Search Ranking. The embeddings are vector representations of Airbnb homes learned from search sessions that allow us to measure similarities between listings. They effectively encode many listing features, such as location, price, listing type, architecture and listing style, all using only 32 float numbers. We believe that the embedding approach for personalization and recommendation is very powerful and useful for any type of online marketplace on the Web.
The approach is almost comically effective (the above pictures are of different locations!). Really interesting work, and detailed writeup.
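To make "measuring similarities between listings" concrete, here is a rough sketch of how a similar-listing lookup could work once you have 32-number embeddings. It is not Airbnb's code: the listing IDs and vectors are randomly generated placeholders, and using cosine similarity as the measure is my assumption for illustration.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, k=2):
    """Return the k listings closest to query_id, best match first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder data: hypothetical listing IDs with random 32-dimensional vectors.
rng = random.Random(0)
DIM = 32
base = [rng.gauss(0, 1) for _ in range(DIM)]
embeddings = {
    "listing_a": base,
    # A small perturbation of listing_a, standing in for a similar home.
    "listing_b": [x + rng.gauss(0, 0.1) for x in base],
    # An unrelated random vector, standing in for a dissimilar home.
    "listing_c": [rng.gauss(0, 1) for _ in range(DIM)],
}

neighbors = most_similar("listing_a", embeddings, k=2)
```

A brute-force scan like this is fine for a handful of vectors; a real marketplace catalog would likely need an approximate nearest-neighbor index instead.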
I frequently run across data scientists who just don’t have great core tech skills. One of the most common gaps is in core networking concepts.
This post is a surprisingly exhaustive list of things that you should really know if you expect to code in a modern, networked world.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.