An ML Pipeline from Airbnb, Scalable Data Eng from Thumbtack, & Smart SQL [DSR #95]
This week we start off with two awesome case studies on how Airbnb and Thumbtack have built big parts of their stack. This type of infrastructure is an incredible force multiplier within an org, and these posts can be conversation starters on your teams when thinking through your roadmap.
❤️ Want to support us? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
Two Posts You Can't Miss
This is an amazing walkthrough of a significant data engineering effort. I like the post so much because it puts scalability and maintainability at the very center. Thumbtack had pain around managing compute instances, so they switched to a managed architecture (Dataproc + BigQuery + GCS). As a result, they get better performance and more control. But most importantly, they changed how they spend their time.
We’ve seen tremendous productivity gains across the organization by our move to managed services(…). Going forward, our infrastructure investments will be focused on further empowering our engineering, analytics and data science teams to leverage our large-scale data in new ways.
Data teams should be seriously considering the impact of their tech choices on how they spend their time. Forcing yourself to maintain servers will prevent you from focusing on what matters: analyzing data.
It’s a little embarrassing how often I link to Airbnb posts, but these folks do such great work. This post isn’t actually about predicting home values, it’s about how their team manages the entire ML process, from front to back. My absolute favorite part is their work on an internal feature repository called Zipline:
To make this work more scalable, we developed Zipline — a training feature repository(…) The crowdsourced nature of this internal tool allows data scientists to use a wide variety of high quality, vetted features that others have prepared for past projects. If a desired feature is not available, a user can create her own(…)
Wow. It’s hard to overstate how far ahead this is of where most companies are today. Their work on productionizing models is also impressive.
This Week's Top Posts
Cosette allows you to evaluate whether or not two SQL statements are functionally identical, in that they return exactly the same results in all possible cases. This is extraordinarily useful if you care about code readability and performance and find yourself doing a lot of SQL refactoring.
I personally think that even though SQL is 40 years old, it’s still only beginning to be well-used within analytics. Tools like this get me really excited. I plan on using Cosette extensively.
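To make the refactoring use case concrete, here is a minimal sketch of the weaker check most of us do by hand: run both versions of a query against sample data and compare the results. This is not Cosette and does not use its interface; the tables, rows, and the result_bag helper are all invented for illustration. Agreement on one dataset only shows the two queries match on that data, whereas a tool like Cosette is after equivalence on every possible input.

```python
import sqlite3

# Hypothetical schema and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob'), (3, 'cy');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 3, 2.0);
""")

# Original query: users with at least one order, written with IN.
query_a = """
    SELECT name FROM users
    WHERE id IN (SELECT user_id FROM orders)
"""

# Refactored query: the same intent, written with EXISTS.
query_b = """
    SELECT name FROM users u
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id)
"""

def result_bag(sql):
    # Sort the rows so the comparison ignores row order, which SQL
    # does not guarantee without an ORDER BY.
    return sorted(conn.execute(sql).fetchall())

# True here means "same rows on this sample", not "equivalent in all cases".
same_on_sample = result_bag(query_a) == result_bag(query_b)
print(same_on_sample)
```

A spot check like this is cheap, but it can miss cases the sample data never exercises (NULLs, duplicates, empty tables), which is exactly the gap a formal equivalence checker is meant to close.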
Most application databases don’t store historical changes, which causes blind spots in analysis. This article presents a straightforward approach to have dbt create these history / log tables for you. Super-useful: other approaches to solving this problem require far more engineering effort.
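For readers who have not built one of these before, here is a rough sketch of the general idea behind a history table: on a schedule, append a timestamped copy of every row whose values changed since it was last captured. This is not the article's dbt implementation. It uses plain sqlite3, and the accounts table, the accounts_history table, the snapshot function, and the dates are all hypothetical.

```python
import sqlite3

# Hypothetical application table plus an append-only history table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE accounts_history (id INTEGER, plan TEXT, captured_at TEXT);
    INSERT INTO accounts VALUES (1, 'free'), (2, 'pro');
""")

def snapshot(captured_at):
    # Append a row to the history table for every account whose current
    # plan differs from its most recently captured plan, or that has
    # never been captured. Unchanged rows are skipped.
    conn.execute(
        """
        INSERT INTO accounts_history (id, plan, captured_at)
        SELECT a.id, a.plan, ?
        FROM accounts a
        WHERE NOT EXISTS (
            SELECT 1 FROM accounts_history h
            WHERE h.id = a.id
              AND h.plan = a.plan
              AND h.captured_at = (
                  SELECT MAX(h2.captured_at)
                  FROM accounts_history h2
                  WHERE h2.id = a.id
              )
        )
        """,
        (captured_at,),
    )

snapshot("2017-08-01")                                    # first capture: both rows
conn.execute("UPDATE accounts SET plan = 'pro' WHERE id = 1")
snapshot("2017-08-02")                                    # only account 1 changed
snapshot("2017-08-03")                                    # nothing changed, nothing added

history_for_1 = conn.execute(
    "SELECT plan, captured_at FROM accounts_history WHERE id = 1 ORDER BY captured_at"
).fetchall()
total_rows = conn.execute("SELECT COUNT(*) FROM accounts_history").fetchone()[0]
print(history_for_1, total_rows)
```

The sketch only sees changes that are still present when a snapshot runs, so anything that changes twice between runs is lost, and a NULL plan would be re-captured on every run because of how SQL compares NULLs. Both are worth keeping in mind when choosing a snapshot frequency.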
We’ve created images that reliably fool neural network classifiers when viewed from varied scales and perspectives. This challenges a claim from last week that self-driving cars would be hard to trick maliciously since they capture images from multiple scales, angles, perspectives, and the like.
Malicious intent is infrequently a consideration in most current data science applications. Techniques will need to adapt once it is.
Google’s new Facets open source project is a really impressive take on data exploration. It’s built specifically to dredge through large numbers of features and visually explore underlying relationships. Super-cool.
Fast forward to 2017. Over 29,000 female students took an AP CS exam this year, which is more than the entire AP CS exam participation in 2013 when Code.org launched. Though computer science has seen…
If your Photoshop skills are as nonexistent as mine, this will be good news. Impressive results.
This is a great link to share with non-data-scientists you know—your family and friends who have no idea what you do, or your boss who could use more context about the state of the industry. Very accessible.
Data viz of the week
Great redesign of the original. Click for the thinking behind it.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123