An ML Pipeline from Airbnb, Scalable Data Eng from Thumbtack, & Smart SQL [DSR #95]

Jul 23, 2017

This week we start off with two awesome case studies on how Airbnb and Thumbtack have built big parts of their stack. This type of infrastructure is an incredible force multiplier within an org, and these posts can be conversation starters on your teams when thinking through your roadmap.

Enjoy!

- Tristan

❤️ Want to support us? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

Two Posts You Can't Miss

Moving Thumbtack’s Data Infrastructure to GCP

This is an amazing walkthrough of a significant data engineering effort. I like the post so much because it puts scalability and maintainability at the very center. Thumbtack had pain around managing compute instances, so they switched to a managed architecture (Dataproc + BigQuery + GCS). As a result, they get better performance and more control. But most importantly, they changed how they spend their time.

We’ve seen tremendous productivity gains across the organization by our move to managed services(…). Going forward, our infrastructure investments will be focused on further empowering our engineering, analytics and data science teams to leverage our large-scale data in new ways.

Data teams should be seriously considering the impact of their tech choices on how they spend their time. Forcing yourself to maintain servers will prevent you from focusing on what matters: analyzing data.

cloud.google.com

Airbnb: Using Machine Learning to Predict Home Value

It’s a little embarrassing how often I link to Airbnb posts, but these folks do such great work. This post isn’t actually about predicting home values; it’s about how their team manages the entire ML process, from front to back. My absolute favorite part is their work on an internal feature repository called Zipline:

To make this work more scalable, we developed Zipline — a training feature repository(…) The crowdsourced nature of this internal tool allows data scientists to use a wide variety of high quality, vetted features that others have prepared for past projects. If a desired feature is not available, a user can create her own(…)

Wow. It’s hard to overstate how far ahead this is of where most companies are today. Their work on productionizing models is also impressive.

medium.com

This Week's Top Posts

Cosette: An Automated SQL Solver

Cosette allows you to evaluate whether or not two SQL statements are functionally identical, in that they return exactly the same results in all possible cases. This is extraordinarily useful if you care about code readability and performance and find yourself doing a lot of SQL refactoring.

I personally think that even though SQL is 40 years old, it’s still only beginning to be well-used within analytics. Tools like this get me really excited. I plan on using Cosette extensively.
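
To make "functionally identical" concrete, here is a minimal sketch of the kind of manual spot check people tend to do when refactoring SQL. It is not how Cosette works: Cosette reasons about all possible inputs, while this only compares results on one small dataset. The `orders` table, its columns, and the `same_results` helper are hypothetical names made up for illustration.

```python
import sqlite3


def same_results(conn, query_a, query_b):
    """Return True if both queries return the same rows on THIS database.

    Rows are sorted so that ordering differences are ignored. Passing this
    check on one dataset is evidence, not proof, of equivalence.
    """
    rows_a = sorted(conn.execute(query_a).fetchall())
    rows_b = sorted(conn.execute(query_b).fetchall())
    return rows_a == rows_b


# Hypothetical sample data, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 11, 15.0);
""")

# Original query: filter customers with a subquery.
original = """
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE customer_id IN (SELECT customer_id FROM orders WHERE amount > 20)
    GROUP BY customer_id
"""

# Refactored query: the same filter expressed as a CTE plus a join.
refactored = """
    WITH big_spenders AS (
        SELECT DISTINCT customer_id FROM orders WHERE amount > 20
    )
    SELECT o.customer_id, SUM(o.amount) AS total
    FROM orders o
    JOIN big_spenders b ON b.customer_id = o.customer_id
    GROUP BY o.customer_id
"""

print(same_results(conn, original, refactored))
```

A check like this can only catch differences that happen to show up in the sample rows, which is exactly the gap a solver that covers all possible cases is meant to close.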

cosette.cs.washington.edu

Analytics Playbook: Log Tables

Most application databases don’t store historical changes, which causes blind spots in analysis. This article presents a straightforward approach to have dbt create these history / log tables for you. Super-useful: other approaches to solving this problem require far more engineering effort.
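
As a rough illustration of the blind spot and of what a log table buys you, here is a minimal sketch that periodically appends changed rows to a history table. It is not the article's dbt implementation; it assumes a generic snapshot-and-append approach, and the `users` and `users_log` tables and the `snapshot` function are hypothetical.

```python
import sqlite3


def snapshot(conn, captured_at):
    """Append each user's current plan to users_log if it differs from
    that user's most recently logged plan (or if the user was never logged)."""
    conn.execute(
        """
        INSERT INTO users_log (user_id, plan, captured_at)
        SELECT u.id, u.plan, ?
        FROM users u
        WHERE NOT EXISTS (
            SELECT 1 FROM users_log l
            WHERE l.user_id = u.id
              AND l.plan = u.plan
              AND l.captured_at = (
                  SELECT MAX(captured_at) FROM users_log WHERE user_id = u.id
              )
        )
        """,
        (captured_at,),
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT);
    CREATE TABLE users_log (user_id INTEGER, plan TEXT, captured_at TEXT);
    INSERT INTO users VALUES (1, 'free'), (2, 'pro');
""")

snapshot(conn, "2017-07-01")
# The application overwrites the row in place, so the old value is lost there.
conn.execute("UPDATE users SET plan = 'pro' WHERE id = 1")
snapshot(conn, "2017-07-02")

history = conn.execute(
    "SELECT user_id, plan, captured_at FROM users_log ORDER BY user_id, captured_at"
).fetchall()
print(history)
```

After the second snapshot, `users` only knows that user 1 is on the pro plan, while `users_log` still records that they started on free. Changes that happen and revert between two snapshots are still missed, so the snapshot frequency bounds how much history you can recover.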

blog.fishtownanalytics.com

Robust Adversarial Examples [OpenAI]

We’ve created images that reliably fool neural network classifiers when viewed from varied scales and perspectives. This challenges a claim from last week that self-driving cars would be hard to trick maliciously since they capture images from multiple scales, angles, perspectives, and the like.

Malicious intent is infrequently a consideration in most current data science applications. Techniques will need to adapt once it is.

blog.openai.com

Facets: An Open Source Visualization Tool for Machine Learning Training Data

Google’s new Facets open source project is a really impressive take on data exploration. It’s built specifically to dredge through large numbers of features and visually explore underlying relationships. Super-cool.

research.googleblog.com

Girls set AP Computer Science record…skyrocketing growth outpaces boys

Fast forward to 2017. Over 29,000 female students took an AP CS exam this year, which is more than the entire AP CS exam participation in 2013 when Code.org launched. Though computer science has seen…

medium.com

Using Deep Learning to Create Professional-Level Photographs

If your Photoshop skills are as nonexistent as mine, this will be good news. Impressive results.

research.googleblog.com

The Business of Artificial Intelligence

This is a great link to share with non-data-scientists you know—your family and friends who have no idea what you do, or your boss who could use more context about the state of the industry. Very accessible.

hbr.org

Data viz of the week

Great redesign of the original. Click for the thinking behind it.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.

fishtownanalytics.com

Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.

www.stitchdata.com

By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.


915 Spring Garden St., Suite 500, Philadelphia, PA 19123

The Analytics Engineering Roundup
