Managing Data Scientists. LinkedIn DataHub. Faster Machine Learning. Optimizing Queries. Presto. [DSR #196]


This week's best data science articles

The Care and Feeding of Data Scientists

Over the past 5 to 10 years, data science has grown tremendously. But as young as data science is as a discipline, the craft of managing data scientists is even younger. Many of today’s data science managers were thrust into management roles out of necessity (“battlefield promotions”) or because they were the best individual contributors, and many come from purely academic backgrounds. At some companies, engineering or product leaders are being tasked with building new data science functions without any real data science experience of their own. More and more people find themselves managing data scientists without the necessary toolset or role models or mentorship to do the job well.

This report aims to fill that gap—to become a resource that data science leaders (whether they’re data scientists or engineers or product managers) can use to understand how data science management is both similar to, and distinct from, other types of management and to learn concrete tips for building and sustaining their teams. It’s also aimed at anyone trying to decide whether managing data scientists might be for them someday.

Fantastic treatment of this important topic. It’s a 50-page PDF from O'Reilly, no gate.


LinkedIn DataHub: A Generalized Metadata Search & Discovery Tool

Wow. The metadata space is really hot in bigtechland. The post summarizes recent work in a single sentence:

…tools developed in this space include AirBnb’s Dataportal, Uber’s Databook, Netflix’s Metacat, Lyft’s Amundsen, and most recently Google’s Data Catalog.

There’s a lot happening, and with good reason. These companies have truly massive amounts of data, and massive, highly trained, and expensive workforces who consume it. Once companies and teams grow beyond some fairly fundamental human limits, they have to rely on tooling to aid in discovery. The post walks through the many updates to LinkedIn’s product (previously called WhereHows), including lessons the team has learned over the past three years.

Really good read. I’m confident that products like this will reach your team within the next one to four years.


Machine learning, faster

I recently gave a couple of conference presentations about how we are thinking about speed when developing machine learning systems at Monzo. This post covers some of the background to the points I was making in my talks, as well as what we’re doing in the Monzo machine learning team to speed up our own work.



SQL 201: Optimizing Queries, Regardless of Platform

These are the two best paragraphs I’ve ever read on the topic of SQL optimization:

Query optimization is about computers doing less work at query time. Making your queries fast boils down to making your queries do less work for the same results. There are many different strategies for achieving that goal, and it takes technical knowledge to know which strategy to employ.

Doing less work means understanding 2 things: (a) know what your DB is doing, and (b) know how to adjust what you’re commanding the DB to do, to do less work.

This is a must-read if you write SQL. If you have teammates who write SQL, send this to them.
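
To make "know what your DB is doing" concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for whatever database you actually run. The `orders` table, its columns, and the `idx_orders_customer` index are all made up for illustration and do not come from the linked article. The sketch asks SQLite for its query plan before and after adding an index, so you can see the same query go from reading every row to looking up only the matching ones.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' text of each step in SQLite's EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each plan row.
    return [row[-1] for row in rows]


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

sql = "SELECT SUM(amount) FROM orders WHERE customer_id = ?"

# (a) Know what the DB is doing: with no index it must scan the whole table.
plan_before = query_plan(conn, sql, (42,))
total_before = conn.execute(sql, (42,)).fetchone()[0]

# (b) Adjust what we ask it to do: an index lets it touch only matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))
total_after = conn.execute(sql, (42,)).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("same result:", total_before == total_after)
```

The exact wording of the plan varies by SQLite version, but the first plan should describe a scan of `orders` and the second a search using the index. Other databases expose the same idea through their own EXPLAIN output.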


Data Science Best Practices @ Ravelin

The Ravelin team has four core data science principles:

  1. All new starters (of any seniority) will build, train and deploy production models within their first week.

  2. Automate the automatable and use humans for the rest.

  3. Deploy models incrementally and often.

  4. End users will never notice a model change, other than improved results.

The post goes super-deep into their specific practices in each of these areas. Impressive.


Presto Infrastructure at Lyft - Lyft Engineering

Early in 2017 we started exploring Presto for OLAP use cases and we realized the potential of this amazing query engine.

…and thus begins a love story for the ages ;) The Lyft team has made some huge investments in the overall platform and has a serious Presto infrastructure stood up internally (roughly one-third the aggregate memory of Pinterest’s infrastructure).

The thing I always wonder about companies using Presto is their comparison set. Lyft migrated their query workloads from both Hive and Redshift and found Presto to be a better choice than either. My guess is that they didn’t evaluate Snowflake or BigQuery, given that both platforms were significantly less mature in 2017 when their original migration was in flight. I still haven’t seen a head-to-head comparison of Presto vs. either of these more modern analytic databases.

Articles like this one, in which the Presto team announces a 10x improvement in its UNNEST operation, are what make me believe that Presto is meaningfully behind; unnesting has been surprisingly fast on Snowflake for years.


Thanks to our sponsors!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
