It was hard to narrow things down this week! This issue contains a mix of new research, practical applications, and compelling data journalism. Thanks for your feedback in recent weeks; keep it coming!
❤️ Want to support us? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
Two Posts You Can't Miss
Ever heard of the paperclip problem? It’s probably the most boring way ever imagined for humans to become extinct, and DeepMind and OpenAI take it very seriously. Their most recent approach avoids the problem altogether by not specifying a utility function, but instead training the network using human feedback:
…these results demonstrate one method to address this, by allowing humans with no technical experience to teach a reinforcement learning (RL) system - an AI that learns by trial and error - a complex goal. This removes the need for the human to specify a goal for the algorithm in advance. This is an important step because getting the goal even a bit wrong could lead to undesirable or even dangerous behaviour. In some cases, as little as 30 minutes of feedback from a non-expert is enough to train our system, including teaching it entirely new complex behaviours, such as how to make a simulated robot do backflips.
It’s a fascinating piece of research, and well-documented. As usual, the example applications are toys, but you can easily imagine armies of humans employed to train algorithms. Commercial applications abound.
Fresh out of school I joined Spotify as the first data analyst. One of my first projects was to understand conversion rates. Conversion rate from the free service to Premium is tricky because there’s a huge time lag.
Erik is 100% absolutely right: conversion rates are surprisingly challenging to measure when there is a significant delay between the top and bottom of a funnel. Here’s how he solved this problem at Spotify. My favorite line of the post:
conversion rates are pointless to try to quantify as a single number
100% agree. Conversion rates are among the most commonly measured metrics in tech, and yet this topic is still surprisingly poorly understood. Are you familiar with the Kaplan-Meier non-parametric estimator?
Extremely practical, highly recommended.
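If Kaplan-Meier is new to you, here is a minimal sketch of the idea in plain Python. It is an illustration under stated assumptions, not necessarily the code from the post: the function name `kaplan_meier` and the six-user dataset are made up. Each user has a number of days observed and a flag for whether they converted. Users who have not converted yet are treated as censored rather than as failures, which is the step a single-number conversion rate skips.

```python
from collections import Counter


def kaplan_meier(durations, converted):
    """Estimate the share of users NOT yet converted after each event time.

    durations: days each user was observed (until conversion or until today).
    converted: True if the user converted at that duration, False if censored.
    Hypothetical helper for illustration only.
    """
    events = Counter(t for t, c in zip(durations, converted) if c)
    exits = Counter(durations)  # every user leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Of the users still being observed at t, a fraction d / at_risk converts.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Made-up data: six users, three of whom have converted so far.
durations = [5, 10, 10, 20, 30, 30]
converted = [True, False, True, True, False, False]

curve = kaplan_meier(durations, converted)
for day, not_converted in curve:
    print(f"day {day}: estimated conversion {1 - not_converted:.1%}")
```

On this toy data the naive rate is 3 of 6, or 50%, while the estimate at day 20 comes out near 56%, because the user who dropped out of observation at day 10 is no longer counted against the rate after that point. For real work, a maintained survival analysis library is a safer choice than a hand-rolled loop.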
This Week's Top Posts
VC firm MMC Ventures has published an excellent explanation of their thesis in the frothy world of AI investing. If you’re building a product in the space, this is a must-read: it’s going to inform a lot of investor thinking.
Netflix is building a new data science team focused on content delivery. It turns out that serving “125 million hours of video every day, to 100 million members across the globe” is a very hard problem. This article is a fascinating look at how Netflix is using data science paired with hardcore systems engineering to achieve that insane level of scale.
This is grade-A data journalism. The most interesting part, maybe, is that this isn’t some major feature, it’s just…a pretty normal article. Data journalism is now just journalism. This is a very good thing.
We all often overlook the storytelling component of our jobs, but it’s as critical a part of the trade as any other (perhaps more so).
Mark Rittman has a really excellent post on his blog exploring one of the quirks of BigQuery: it’s actually not great at joining two large tables together specifically because of its embarrassingly parallel architecture. The post walks through how to model your data using nested structures to avoid large joins and leverage BQ optimally.
This article is an overview of the most popular anomaly detection algorithms for time series and their pros and cons.
Really, really useful.
Walsh and his colleagues have created machine-learning algorithms that predict, with unnerving accuracy, the likelihood that a patient will attempt suicide. In trials, results have been 80-90% accurate when predicting whether someone will attempt suicide within the next two years, and 92% accurate in predicting whether someone will attempt suicide within the next week.
After a natural disaster, humanitarian organizations need to know where affected people are located, what resources are needed, and who is safe. This information is extremely difficult and often impossible to capture through conventional data collection methods in a timely manner. As more people connect and share on Facebook, our data is able to provide insights in near-real time to help humanitarian organizations coordinate their work and fill crucial gaps in information during disasters.
Watch the video. Impressive.
Data Viz of the Week
Roman roads, subway style. I ordered a print.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123