Data Science Roundup #78: Big Week! Analyzing /r/The_Donald, Writing Good Code, & much more!

So much good stuff out this week! Don’t miss the 538 article; it’s one of my favorite pieces of data journalism ever.

Have anything you want to see included? Shoot me an email.

- Tristan


Two Posts You Can't Miss

538: Dissecting Trump’s Most Rabid Online Following

Have you ever visited /r/The_Donald/? If not, you should take just a minute to do so before reading this. Try not to fall down the black hole.

In this post, 538 analyzed the comments on thousands of subreddits and then used an algorithm to “add” and “subtract” the various communities from one another. The results are compelling.

I’m genuinely impressed with the work that 538 put into this article: the analysis is sophisticated, the visualizations are high quality, and the storytelling is compelling. They’re setting a very high bar for what good data journalism looks like.
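If the “add” and “subtract” idea sounds abstract, here is a minimal sketch of the general technique: treat each community as a vector, combine vectors arithmetically, and rank the rest by cosine similarity. The community names and numbers below are invented toy data, and this is not 538's code; how they built their actual vectors from comment data isn't reproduced here.

```python
import math

# Toy vectors, invented for illustration only (not real subreddit data).
# Each dimension loosely stands for a theme: [news, gaming, memes].
VECTORS = {
    "news": [1.0, 0.0, 0.0],
    "gaming": [0.0, 1.0, 0.0],
    "memes": [0.0, 0.0, 1.0],
    "gaming_news": [1.0, 1.0, 0.0],
    "news_memes": [1.0, 0.0, 1.0],
}

def combine(a, b, sign=1):
    """Add (sign=1) or subtract (sign=-1) two community vectors."""
    return [x + sign * y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest(query, exclude=()):
    """Name of the community most similar to the query vector."""
    candidates = {k: v for k, v in VECTORS.items() if k not in exclude}
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# "gaming_news minus gaming": what is left once the gaming part is removed?
difference = combine(VECTORS["gaming_news"], VECTORS["gaming"], sign=-1)
closest_to_difference = nearest(difference, exclude=("gaming_news", "gaming"))

# "news plus memes": which community looks like the blend of the two?
blend = combine(VECTORS["news"], VECTORS["memes"])
closest_to_blend = nearest(blend, exclude=("news", "memes"))
```

Excluding the input communities from the ranking matters: otherwise the query tends to land right back on one of the communities you started from.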


Announcing Distill: A Modern Medium for Presenting Research

A joint launch between Google Brain, OpenAI, DeepMind, YC Research, and others, Distill aims to provide a better mechanism for disseminating research on ML. From the Google announcement:

Science isn’t just about discovering new results. It’s also about human understanding. Scientists need to develop notations, analogies, visualizations, and explanations of ideas. This human dimension of science isn’t a minor side project. It’s deeply tied to the heart of science.

That’s why, in collaboration with OpenAI, DeepMind, YC Research, and others, we’re excited to announce the launch of Distill, a new open science journal and ecosystem supporting human understanding of machine learning. Distill is an independent organization, dedicated to fostering a new segment of the research community.

If you’ve ever read an ML paper, you know it’s not a great experience. I’m excited to see how much traction Distill gets.


This Week's Top Posts

What makes a great data scientist?


  • An obsession with solving problems, not new tools

  • A desire to find a solution even though it’s inevitably not perfect

  • Strong communication skills

Great post.


Reproducible Data Analysis in Jupyter

Data scientists know how to call libraries but often don’t go as deep on important software engineering skills like designing modular code, managing projects with git, and contributing to open source repos. This post focuses on how to write good code, collaboratively, within an ecosystem.



Stitch Fix Algorithms Tour

Ever wondered how a truly data-driven organization functions? This interactive website takes you on a tour through the entire operations of Stitch Fix and explains how data impacts every part of their org. I’ve never seen a company put together something quite like this before—unique and fascinating.


Datashader is a Big Deal


datashader makes points and pixels first class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite resolution abstract plane) and allows the user to specify scale dependent calculations and re-calculations over them.

The examples are well worth a look. Impressive.


Calculating CLV

This paper aims to lay out the current state of Customer Lifetime Value calculation research. It is entirely practical, so mathematical descriptions will only be discussed where they are important from a practical perspective. It also aims to provide both code and spreadsheets to allow for usage of the models discussed.

This is the single best reference on calculating customer lifetime value I’ve ever seen. Bookmark this—you’ll need it at some point.


The eigenvector of "Why we moved from language X to language Y"

Good scraping, good analysis, interesting results. Turns out C isn’t going anywhere.


Evolution Strategies as a Scalable Alternative to Reinforcement Learning

OpenAI is churning out great work. This post shares some impressive results: newly trained evolutionary algorithms have matched the performance of their previous reinforcement learning models.

Does this open the door for the return of evolutionary algorithms?
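For a feel of how simple the core loop is, here is a minimal sketch of an evolution strategy on a toy problem: perturb the parameters with Gaussian noise, score each perturbation, and move the parameters toward the noise that scored well. The reward function, target, and hyperparameter values are assumptions picked for illustration; this is not OpenAI's implementation.

```python
import random

def evolution_strategy(reward, theta, sigma=0.1, alpha=0.01,
                       population=50, iterations=300, seed=0):
    """Maximize reward(theta) using only reward evaluations (no gradients)."""
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one Gaussian noise vector per population member.
        noise = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                 for _ in range(population)]
        # Evaluate the reward at each perturbed parameter vector.
        rewards = [reward([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noise]
        # Standardize rewards so the step size does not depend on reward scale.
        mean = sum(rewards) / population
        std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5 or 1.0
        scores = [(r - mean) / std for r in rewards]
        # Move theta along the score-weighted sum of the noise vectors.
        for i in range(dim):
            step = sum(s * eps[i] for s, eps in zip(scores, noise))
            theta[i] += alpha / (population * sigma) * step
    return theta

# Hypothetical toy objective: get as close as possible to TARGET.
TARGET = [3.0, -2.0]

def toy_reward(params):
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))

solution = evolution_strategy(toy_reward, [0.0, 0.0])
```

Each population member only needs to report a single number back, which is a large part of why the approach parallelizes so well.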


A new “Mathematician’s Apology”

One academic’s argument that a purely theoretical mathematics education, undergraduate and graduate, is more relevant now than it has ever been, despite the lack of funding for traditional academic careers.

Great post. I can’t imagine a more valuable undergrad degree today.


An Introduction to Markov Chains

We recently used Markov chains to do some marketing attribution for a client. The approach was surprisingly straightforward and the results were compelling. If you’ve never used Markov chains, this is a great resource.
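For the curious, here is a minimal sketch of one common way to do Markov chain attribution, the "removal effect": estimate transition probabilities from customer paths, then measure how much the overall conversion probability drops when a channel is removed. The paths, channel names, and the "start", "conv", and "null" state labels are invented for illustration, and this is an assumed approach rather than a description of the client work mentioned above.

```python
from collections import defaultdict

def transition_probs(paths):
    """Estimate first-order transition probabilities from observed paths."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

def conversion_prob(probs, removed=None, sweeps=200):
    """Probability of reaching 'conv' from 'start', optionally minus one channel."""
    # value[s] is the chance of eventually converting from state s.
    # States never assigned (such as 'null') default to 0.0.
    value = defaultdict(float)
    value["conv"] = 1.0
    for _ in range(sweeps):
        for src, dsts in probs.items():
            if src == removed:
                continue
            # Traffic into a removed channel is treated as lost.
            value[src] = sum(p * value[dst]
                             for dst, p in dsts.items() if dst != removed)
    return value["start"]

def attribution_shares(paths, channels):
    """Normalized removal effects per channel (assumes some paths convert)."""
    probs = transition_probs(paths)
    base = conversion_prob(probs)
    effects = {c: 1 - conversion_prob(probs, removed=c) / base for c in channels}
    total = sum(effects.values())
    return {c: e / total for c, e in effects.items()}

# Invented toy paths: each starts at 'start' and ends in 'conv' or 'null'.
paths = [
    ["start", "email", "search", "conv"],
    ["start", "search", "conv"],
    ["start", "email", "null"],
    ["start", "search", "null"],
]
shares = attribution_shares(paths, ["email", "search"])
```

On this toy data, search gets most of the credit because every converting path passes through it, while email still gets some for feeding traffic into search.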


Data viz of the week

The clarity of the visualization flows from the clarity of the question.


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

Fishtown Analytics works with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.





915 Spring Garden St., Suite 500, Philadelphia, PA 19123