Data Science Roundup #50: A Superintelligent Hedge Fund, 100 Billion Lessons Learned, and the Rise of the Digital Fingerprint

This week's best data science articles

Super Intelligence for The Stock Market

Numerai is synthesizing machine intelligence to command the capital of an American hedge fund using crowdsourcing and ensembles. This is interesting stuff—I don’t feel qualified to have an opinion on whether it’s as big as they claim, but you definitely need to read this. Here are two more great posts from the team if you’re a finance data geek.
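
If "ensembles" is a new term for you, here is a minimal Python sketch of the general idea: combine predictions from several independently built models by averaging them row by row. This is not Numerai's actual method; the `ensemble_mean` helper and the model names are hypothetical and only illustrate the simplest form of ensembling.

```python
from statistics import fmean


def ensemble_mean(predictions_by_model):
    """Average per-row predictions across models (hypothetical helper)."""
    models = list(predictions_by_model.values())
    if not models:
        raise ValueError("need at least one model's predictions")
    n_rows = len(models[0])
    if any(len(preds) != n_rows for preds in models):
        raise ValueError("all models must predict the same number of rows")
    # zip(*models) groups the predictions for each row across all models.
    return [fmean(row) for row in zip(*models)]


# Made-up submissions from three crowd-sourced models.
submissions = {
    "model_a": [0.25, 0.75, 0.5],
    "model_b": [0.75, 0.75, 0.0],
    "model_c": [0.5, 0.0, 1.0],
}
combined = ensemble_mean(submissions)
```

Averaging tends to help when the individual models make different mistakes, which is part of why crowdsourcing many models is appealing.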


A Concise History of Neural Networks

If you’re anything like me, you know that deep learning isn’t a new concept and that its roots go back many decades. You’re probably familiar with the name Marvin Minsky, and you know that today’s breakthroughs rely on theoretical work from the past. This article fills in the gaps, giving you a primer on the history of what may be the most important technology of our time. And it’s only a six-minute read. Do yourself a favor: read it.


Handy Python Libraries for Formatting and Cleaning Data

These Python libraries will make the crucial task of data cleaning a bit more bearable—from anonymizing datasets to wrangling dates and times. I’m personally going to check out PrettyPandas, as I definitely need more formatting control over the data tables I output for my clients.


100 Billion Records Later, Refining our ETL Service

This post, written by the VP of Engineering at Stitch, goes deep into the challenges faced in building a data pipeline that has delivered 100 billion records over its first 10 months. My personal takeaway: data engineers may sometimes be too quick to incorporate open source tools. The Stitch team has now removed Spark from their stack and reduced their usage of Kafka. Very interesting lessons that may save you thousands of engineering hours.


Asking good questions is hard (but worth it)

We all know that asking questions is an important skill, but have you ever had someone actually attempt to teach you how to ask a good question? Much of data science is figuring out how to formulate the best possible question, and as it turns out, you might have a lot to learn.


Internet Tracking Has Moved Beyond Cookies

Much of the source data for data science comes from user clickstream data. And the key to the entire clickstream is the cookie: a clever hack invented in the ’90s. But in the ad tech industry, cookies are gradually being pushed aside in favor of fingerprinting. Read this article if you do (or plan to do) any work with clickstream data.
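
To make the contrast concrete, here is a minimal Python sketch of the idea behind fingerprinting: rather than storing an identifier on the device, as a cookie does, the tracker derives one from attributes the browser already exposes. The attribute names and the `fingerprint` helper are hypothetical and chosen purely for illustration; real systems combine far more signals.

```python
import hashlib
import json


def fingerprint(attributes):
    """Derive a stable identifier from browser attributes (hypothetical helper)."""
    # Serialize with sorted keys so the same attributes always hash identically,
    # regardless of the order in which they were collected.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up attribute values for two visits from the same browser.
first_visit = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1440x900",
    "timezone": "America/New_York",
    "fonts": ["Arial", "Helvetica"],
}
second_visit = {
    "timezone": "America/New_York",
    "fonts": ["Arial", "Helvetica"],
    "screen": "1440x900",
    "user_agent": "ExampleBrowser/1.0",
}

# Same attributes, same identifier, and nothing was stored on the device.
same_browser = fingerprint(first_visit) == fingerprint(second_visit)
```

The practical consequence for clickstream work: clearing cookies does not reset an identifier like this, but any change to the underlying attributes does.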


Data viz of the week

Far better usage of this chart type than GA's Behavior Flow tab.

Pay it forward!

I curate the Roundup on my nights and weekends because of the amazing support I get from readers. Know any data scientists who would enjoy reading it? Please send them here (or forward this email). Thanks!

Thanks to our sponsors :D

Fishtown Analytics

Fishtown Analytics is a boutique analytics consultancy serving high-growth, venture-funded startups. Have analytics questions? Let’s chat.



Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
