Data Science Roundup #62: Reproducible Research @ Stripe, plus easy mapping viz, image recognition & more!

Nov 27, 2016

The coming age of image recognition. How to avoid getting overwhelmed as you learn. Reproducible research @ Stripe. Avoiding common statistical mistakes. Easy mapping in R. And the world’s biggest ML & AI resource guide.

A Thanksgiving ask: The Roundup is forwarded to hundreds of new people each week! If you’ve been sent this newsletter by a friend, do me a favor and sign up. It’s your subscriptions that keep The Data Science Roundup growing!

Thanks :D

- Tristan

This week's best data science articles

Cameras, eCommerce and Machine Learning

This week’s “big think” article focuses on image recognition and its implications. “We should expect that every image ever taken can be searched or analyzed, and some kind of insight extracted, at massive scale. (…) When we can turn images into data, we’ll find lots of sets of images that we never really thought of as data before, and lots of problems that didn’t look like image recognition problems.”

ben-evans.com

What are some tips for a beginning ML/data scientist who feels overwhelmed?

A great, short Quora answer from an experienced data scientist. “Instead of adding everything that we stumble upon to our reading lists, I’d say that it makes more sense to be absolutely clear about personal goals first. Since there’s so much material out there, it’s become necessary to be a bit more selective when choosing learning material and exploring different tools. Of course, it sometimes feels like we are missing out on something, but I think that getting used to this feeling really helps to stay focussed and to make steady progress.”

www.quora.com

Reproducible research: Stripe’s approach to data science

The data team at Stripe has heavily invested in reproducibility, with great results. In this post, they share how their team publishes internal research that is then reproducible from scratch by any member of the team, current or future. Git, Jupyter, and internally-built tools are all at the heart of this workflow.

This is a must-read. Data teams need to think of their outputs as research, and need to be focused on building high-quality mechanisms by which this research gets produced and maintained.

stripe.com

Statistical Mistakes and How to Avoid Them

If you find yourself in an analytics role but aren’t heavy on stats, read this post. In it, the author provides guidance on the bare minimum statistics you need to know to produce reliable analytics: significance testing, confidence intervals, and (my favorite!) how to deal with the multiple comparisons problem.

www.cs.cornell.edu
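
For readers who haven't run into the multiple comparisons problem before, here is a minimal sketch in plain Python of two standard corrections, Bonferroni and Holm. It is not taken from the linked post: the function names are hypothetical and the p-values are made up purely for illustration.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: reject a test only if p <= alpha / m,
    where m is the total number of tests run."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down correction: walk the p-values from smallest to
    largest, comparing the k-th smallest (k = 0, 1, ...) against
    alpha / (m - k), and stop at the first one that fails."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject


# Illustrative (made-up) p-values from five separate tests.
p_values = [0.001, 0.012, 0.03, 0.04, 0.20]

naive = [p <= 0.05 for p in p_values]
print("uncorrected:", naive)
print("bonferroni: ", bonferroni(p_values))
print("holm:       ", holm(p_values))
```

With these numbers, the uncorrected threshold flags four of the five tests, Bonferroni keeps only the first, and Holm keeps the first two. In practice you would probably reach for a statistics library rather than hand-rolling this, but the logic is short enough to see in full.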

German Gas Prices Illustrated

Producing high-quality mapping visualizations used to be hard, but at this point, if you’re visualizing data that has a spatial component to it and you’re not using a map, you’re doing it wrong. This article uses R’s ggmap to draw several different maps of gas price data in Germany, and each map takes 3-5 lines of code.

flovv.github.io

The World's Biggest Machine Learning & Artificial Intelligence Index

Um. Wow. This is a collection of every blog, every company, every person, and every conference focused on ML & AI. I can only imagine what a massive effort this was to pull together, and to my knowledge it’s the most extensive resource of its kind. I highly recommend browsing through; I’ve added a bunch of new resources to my regular feeds.

medium.com

Data viz of the week

Introducing the "Troll Hair Chart". Great way to show many stacked time series.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

Fishtown Analytics works with venture-funded startups to implement Redshift, BigQuery, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.

fishtownanalytics.com

Stitch: Simple, powerful ETL built for developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.

www.stitchdata.com

By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123

The Analytics Engineering Roundup
