Are you a Real Data Scientist? The next ten Starbucks in NYC. Communicating Uncertainty. [DSR #103]

Thanks to the folks at Looker for a great JOIN 2017! Lots of great conversations with lots of super-smart people.

Enjoy this week’s roundup, a bit shorter than usual as I’m pulling it together from my redeye home :)

- Tristan

❤️ Want to support us? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

Two Posts You Can't Miss

I am not a real data scientist.

I have never used a deep learning framework, like TensorFlow or Keras. I have never touched a GPU. I don’t have a degree in computer science or statistics. My degree is in mechanical engineering, of all things. I don’t know R. But I haven’t given up hope. After reading a bunch of job postings, I figured out that all it will take to become a real data scientist is five PhD’s and 87 years of job experience.

This is an absolutely wonderful post about the prevalence of imposter syndrome in data science. You will almost definitely experience this feeling (I know I do!). But instead of focusing on credentials obtained and tools used, the author proposes that you are a real data scientist if you are asking good questions and answering them with data.



The Next Wave: Predicting the future of coffee in New York City

In this article we look at New York City through the lens of coffee in an attempt to explore a fundamental question of spatial economics: how are the locations of businesses determined?

This is a super-legit analysis, incorporating factors including day and night population, culture, and pricing to come up with surprisingly specific determinations of exactly where Dunkin' Donuts, Starbucks, and the various third wave shops should place their next ten locations.

I always enjoy seeing someone do the best version of a thing. This analysis is inspirational for anyone trying to come to a specific answer to a complicated problem.


This Week's Top Posts

Communicating Uncertainty When Lives Are on the Line

People are not good at reasoning about uncertainty. When uncertain information is life-and-death important for the general public, thinking through exactly how it is visualized is critical. This thoughtful piece walks through a selection of Hurricane Irma graphics that all communicate future uncertainty and examines the techniques that make each of them good (or not).


Sketchy Data Visualization in Semiotic

More on communicating uncertainty: Semiotic now contains very compelling “sketchy” data visualization options:

Crisp, perfect data visualization is effective and powerful, but data visualization is simply communication, and sometimes what you want to communicate precisely is that the data is imprecise. Sometimes you want to use scientifically proven principles of visual display of information to communicate that the results are not scientifically proven.


Why is Python Growing So Quickly?

I recently shared the KDnuggets survey results showing that Python has overtaken R as the most popular language for data science practitioners. This analysis goes deep into what is driving this growth, using Stack Overflow activity as source data.

Popularity aside, the author concludes by saying that he’s going to continue using R. Popularity isn’t everything.


New AI can work out whether you're gay or straight from a photograph

An algorithm deduced the sexuality of people on a dating site with up to 91% accuracy, raising tricky ethical questions.

There are other wide-ranging applications of this particular algorithm: the author believes it could potentially predict IQ and other Minority Report-like things.


Fast Track Apache Spark: 6 lessons learned to get a quick start on productivity

More data scientists should be setting up and playing with Spark. Don’t wait for your organization to make a massive investment to start playing around; spin it up locally and hit it with Jupyter.

Good, quick read.


How I failed to replicate an $86 million project in 1 line of code

Ryan Baumann did not agree with the post I linked to last week, “How I replicated an $86 million project in 57 lines of code.” He tried the same experiment himself and showed that open source license plate recognition is not as far along as the previous author seemed to indicate:

Could this project be done for less than $86M? Maybe. Could they use OpenALPR as a starting point? Also maybe. Would it actually reduce the cost? Who knows.


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123