Data Science Roundup #83: OCR @ Dropbox, The Nature of Knowledge, Data Validation & more!

We got back from DataEngConf in San Francisco last week. Congrats to our friends at Hakka Labs for putting on a great event! Highly recommended.

- Tristan


Two Posts You Can't Miss

Our Machines Now Have Knowledge We’ll Never Understand

Fascinating post. The author goes back to the very earliest writing we have on what it means to know something:

Back at the beginning of Western culture’s discovery of knowledge, Plato told us that it’s not enough for a belief to be true…knowledge in the West has consisted of justifiable true beliefs — opinions we hold for a good reason.

Here’s my favorite line in the piece:

The machine-learned way of seeing might be more reflective of how the world actually is than purely human knowledge could ever be.

What if most knowledge isn’t easily reducible to symbolic logic but requires a network with billions of weights to know?

The truth is that, as we begin to discover more about our own neural processes, we actually don’t understand nearly as much about our own mechanisms of reasoning as we had previously believed. Knowledge, even our own knowledge, has always been incomprehensible to us.

Long read, but very worthwhile. Also worth a look: The Myth of a Superhuman AI.


Examining the Arc of 100,000 Stories

It turns out that a large corpus of stories plus some fairly straightforward sentiment analysis can tell us quite a lot about the human preference for narrative structure. The chart below is a rather profound summarization of 112,000 stories: the relative sentiment of words based on where they appear within a story. You can tell exactly when Sam and Frodo enter Mordor and when they throw the Ring into Mount Doom.

I’m fascinated by this not because it tells us something we didn’t know, but because of what amazing choices the author made in his analysis to arrive at such clear conclusions. Aspirational.
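For anyone who wants to play with the idea, here is a minimal sketch of a positional sentiment average in Python. It is not the author's actual method, which isn't described here. The lexicon, the `sentiment_arc` function name, and the bin count are all made up for illustration; a real analysis would use a proper sentiment lexicon and tokenizer.

```python
# Hypothetical toy lexicon: word -> sentiment score (invented for illustration).
LEXICON = {
    "happy": 1.0, "love": 1.0, "hope": 0.5,
    "fear": -0.5, "lost": -1.0, "death": -1.0,
}

def sentiment_arc(stories, bins=4):
    """Average the lexicon score of scored words by relative position in each story.

    Each story is split into `bins` equal-width position buckets; the result is
    one average per bucket, or None if no scored word landed in that bucket.
    """
    totals = [0.0] * bins
    counts = [0] * bins
    for story in stories:
        words = story.lower().split()
        for i, word in enumerate(words):
            if word in LEXICON:
                bucket = i * bins // len(words)  # relative position -> bucket index
                totals[bucket] += LEXICON[word]
                counts[bucket] += 1
    return [totals[b] / counts[b] if counts[b] else None for b in range(bins)]

stories = ["happy love fear death", "hope lost lost happy"]
print(sentiment_arc(stories, bins=4))
```

With enough stories and finer bins, plotting the returned list against position gives the kind of curve shown in the chart.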


Every story ever.


This Week's Top Posts

Dropbox: Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning

From crowdsourcing to convolutional layers to training to production, the Dropbox engineering team outlines every step in their OCR pipeline. Impressive.


Data Validation with the assertr Package

Data validation is a topic I care a lot about and it doesn’t get nearly enough attention. This package has some wonderful validation constructs: even if you don’t spend a lot of time in R, it’s worth reading this piece purely for its approach to scalable data validation.
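If you don't work in R, the general idea of putting checks directly in the data pipeline carries over to other languages. Below is a rough Python sketch of that idea. It is not assertr's API: the `verify` and `within_bounds` helpers and the sample rows are assumptions written for this example.

```python
def verify(rows, predicate, description):
    """Return rows unchanged if every row passes the predicate; otherwise raise.

    Returning the input lets checks be chained as steps in a pipeline.
    """
    failures = [i for i, row in enumerate(rows) if not predicate(row)]
    if failures:
        raise ValueError(
            f"{description}: {len(failures)} row(s) failed, first at index {failures[0]}"
        )
    return rows

def within_bounds(column, low, high):
    """Build a row predicate that checks low <= row[column] <= high."""
    def check(row):
        value = row[column]
        return value is not None and low <= value <= high
    return check

# Hypothetical sample data.
rows = [{"mpg": 21.0, "cyl": 6}, {"mpg": 22.8, "cyl": 4}]

# Each check passes the data through untouched, or stops the pipeline loudly.
checked = verify(rows, within_bounds("mpg", 0, 100), "mpg in [0, 100]")
checked = verify(checked, lambda r: r["cyl"] in {4, 6, 8}, "cyl in {4, 6, 8}")
```

The design choice worth noting is that a failed check stops the run immediately, so bad data never reaches the analysis step silently.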


Why Good Data Scientists Make Good Product Managers

Now that you’re already a data scientist, maybe it’s time to consider moving over to product? The article points out the similarities between the roles:

  • Data scientists and product managers make decisions with data.

  • Data scientists and product managers work cross-functionally.

  • Data scientists and product managers choose an objective function and ruthlessly optimize for it.

Definitely agree. I’d be curious if any readers have made this transition.


Banning exploration in my infovis class

The purpose of exploratory data analysis is the finding, not the exploring.

In perceptual classification, the analyst looks at the data and matches what they see against familiar patterns. In perceptual clustering, the analyst finds groups of similar patterns without necessarily leveraging known patterns.


Subreddit Mapping and Analysis

I covered this article several weeks ago. Now, a new author has picked up the dataset and created an entire mapping of the similarity of the top 10k subreddits. Great map, great walkthrough. Includes code.


The distance between Espresso and Cappuccino

Customer: Espresso? But I ordered a cappuccino!

Robot: Don’t worry, the cosine distance between them is so small, that they are almost the same thing.

I can’t tell if this is funny or not (maybe a little?), but I’m fascinated that people are producing data science humor. I can’t imagine a less funny topic.
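For readers who want the punchline spelled out: cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing in nearly the same direction score close to zero. A small Python sketch follows. The two "embedding" vectors are invented for illustration and do not come from any real model, and the function assumes neither vector is all zeros.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b). Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up vectors standing in for word embeddings.
espresso = [0.9, 0.1, 0.3]
cappuccino = [0.8, 0.3, 0.4]

print(cosine_distance(espresso, cappuccino))
```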


Data viz of the week

More story data, different source! He kidnaps, she screams.


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
