Data Science Roundup #65: Slack's Data Infrastructure, the (Human) History of AI, & more!

Data infrastructure @ Slack and Blue Apron. Andrew Ng gives a master class on building deep learning that works. Open data from Stack Overflow. The very human history of AI. And: introducing type safety to statistical programming?

If you’ve been sent this newsletter by a friend, do me a favor and sign up. It’s your subscriptions that keep The Data Science Roundup growing!

Thanks 😁 😁

- Tristan

PS: We’re hiring!

Focus on: Data Infrastructure in the Wild

Two posts on practical applications: world-class data teams solving hard problems. Highly recommended.

Data Infrastructure @ Slack

If you stop and think about the number of Slack conversations you have personally participated in and then multiply by, oh, 100 million or so users, you start to get a sense of the scale problem that the data team at Slack faces. Their solutions? S3, Kafka, Presto, Hive, and Spark, all reading and writing Parquet. To me, this reads as an engineering-heavy and open-source-focused stack; the post goes into some of the (non-trivial) challenges they had in making this work.

Exercise for the reader: compare and contrast with the Blue Apron experience below.

slack.engineering

Data Infrastructure @ Blue Apron

Blue Apron uses BigQuery as the core of their data infrastructure, piping data from Kafka and Postgres in via Airflow jobs. They optimize their tables via date partitioning while letting BigQuery handle schema updates. This post uses words like “effortless” and “confident”.

While the Slack infrastructure still represents the dominant approach at savvy companies, it’s notable how much more straightforward Blue Apron’s BigQuery-based stack seems to be, and how much scalability it offers.

bytes.blueapron.com
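
To make the date-partitioning point concrete, here is a minimal sketch in plain Python of why partitioning by day helps: a query with a date filter only reads the partitions in range instead of the whole table. This is an illustration of the general idea, not Blue Apron's code or the BigQuery API; the function names, the `created_at` field, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime


def partition_by_day(events):
    """Group event dicts into per-day partitions keyed by calendar date."""
    partitions = defaultdict(list)
    for event in events:
        partitions[event["created_at"].date()].append(event)
    return dict(partitions)


def scan(partitions, start, end):
    """Return events dated within [start, end] and the number of rows read.

    Only partitions whose key falls inside the range are touched; this
    pruning is what makes date-filtered queries cheap.
    """
    rows_read = 0
    matches = []
    for day, rows in partitions.items():
        if start <= day <= end:
            rows_read += len(rows)
            matches.extend(rows)
    return matches, rows_read


# Hypothetical sample data, not taken from the Blue Apron post.
events = [
    {"order_id": 1, "created_at": datetime(2016, 12, 1, 9, 30)},
    {"order_id": 2, "created_at": datetime(2016, 12, 1, 17, 5)},
    {"order_id": 3, "created_at": datetime(2016, 12, 2, 8, 0)},
    {"order_id": 4, "created_at": datetime(2016, 12, 3, 12, 45)},
]

partitions = partition_by_day(events)
matches, rows_read = scan(partitions, date(2016, 12, 2), date(2016, 12, 3))
print(len(matches), "matches after reading", rows_read, "of", len(events), "rows")
```

In a real warehouse the same effect comes from filtering on the partitioning column, so the engine can skip every partition outside the requested dates.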

This week's best data science articles

Nuts and Bolts of Building Deep Learning Applications

Andrew Ng just delivered a master class on applied deep learning at NIPS 2016:

Ng highlighted the fact that while NIPS is a research conference, many of the newly generated ideas are simply ideas, not yet battle-tested vehicles for converting mathematical acumen into dollars. The bread and butter of money-making deep learning is supervised learning with recurrent neural networks such as LSTMs in second place. Research areas such as Generative Adversarial Networks (GANs), Deep Reinforcement Learning (Deep RL), and just about anything branding itself as unsupervised learning, are simply Research, with a capital R.

The talk then goes on to give concrete recommendations on how to tune supervised deep learning models, recommending a focus on fundamentals.

www.computervisionblog.com

You Can Now Play with Stack Overflow Data on Google’s BigQuery

If you’re going to use any of your coming holiday time off to develop your data science skills, now you can do it on top of data from Stack Exchange by querying the data directly in BigQuery. Whether you’re looking to improve your SQL, ML, or visualization skills, there’s plenty of meat in this dataset to work with. Here’s some example analysis in R to get you thinking.

stackoverflow.blog

The Great A.I. Awakening

“How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself.”

This piece appeared in a recent NYTimes Magazine. It is a long, engaging telling of the birth of AI, and focuses much more on the humans involved than any telling I had previously read. We all end up reading so much about AI itself; I found it fascinating to read the very human story of its genesis.

www.nytimes.com

Type Safety and Statistical Computing

What if survey methodologies, or experimental designs, were represented in programming languages as data types, and constrained the application of subsequent logic?

This post feels a bit ephemeral for most of its length, but makes some proposals at the end that I found deeply interesting. Is it possible that we need to encode our understanding of statistical validity more directly into our statistical programming languages?

www.johnmyleswhite.com
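
As a rough illustration of what "survey designs as data types" could look like, here is a minimal Python sketch. It is not the post's proposal: the class names, the `estimate_mean` function, and the numbers are all hypothetical, and Python only enforces this at run time (a statically typed language could reject the misuse before the code runs). The point is that the sampling design travels with the data and decides which estimator is allowed.

```python
from dataclasses import dataclass
from functools import singledispatch
from statistics import fmean
from typing import Dict, List


@dataclass(frozen=True)
class SimpleRandomSample:
    values: List[float]


@dataclass(frozen=True)
class StratifiedSample:
    strata: Dict[str, List[float]]   # stratum name -> observed values
    weights: Dict[str, float]        # stratum name -> population share


@singledispatch
def estimate_mean(sample) -> float:
    # Data with no declared design gets no estimator at all.
    raise TypeError(f"no estimator defined for {type(sample).__name__}")


@estimate_mean.register
def _(sample: SimpleRandomSample) -> float:
    # Under simple random sampling the plain average is appropriate.
    return fmean(sample.values)


@estimate_mean.register
def _(sample: StratifiedSample) -> float:
    # Under stratified sampling each stratum mean is weighted by its
    # population share rather than by how many rows were collected.
    return sum(
        sample.weights[name] * fmean(values)
        for name, values in sample.strata.items()
    )


srs = SimpleRandomSample([2.0, 4.0, 6.0])
stratified = StratifiedSample(
    strata={"urban": [10.0, 12.0], "rural": [4.0]},
    weights={"urban": 0.25, "rural": 0.75},
)

print(estimate_mean(srs))         # 4.0
print(estimate_mean(stratified))  # 5.75
```

Naively pooling the stratified rows would give about 8.67 instead of 5.75, and passing a bare list to `estimate_mean` raises a `TypeError`; both are the kinds of mistake a design-aware type is meant to rule out.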

Data viz of the week

"A Week of Laughter" from Dear Data, a project in data as art. Beautiful.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

Fishtown Analytics works with venture-funded startups to implement Redshift, BigQuery, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.

fishtownanalytics.com

Stitch: Simple, powerful ETL built for developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.

www.stitchdata.com

By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
