Data Science Roundup #65: Slack's Data Infrastructure, the (Human) History of AI, & more!
Data infrastructure @ Slack and Blue Apron. Andrew Ng gives a master class on building deep learning that works. Open data from Stack Overflow. The very human history of AI. And: introducing type safety to statistical programming?
If you’ve been sent this newsletter by a friend, do me a favor and sign up. It’s your subscriptions that keep The Data Science Roundup growing!
Thanks 😁 😁
PS: We’re hiring!
Focus on: Data Infrastructure in the Wild
Two posts on practical applications: world-class data teams solving hard problems. Highly recommended.
If you stop and think about the number of Slack conversations you have personally participated in and then multiply by, oh, 100 million or so users, you start to get a sense of the scale problem that the data team at Slack faces. Their solutions? S3, Kafka, Presto, Hive, and Spark, all reading and writing Parquet. To me, this reads as an engineering-heavy and open-source-focused stack; the post goes into some of the (non-trivial) challenges they had in making this work.
Exercise for the reader: compare and contrast with the Blue Apron experience below.
Blue Apron uses BigQuery as the core of their data infrastructure, piping data from Kafka and Postgres in via Airflow jobs. They optimize their tables via date partitioning while letting BigQuery handle schema updates. This post uses words like “effortless” and “confident”.
While the Slack infrastructure still represents the dominant approach at savvy companies, it’s notable how much more straightforward Blue Apron’s BigQuery-based stack seems to be, and how much scalability it offers.
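For readers who haven't worked with date partitioning, here is a minimal sketch of the idea in plain Python. It is a toy in-memory model, not BigQuery's API or Blue Apron's setup: the `DatePartitionedTable` class, its methods, and the sample rows are all hypothetical. It shows why a query filtered by date only needs to read the partitions for the matching days.

```python
from collections import defaultdict
from datetime import date


class DatePartitionedTable:
    """Toy in-memory table that keeps rows in one bucket per calendar day."""

    def __init__(self):
        # Maps a day to the list of rows stored for that day.
        self.partitions = defaultdict(list)

    def insert(self, day, row):
        self.partitions[day].append(row)

    def scan(self, start, end):
        """Return (rows, partitions_read) for days in [start, end]."""
        rows, partitions_read = [], 0
        for day in sorted(self.partitions):
            if start <= day <= end:
                # Only partitions inside the date range have their rows read.
                partitions_read += 1
                rows.extend(self.partitions[day])
        return rows, partitions_read


# Hypothetical sample data: four orders spread over three days.
table = DatePartitionedTable()
table.insert(date(2016, 12, 1), {"order_id": 1})
table.insert(date(2016, 12, 2), {"order_id": 2})
table.insert(date(2016, 12, 2), {"order_id": 3})
table.insert(date(2016, 12, 3), {"order_id": 4})

rows, partitions_read = table.scan(date(2016, 12, 2), date(2016, 12, 2))
print(len(rows), partitions_read)
```

A one-day query here reads a single partition and skips the rows stored for the other two days, which is the cost saving partitioning is meant to buy at scale.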
This week's best data science articles
Andrew Ng just delivered a master class on applied deep learning at NIPS 2016:
Ng highlighted the fact that while NIPS is a research conference, many of the newly generated ideas are simply ideas, not yet battle-tested vehicles for converting mathematical acumen into dollars. The bread and butter of money-making deep learning is supervised learning, with recurrent neural networks such as LSTMs in second place. Research areas such as Generative Adversarial Networks (GANs), Deep Reinforcement Learning (Deep RL), and just about anything branding itself as unsupervised learning, are simply Research, with a capital R.
The talk then goes on to give concrete recommendations on how to tune supervised deep learning models, recommending a focus on fundamentals.
If you’re going to use any of your coming holiday time off to develop your data science skills, now you can do it on top of data from Stack Exchange by querying the data directly in BigQuery. Whether you’re looking to improve your SQL, ML, or visualization skills, there’s plenty of meat in this dataset to work with. Here’s some example analysis in R to get you thinking.
“How Google used artificial intelligence to transform Google Translate, one of its more popular services — and how machine learning is poised to reinvent computing itself.”
This piece appeared in a recent NYTimes Magazine. It is a long, engaging telling of the birth of AI, and it focuses much more on the humans involved than any telling I had previously read. We all end up reading so much about AI itself; I found it fascinating to read the very human story of its genesis.
What if survey methodologies, or experimental designs, were represented in programming languages as data types, and constrained the application of subsequent logic?
This post feels a bit ephemeral for most of its length, but makes some proposals at the end that I found deeply interesting. Is it possible that we need to encode our understanding of statistical validity more directly into our statistical programming languages?
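To make the question concrete, here is a rough sketch of what it could look like in Python. This is my own illustration, not the proposal from the linked post: the `SimpleRandomSample` and `StratifiedSample` types, the `estimate_mean` function, and the numbers are all hypothetical. The point is that the sampling design travels with the data as a type, and the estimator refuses to run on data whose design it doesn't know.

```python
from dataclasses import dataclass
from functools import singledispatch
from typing import Dict, List


@dataclass(frozen=True)
class SimpleRandomSample:
    values: List[float]


@dataclass(frozen=True)
class StratifiedSample:
    strata: Dict[str, List[float]]       # observations per stratum
    population_shares: Dict[str, float]  # each stratum's share of the population


@singledispatch
def estimate_mean(sample) -> float:
    # Fallback: data with no declared design gets no estimator at all.
    raise TypeError(f"no valid mean estimator for {type(sample).__name__}")


@estimate_mean.register(SimpleRandomSample)
def _(sample) -> float:
    # Under simple random sampling, the plain sample mean is appropriate.
    return sum(sample.values) / len(sample.values)


@estimate_mean.register(StratifiedSample)
def _(sample) -> float:
    # Weight each stratum mean by that stratum's share of the population.
    return sum(
        sample.population_shares[name] * sum(obs) / len(obs)
        for name, obs in sample.strata.items()
    )


srs = SimpleRandomSample([1.0, 2.0, 3.0])
strat = StratifiedSample(
    strata={"a": [1.0, 3.0], "b": [10.0]},
    population_shares={"a": 0.5, "b": 0.5},
)
print(estimate_mean(srs), estimate_mean(strat))
```

Pooling the stratified observations and taking a naive average would give roughly 4.67 rather than the weighted 6.0, and passing a bare list raises a `TypeError`. Python only enforces this at run time; a language with a stronger static type system could reject the mistake before the analysis ever runs.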
Data viz of the week
"A Week of Laughter" from Dear Data, a project in data as art. Beautiful.
Thanks to our sponsors!
Fishtown Analytics works with venture-funded startups to implement Redshift, BigQuery, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.