Data Science Roundup #77: Artificial Agents Create Their Own Language, Annotated Audio Data, & more!

Happy Sunday! Thanks, as always, for reading.

If you enjoy reading the Data Science Roundup, I’d appreciate it if you could forward this email to three friends. It’s your referrals that keep us growing! 🙏🙏

- Tristan

Referred by a friend? Sign up here!

Two Posts You Can't Miss

Learning to Communicate

OpenAI agents invented a language from scratch:

Our approach yields agents that invent a (simple!) language which is grounded and compositional. Grounded means that words in a language are tied to something directly experienced by a speaker in their environment, for example, a speaker forming an association between the word “tree” and images or experiences of trees. Compositional means that speakers can assemble multiple words into a sentence to represent a specific idea, such as getting another agent to go to a specific location.

Must read.


An Upgrade to SyntaxNet

Google just released a new version of SyntaxNet, incorporating the results of over a year of NLP research. Consider the following sentence: “The gostak distims the doshes.”

This sentence was originally coined by Andrew Ingraham, who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak.” Systematic patterns in morphology and syntax allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distims’ is the third person singular of the verb distim. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.


This Week's Top Posts

Lynchburg, Virginia: The Most Typical City in America

I crunched the numbers on eight measures of 917 cities to learn what constitutes a typical city in America. Here’s what I found.

An almost surprisingly interesting post, given what a common dataset the author is working with. Great reminder of how important storytelling is.


The 7 Types of Data Scientists

“Data scientist” is certainly a term that takes its fair share of criticism. My main problem with the term is that it is actually too broad: the variance in skillset for someone with a data scientist title is incredibly high.

Companies in the market for data science talent should think long and hard about which of these profiles they’re actually looking for.


50 Companies Leading The AI Revolution

There are many, many startups today incorporating AI into their products and services. This article presents 50 of the largest / most well-funded, and the list is well worth a look.


AudioSet: A Large-Scale Dataset of Manually Annotated Audio Events

AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.

By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.

Google is on a roll: this dataset could potentially be as important as ImageNet.

Deep Learning: Tips and Tricks for the Practitioner

This post by data scientist Nikolas Markou is making the rounds right now. In it, he presents an exhaustive bulleted list of detailed recommendations for how to tune a neural network. All substance, no fluff.


Learning AI if You Suck at Math: Tensors

Have you ever been asked “What exactly is a tensor?” and wished you had a more coherent answer? If so, this post is for you.


When Americans Lost Their Virginity

tl;dr: 18


Data viz of the week

The data for the visualization below comes from an analysis of 770,000 tubes of saliva. It’s hard to get a sense from the embedded version, but there are some great stories played out in the details. Click through to see the larger version.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Growth

Fishtown Analytics works with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123