Data Analyst Communities and Hats. OpenAI & PyTorch. Football Dataviz. Open Data License Headaches. [DSR #216]

This week's best data science articles

Data Analysts Need Communities

My most recent post announces the first-ever global dbt community conference, Coalesce, but more importantly it discusses why communities matter for data analysts:

It’s communities that are responsible for the unbelievably fast rate of innovation in the way that software is written. The migration from waterfall to XP to Agile, the migration from bare metal to virtualization to cloud to containerization — each of these implies changes in the way that software is built. To take advantage of these shifts, practitioners need to be constantly evolving, constantly learning, and the field needs to go through this process together.

Communities act as the primary transmission vector for these (and so many other) new software engineering practices. Communities determine what’s exciting and how it gets used, what new practitioners learn, and where investment dollars flow. And in turn, they give participants professional development, a sense of identity, and a way to give back.

If analytics is a subfield of software engineering, then analysts need communities every bit as vibrant as those in software engineering.


The many hats of a data analyst

I have often found that it is difficult to explain to business stakeholders what exactly a data analyst on my analytics team does. And even beyond that, I have sometimes found it difficult to explain to the data analysts themselves why their job is so valuable to the business and what their future career opportunities might look like.

I live in this world all day, every day, so this entire post rings true for me. It also builds nicely on the work that Caitlin Moorman did in her great posts on analyst career ladders.

The one thing I think I disagree with is the sense that data analysts need to do something else to continue their career trajectories: move to bizops, product, data science, or management. This may be true at small startups, but at-scale data teams are building individual contributor career tracks just like software engineers: junior, senior, staff, and principal data analysts. I’m not saying you must advance your career as an IC data analyst, but I believe that you can. As you progress, you work on more business-critical, more challenging problems.


The Big List of Data Science Interview Resources

Conor Dewey @ Squarespace published a well-organized and exhaustive list of resources for folks looking to break into their first position.


License Friction: A Tale of Two Datasets

This is an extremely unusual, but really very interesting, post. Open data has only had meaningful attention for the past ~10 years, and most datasets are published in complete isolation from one another. But integrated datasets provide far more value. This is a story where licensing issues have prevented two open datasets from being used together.

Open source software has an accepted license regime that (while currently going through some churn) has been relatively stable for quite some time. Open data hasn’t had that same maturation process. There is likely going to be a lot of work still to do on this front.



We are standardizing OpenAI’s deep learning framework on PyTorch. In the past, we implemented projects in many frameworks depending on their relative strengths. We’ve now chosen to standardize to make it easier for our team to create and share optimized implementations of our models.

OpenAI may not have quite the heft in the field that it did several years ago, but this is still a big deal in the framework wars. Back in October I linked to the best data on this topic that I’d found, and OpenAI’s shift reinforces the industry’s shift towards PyTorch.


Near-perfect point-goal navigation from 2.5 billion frames of experience

The AI community has a long-term goal of building intelligent machines that interact effectively with the physical world, and a key challenge is teaching these systems to navigate through complex, unfamiliar real-world environments to reach a specified destination — without a preprovided map. We are announcing today that Facebook AI has created a new large-scale distributed reinforcement learning (RL) algorithm called DD-PPO, which has effectively solved the task of point-goal navigation using only an RGB-D camera, GPS, and compass data. Agents trained with DD-PPO (which stands for decentralized distributed proximal policy optimization) achieve nearly 100 percent success in a variety of virtual environments, such as houses and office buildings.

There’s a lot in this post. The navigation accomplishment itself is impressive, and the discussion of the scaling properties of DD-PPO is also interesting. This was a topic area I hadn’t delved into before.


How Much Football Is Even In A Football Broadcast?

Answer: not much.

The broadcast lasted three hours and 15 minutes, but it included 18 separate commercial breaks that in total lasted 43 minutes — not including the halftime break. In sum, the game’s 107 total plays gave us 14 total minutes (and 16 seconds) of football action. In other words, those who settled in to watch the entire NFC championship endured a commercial-to-action ratio of over 3-to-1.

This fact won’t surprise any NFL-watching readers, but I link to this post because of the brilliant custom visualization that FiveThirtyEight designed for it. The information density is impressive.
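
The ratio in the quoted passage can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (a 3 hour 15 minute broadcast, 43 minutes of commercials, 107 plays, 14 minutes 16 seconds of action). The variable names are my own, and the share-of-broadcast and per-play averages are derived numbers, not ones reported in the post.

```python
# Figures quoted from the FiveThirtyEight piece, converted to seconds.
broadcast_s = (3 * 60 + 15) * 60   # 3 h 15 min broadcast
commercial_s = 43 * 60             # 18 breaks, 43 min total (halftime excluded)
action_s = 14 * 60 + 16            # 14 min 16 s of live play
plays = 107

# Derived quantities (not stated in the source, computed here for illustration).
commercial_to_action = commercial_s / action_s
action_share = action_s / broadcast_s
seconds_per_play = action_s / plays

print(f"commercial-to-action ratio: {commercial_to_action:.2f} to 1")
print(f"share of broadcast that is live action: {action_share:.1%}")
print(f"average length of a play: {seconds_per_play:.1f} s")
```

With these inputs the ratio comes out just above 3 to 1, consistent with the "over 3-to-1" claim, and live action accounts for roughly 7 percent of the broadcast.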


Thanks to our sponsors!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
