Data Science Roundup #81: Curating the LinkedIn Feed, Trends in Machine Learning, and some Data Eng 😌 😌

Some weeks are quiet, some weeks are awesome.

If you’re actively working in an analytics role, please make sure to read the very first post—it’s the best treatment of data organizational behavior I’ve ever read.

- Tristan

Referred by a friend? Sign up here!

Two Posts You Can't Miss

How to Battle the "Data Wheel of Death"

I frequently dislike posts about the organizational dynamics of data: they’re often so fluffy that they’re just not actionable. This post is the exception. Brian Balfour, the ex-VP Growth @ HubSpot, does an incredible job outlining several all-too-common mistakes that companies make when institutionalizing analytics.

As always, here’s the TL;DR:

  • Project Mindset vs Process Mindset (+1000)

  • Misalignment of Incentives

  • Data team becomes the bottleneck

  • Brilliant Answers, Useless Questions

Having implemented analytics at over a dozen companies at this point, I can tell you that this stuff is real (and it’s probably going on in your company right now).


Strategies for Keeping the LinkedIn Feed Relevant

After a year in which Facebook has been highly criticized for its curation practices, LinkedIn seems to be taking a different approach. It even gets in a satisfying jab at its much-larger sibling:

What may pass as acceptable content on a general social network may not be a pleasing experience for a professional social network like LinkedIn.

LinkedIn incorporates both user reporting and employee review of those reports for its user-generated content. The post goes into detail about the remedies it metes out:

We have heavily invested in our platforms to allow for a spectrum of treatments, including: demoting content in the feed ranking; restricting content to the immediate neighborhood of the poster; limiting places where content can surface (e.g., it can be seen on a member’s profile page but not on the feed); making it undiscoverable on the whole site; and, in extreme cases, also disabling the poster.

This level of transparency about a core social curation algorithm is very unusual in the space. Encouraging.


This Week's Top Posts

Data Stacks at Facebook, Netflix, Airbnb, and Pinterest

This post presents a collection of event data infrastructure diagrams from the world’s biggest internet companies. Amazing resource; highly recommended.


Jupyter Notebook 5.0

This is the first major release of the Jupyter Notebook since version 4.0 and the “Big Split” of IPython and Jupyter. It introduces some new features and many improvements and bug fixes, totaling about 133 closed issues, 303 merged PRs, and 9 months in the making.

The big three new features:

  • Cell tagging (enabling further extensions)

  • Editable shortcuts

  • Less hideous table styles (phew)

For more details, view the changelog.


Basics of Entity Resolution with Python and Dedupe

Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data.

Deduplication is an incredibly common problem that is still painful to solve. This new package presents an elegant solution.
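To make the problem concrete, here is a toy sketch of the naive approach: compare every pair of records with a string-similarity score and group anything above a cutoff. This is not the Dedupe library's API, and it involves no machine learning. The function names and the 0.85 threshold are hypothetical choices for illustration only, using just the Python standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_duplicates(records, threshold=0.85):
    """Group records whose pairwise similarity meets the threshold.

    Uses a small union-find so that matches chain together
    (if A matches B and B matches C, all three share a cluster).
    The threshold is an arbitrary, hypothetical cutoff.
    """
    parent = list(range(len(records)))

    def find(i):
        # Walk up to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair: O(n^2), which is exactly why this is painful at scale.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record)
    return list(clusters.values())


if __name__ == "__main__":
    print(cluster_duplicates(["Acme Corp", "acme corp", "Globex Inc"]))
```

The quadratic pairwise comparison and the hand-tuned cutoff are the two pain points that make a purpose-built library attractive.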


A Peek at Trends in Machine Learning

Text analysis of 28,303 machine learning papers reveals what researchers at the cutting edge of ML are up to. Most amusing quote:

Geoff Hinton is mentioned in more than 30% of all new papers! That seems like a lot.

That does indeed seem like a lot :)


What Happens When You Hire a Data Scientist Without a Data Engineer

The Data Scientist expected the data pipeline to already be created when they were hired. The company and managers are expecting the Data Scientist to create the data pipeline. When I’ve encountered this issue, the Data Scientist has been idle for 2-6 months. After about 6 months they’ll quit.

All too common. Don’t do this.


Promoting Positive Climate Change Conversations via Twitter

This is a fascinating, compelling analysis. The author uses network analysis to define clusters and sentiment analysis within clusters to identify outliers. The result: identification of specific individuals who are well-positioned to be a bridge between groups with competing ideas.

Must read. Includes code.
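For a rough feel of the idea (this is not the author's code), here is a minimal sketch: treat connected components of an interaction graph as clusters, then flag accounts whose sentiment sits far from their cluster's average. Connected components are a crude stand-in for real community detection, and the sentiment scores, the `gap` cutoff, and all names below are assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def connected_components(edges):
    """Return a list of node sets, one per connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


def sentiment_outliers(edges, sentiment, gap=0.5):
    """Accounts whose sentiment differs from their cluster mean by more than `gap`.

    `sentiment` maps account -> score in [-1, 1]; scores are assumed to come
    from some upstream sentiment model that is not shown here.
    """
    outliers = []
    for component in connected_components(edges):
        average = mean(sentiment[user] for user in component)
        outliers.extend(
            user for user in component if abs(sentiment[user] - average) > gap
        )
    return sorted(outliers)


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
    scores = {"a": -0.8, "b": -0.7, "c": -0.9, "d": 0.6, "x": 0.5, "y": 0.6}
    print(sentiment_outliers(edges, scores))
```

In this made-up data, account "d" is embedded in a mostly negative cluster but posts positively, which is the kind of potential bridge the analysis looks for.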


Data viz of the week

Super-cool interactive. Click through to run your own ride-share network.

Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123