Data Science Roundup #81: Curating the LinkedIn Feed, Trends in Machine Learning, and some Data Eng 😌 😌
Some weeks are quiet, some weeks are awesome.
If you’re actively working in an analytics role, please make sure to read the very first post—it’s the best treatment of data organizational behavior I’ve ever read.
Two Posts You Can't Miss
I frequently dislike posts about the organizational dynamics of data: they’re often so fluffy that they’re just not actionable. This post is the exception. Brian Balfour, the ex-VP Growth @ HubSpot, does an incredible job outlining several all-too-common mistakes that companies make when institutionalizing analytics.
As always, here’s the TL;DR:
Project Mindset vs Process Mindset (+1000)
Misalignment of Incentives
Data team becomes the bottleneck
Brilliant Answers, Useless Questions
Having implemented analytics at over a dozen companies at this point, I can tell you that this stuff is real (and it’s probably going on in your company right now).
After a year in which Facebook has been heavily criticized for its curation practices, LinkedIn seems to be taking a different approach. It even gets in a satisfying jab at its much larger sibling:
What may pass as acceptable content on a general social network may not be a pleasing experience for a professional social network like LinkedIn.
LinkedIn combines user reporting with employee review of those reports to moderate its user-generated content, and the post goes into detail about the remedies it metes out:
We have heavily invested in our platforms to allow for a spectrum of treatments, including: demoting content in the feed ranking; restricting content to the immediate neighborhood of the poster; limiting places where content can surface (e.g., it can be seen on a member’s profile page but not on the feed); making it undiscoverable on the whole site; and, in extreme cases, also disabling the poster.
This level of transparency about a core social curation algorithm is very unusual in the space. Encouraging.
This Week's Top Posts
This post presents a collection of event data infrastructure diagrams from the world’s biggest internet companies. Amazing resource; highly recommended.
This is the first major release of the Jupyter Notebook since version 4.0 and the “Big Split” of IPython and Jupyter. It introduces some new features and many improvements and bug fixes, totaling about 133 closed issues, 303 merged PRs, and 9 months in the making.
The big three new features:
Cell tagging (enabling further extensions)
Less hideous table styles (phew)
For more details, view the changelog.
Dedupe is a library that uses machine learning to perform deduplication and entity resolution quickly on structured data.
Deduplication is an incredibly common problem that is still painful to solve. This new package presents an elegant solution.
Text analysis of 28,303 machine learning papers reveals what researchers at the cutting edge of ML are up to. Most amusing quote:
Geoff Hinton is mentioned in more than 30% of all new papers! That seems like a lot.
That does indeed seem like a lot :)
The Data Scientist expected the data pipeline to already be created when they were hired. The company and managers are expecting the Data Scientist to create the data pipeline. When I’ve encountered this issue, the Data Scientist has been idle for 2-6 months. After about 6 months they’ll quit.
All-too-common. Don’t do this.
This is a fascinating, compelling analysis. The author uses network analysis to define clusters and sentiment analysis within clusters to identify outliers. The result: identification of specific individuals who are well-positioned to be a bridge between groups with competing ideas.
Must read. Includes code.
Data viz of the week
Super-cool interactive. Click through to run your own ride-share network.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.