The Year in AI. Better Career Decisions. Oversimplify. Forgetting User Data. Anonymizing PII in Free Text Fields. [DSR #212]

❤️ Want to support this project? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

This week's best data science articles

The Year in AI: 2019 ML/AI Advances Recap

“Author summarizes 2019 advances in AI.” << this could describe a very large number of posts written in the past month. I’ve become quite allergic to year-end recaps (you likely have as well!), so I decided to include only the very best one. It’s by an author with a ton of industry bona fides who writes clearly and succinctly; there is zero hype.


Make Better Data Career Decisions With The 3 Levels of Data Analysis

Unlike software engineers, product managers, or designers, data team members serve internal users, not external ones. This means that your day-to-day experiences and your learning opportunities depend greatly on the organization you are joining. It means that you cannot assume that all data team roles (at different companies) are equal. And it means — as far as your career is concerned — that you must develop an ability to evaluate the data maturity of organizations before you choose to work for them.

This is absolutely true, and I think that this fact is responsible for quite a lot of data team turnover. Data professionals often are not good at sussing out the organizational dynamics that will determine exactly what their day-to-day will look like. This post provides a solid framework for doing just that.



Oversimplify

I just ran across this; it’s from a couple of years ago, but I had never read it before. The post recommends that data scientists simplify their conclusions almost to an absurd degree. The entire post is fantastic; here’s my favorite part:

Academia and industry have different goals. They are two worlds with different languages and currencies. The currency of academia is reputation, which you lose by being wrong. The currency of industry is currency, which you get by making decisions quickly and with conviction. Industry is the world of Gryffindor rather than Ravenclaw, more Kirk than Spock.

When a company officer asks for an analysis, they don’t usually care about the answer. What they are really asking for is an answer to the question “What should I do?” The data scientist who is capable of bridging the gap from raw data to recommending a course of action is a rare asset. Recommending action is interpreted as a sign of leadership, and tends to be rewarded with raises and promotions. It shows that you are looking past your 37-inch monitor, to the well-being and future of the company. This is deeply reassuring to company leaders, and highly valued.

You’re not being paid to say smart things, you’re being paid to make (or help make) decisions. Decisions are ultimately binary.


Microsoft Presidio: Context aware, pluggable and customizable data protection and anonymization service for text and images

This is cool! It provides functionality to parse unstructured text, identify PII, and filter / anonymize it. With the recent focus on data ownership with laws like CCPA and GDPR, there are lots of companies attempting to minimize their PII surface area, and unstructured text is a big one. A recent client didn’t want to sync Zendesk ticket data into their warehouse because of the unstructured nature of ticket contents and file attachments (not at all unreasonable). This tool could help address concerns like this.

Ideally, off-the-shelf pipeline tools like Stitch and Fivetran would implement functionality like this(!).
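
To make the idea concrete, here is a minimal sketch of scrubbing PII from a free text field before it lands in a warehouse. It is not Presidio's API: it uses only Python's standard library, and the pattern table (PII_PATTERNS) and the anonymize function are hypothetical names made up for illustration. The three regular expressions are deliberately simple and will miss plenty of real-world cases; a purpose-built tool goes well beyond this.

```python
import re

# Hypothetical, deliberately simple patterns; real PII detection needs far more.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def anonymize(text):
    """Replace every match of each pattern with a placeholder such as <EMAIL>."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


ticket = "Reach me at jane.doe@example.com or 555-123-4567."
print(anonymize(ticket))
```

A step like this could run on ticket bodies between extraction and load, so that only the scrubbed text is ever stored.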


“Amnesia” – Towards Machine Learning Models That Can Forget User Data Very Fast

Very timely with CCPA just taking effect…

Software systems that learn from user data with machine learning (ML) techniques have become ubiquitous over the last years. Recent law requires companies and institutions that process personal data to delete user data upon request (enacting the “right to be forgotten”). However, it is not sufficient to merely delete the user data from databases. ML models that have been learnt from the stored data can be considered a lossy compressed version of the data, and therefore it can be argued that the user data must also be removed from them. Typically, this requires an inefficient and costly retraining of the affected ML models from scratch, as well as access to the original training data.

We address this performance issue by formulating the problem of “decrementally” updating trained ML models to “forget” the data of a user, and present efficient decremental update procedures for three popular ML algorithms. In an experimental evaluation on synthetic data, we find that our decremental update is two orders of magnitude faster than retraining the model without a particular user’s data in the majority of cases.
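
A toy illustration of what a decremental update can look like, under my own assumptions rather than the paper's procedures: a one-feature ridge regression with no intercept, where the model is fully determined by two running sums. Forgetting a user then means subtracting that user's contribution from the sums instead of retraining. The class name (RidgeStats), its methods, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class RidgeStats:
    """Sufficient statistics for one-feature ridge regression (no intercept)."""
    sxx: float = 0.0  # running sum of x * x
    sxy: float = 0.0  # running sum of x * y
    lam: float = 1.0  # regularization strength (assumed value)

    def add(self, rows):
        for x, y in rows:
            self.sxx += x * x
            self.sxy += x * y

    def forget(self, rows):
        # Decremental update: subtract the rows' contribution, no retraining.
        for x, y in rows:
            self.sxx -= x * x
            self.sxy -= x * y

    @property
    def weight(self):
        return self.sxy / (self.sxx + self.lam)


# Made-up (x, y) observations per user.
data = {
    "alice": [(1.0, 2.0), (2.0, 4.1)],
    "bob": [(3.0, 5.9), (4.0, 8.2)],
    "carol": [(5.0, 9.8)],
}

model = RidgeStats()
for rows in data.values():
    model.add(rows)
model.forget(data["bob"])  # bob asks to be forgotten

# Reference: retrain from scratch on everyone except bob.
retrained = RidgeStats()
for user, rows in data.items():
    if user != "bob":
        retrained.add(rows)

print(math.isclose(model.weight, retrained.weight, abs_tol=1e-9))
```

The update touches only the forgotten user's rows, while retraining touches everyone else's, which is where the speedup in this kind of approach comes from. Models without such compact sufficient statistics need more elaborate procedures.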



A Cool SQL Problem: Avoiding For-Loops

This post is a neat interview problem, but I like it because it does a great job of teaching a concept that I find myself teaching junior data people very frequently. Here’s the meat of the article:

If you’re using procedural approaches to solve a query-based problem within a relational database, you’re probably doing something wrong!

A procedural approach is a solution that operates line-by-line, object-by-object. For-loops are a quintessential example of a procedural approach: instead of solving in one big operation, the procedural solution treats the problem as a series of small operations to be iterated out.

There are a few common cases where it makes sense to use for-loops in code that interacts with a relational database, such as performing a set of operations on each column from a list of column names, or re-running a model on various sets of training data. But once you venture outside things like this and you’re still using for-loops, be extremely careful.
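
The contrast is easy to see side by side. The sketch below is my own example, not one from the post: it uses Python's built-in sqlite3 module and a hypothetical orders table, and computes per-customer totals first with a for-loop that issues one query per customer, then with a single set-based GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.0), ("ann", 2.5), ("cy", 7.0), ("bob", 1.0)],
)


def totals_with_loop(conn):
    """Procedural: one round trip to the database per customer."""
    customers = [row[0] for row in conn.execute("SELECT DISTINCT customer FROM orders")]
    totals = {}
    for customer in customers:
        (total,) = conn.execute(
            "SELECT SUM(amount) FROM orders WHERE customer = ?", (customer,)
        ).fetchone()
        totals[customer] = total
    return totals


def totals_set_based(conn):
    """Set-based: the database does the whole aggregation in one statement."""
    return dict(conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer"))


print(totals_with_loop(conn) == totals_set_based(conn))
```

Both functions return the same totals, but the loop version runs one query per customer plus one to list them, while the set-based version always runs exactly one.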


Thanks to our sponsors!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.


915 Spring Garden St., Suite 500, Philadelphia, PA 19123