AI and Efficiency. 25 ML Best Practices. Data Quality @ Uber. Beekeeper. The Fragility of ML. [DSR #226]

This week's best data science articles

AI and Efficiency

We’re releasing an analysis showing that since 2012 the amount of compute needed to train a neural net to the same performance on ImageNet classification has been decreasing by a factor of 2 every 16 months. Compared to 2012, it now takes 44 times less compute to train a neural network to the level of AlexNet (by contrast, Moore’s Law would yield an 11x cost improvement over this period). Our results suggest that for AI tasks with high levels of recent investment, algorithmic progress has yielded more gains than classical hardware efficiency.

Emphasis added. This is frequently under-appreciated. The past couple of years have seen a slower pace of massive new benchmark results in image processing, but algorithms have continued to improve along a different dimension: efficiency. This trend is important.
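To sanity-check the numbers in the quote, here is a rough back-of-the-envelope sketch. The 44x and 11x figures come from the excerpt. The 84-month window (roughly 2012 to 2019) is my own assumption, and the function names are purely illustrative.

```python
import math

def implied_doubling_months(total_gain, elapsed_months):
    """Months per 2x improvement implied by a total gain over a period."""
    return elapsed_months / math.log2(total_gain)

def gain_after(elapsed_months, doubling_months):
    """Total improvement factor if efficiency doubles every `doubling_months`."""
    return 2 ** (elapsed_months / doubling_months)

# Assumption (not stated in the excerpt): the comparison spans about 7 years.
ELAPSED_MONTHS = 84

# 44x less compute over the window implies a doubling time of about 15 months,
# in line with the "factor of 2 every 16 months" headline.
algo_doubling = implied_doubling_months(44, ELAPSED_MONTHS)

# 11x over the same window implies a doubling time of about 24 months,
# which is the usual statement of Moore's Law.
moore_doubling = implied_doubling_months(11, ELAPSED_MONTHS)

print(f"algorithmic doubling time: {algo_doubling:.1f} months")
print(f"hardware doubling time:    {moore_doubling:.1f} months")
print(f"gain at exactly 16 months per doubling: {gain_after(ELAPSED_MONTHS, 16):.0f}x")
```

Under that assumed window the two figures hang together: the algorithmic curve doubles roughly every 15 to 16 months, the hardware curve roughly every 24.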


25 ML Best-Practices

This author went back to a classic Google paper from 2015 and updated it for 2020. The original paper was fantastic, and this update is just as good—there’s just so much gold in there. It’s long; hope you’re ready to learn.


Uber: Monitoring Data Quality at Scale with Statistical Modeling

Conventional wisdom says to use some variant of statistical modeling to explain away anomalies in large amounts of data. However, with Uber facilitating 14 million trips per day, the scale of the associated data defies this conventional wisdom. Hosting tens of thousands of tables, it is not possible for us to manually assess the quality of each piece of back-end data in our pipelines.

To this end, we recently launched Uber’s Data Quality Monitor (DQM), a solution that leverages statistical modeling to help us tie together disparate elements of data quality analysis. Based on historical data patterns, DQM automatically locates the most destructive anomalies and alerts data table owners to check the source, but without flagging so many errors that owners become overwhelmed.

Data quality and data catalogs are, IMO, the most interesting areas in data right now. The lack of a good catalog and a good data quality management system becomes a problem for organizations specifically because of the scale of their investment in data. As organizations become more mature and have more existing data assets, these become recurring themes.

This post from Uber is a good walkthrough of a DQM system, what it can do, how it works… Something like this is coming to your team in the not terribly distant future.
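For a feel of what "statistical modeling based on historical data patterns" can mean at its very simplest, here is a minimal sketch. It is not Uber's DQM, which the post describes in far more depth. It is a plain z-score check on a table's daily row count, and the function name, threshold, and sample numbers are all hypothetical.

```python
from statistics import mean, stdev

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` if it sits more than `z_threshold` standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs >= 2 points
    if sigma == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Hypothetical daily row counts for one table over the past week.
row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

print(is_anomalous(row_counts, 1008))  # an ordinary day
print(is_anomalous(row_counts, 400))   # a load that lost most of its rows
```

The threshold is the knob that trades missed problems against alert fatigue, which is the balance the excerpt describes: alert table owners to the worst anomalies without overwhelming them.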


Beekeeper Studio

Free & Open Source SQL editor and database manager for MySQL, Postgres, SQLite, SQL Server, and more.

Hmmm! This project has started to get some traction recently. Its mission:

…to improve technology accessibility by providing a free and open SQL editor and database manager that is full-featured and easy to use.

I’ve always been a little bit surprised that there aren’t better options for OSS SQL clients. It’s such a broadly applicable tool category and the good options have historically been either a) single-client, b) proprietary, or c) unpleasant to use. There has been a lot of momentum around Beekeeper of late (check out the repo) and I’m interested to follow its evolution. As of today it doesn’t support the world of analytical databases, but that’s just a matter of time (and some new adapters).


Our weird behavior during the pandemic is messing with AI models

Machine-learning models trained on normal human behavior are now finding that normal has changed, and some are no longer working as they should.

This…is not at all surprising. The article gives some interesting examples of where models are producing bad predictions, which are worth reading through and thinking about.

What I think is more interesting, though, is what capabilities we lose when we outsource a decision-making process to ML today. Sure, we might gain increased predictive power under a normal range of inputs, and we absolutely gain an increase in granularity and a reduction in decision-making time and cost.

But we also completely lose the ability to think causally. As Judea Pearl has so clearly pointed out in his work of late, modern ML doesn’t think causally. And when the world sees a massive change, the only type of reasoning that can continue to make predictions about the future is causal.

As such, the more we bake ML into the cores of our supply chains, financial flows, etc., the more fragile we make them to systemic shocks. This was a new thought for me.


Algo Hour: A public seminar series from Stitch Fix

Our Algorithms team has a weekly seminar which we call Algo Hour. We’re excited to tell you about a recent change: we will start allowing people to join in on some of these seminars, and we will also publish them on the web for later viewing.

Neat! I watched the first talk, just posted this week, and it’s quite good. It could be worth plugging this into your continual learning process in the coming months as the work-from-home regimen continues.


Thanks to our sponsors!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
