3 Ways to Screw Yourself With ML, Data PMs, & More! [Data Science Roundup #94]
Several links this week to Towards Data Science—they’re doing great work, check them out.
❤️ Want to support us? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
A Post You Can't Miss
This is one of my favorite articles on ML ever. There are so many walkthroughs on how to throw a bunch of code together that roughly accomplishes a goal; there are far fewer guides on how not to screw it up (that requires knowledge of the Real World). Without that knowledge, you end up here:
You end up with the project where the metrics randomly jump up or down, do not reflect the actual quality, and you are not able to improve them. The only way out would be to rewrite the entire project from the scratch. That is when you know — you shot yourself in the foot with a bazooka.
The post touches on three particularly common areas of technical debt: feedback loops, correction cascades, and “hobo features”.
This Week's Top Posts
“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’” - Isaac Asimov
I’m always surprised at how many people ask this question. This short, straightforward post does an amazing job of explaining the analytic process.
The more time we at Fishtown Analytics spend on data science, the more interested I get in all of the non-algorithmic parts of the process. This just-released post summarizes them incredibly well:
Building and optimizing the predictor is easy. What is hard is finding the business problem and the KPI that it will improve, hunting and transforming the data into digestible instances, defining the steps of the workflow, putting it into production, and organizing the model maintenance and regular update.
So good. Every product is a data product, and classical UX skills need to be augmented (quickly) with much deeper data skills in our PMs.
Working with data at the core of a product requires a level of understanding of data modeling, data infrastructure, and statistical and machine learning. It goes beyond understanding the results of experiments and reading dashboards — it requires a deep appreciation for what is possible and what will soon be possible by taking full advantage of the flow of data.
This is an impressively exhaustive directory of the different components of the data engineering stack. Click on the different parts of the stack to engage with it.
I found this useful for finding and exploring gaps in my own knowledge.
I avoid most “X Things” posts; this one is worth reading. I’ve made a couple of these mistakes myself…
The author has built an impressive set of benchmarks comparing Theano, TensorFlow, and CNTK, running on three different GPUs. His summary:
The accuracies of Theano, TensorFlow and CNTK backends are similar across all benchmark tests, while speeds vary a lot.
This is relevant if you’re making production decisions today, but potentially even more useful for following the evolution of the space. In other high-level languages, the broad trend is to sacrifice execution efficiency for programmer efficiency. With the intense computational needs of deep learning, it’s not clear that things will play out the same way.
This is insanely cool. The push for open data in government has been happening for a while, with plenty of cool results. Frequently, though, CSV datasets reside on dingy webservers waiting for an analyst to download them with an R script. This is certainly better than nothing, but it’s far from open data living up to its true potential. Open data needs to be active, to be integrated into our lives, for that to happen.
This If This Then That project does just that: it allows people to turn open datasets into active tools. This is just a glimpse into the beginning of a very big trend.
Data viz of the week
Every solar eclipse in your lifetime. Uniquely suitable.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.