This was a big week! So much good stuff I couldn’t narrow it down to 6. Enjoy :)
New Software Releases: Don't Miss
Since its release, TensorFlow has been used in over 6,000 open source projects. Faster, more flexible, more stable, TF 1.0 will only accelerate usage.
If you only read one post about TensorFlow, this is the one. There’s tons of info and links about the progress of one of the most critical packages in deep learning.
Most data tech in production is pull-based: you have to go looking for an answer. Notifications and stream-based analysis are topics with a lot of interest, but significantly less deployment. With Airbnb having made this investment, hopefully many more companies will have the leverage they need to get serious about real-time.
Highly recommended if you are (or will be) considering a project in this area.
Data Science Articles
If you plug and play ML models without understanding the math under the hood, you’ll make really meaningful mistakes. Choose the wrong algorithm. Choose the wrong hyperparameters. Underfit. Overfit. Mis-estimate your confidence intervals. Pain and suffering will ensue.
It’s dangerous to go alone! Take this.
Anomaly detection is a technique used to identify unusual patterns, called outliers, that do not conform to expected behavior. It has many applications, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.
This overview will cover several methods of detecting anomalies, as well as how to build a detector using a simple moving average (SMA) or a low-pass filter.
I did some fun anomaly detection this past week—detecting website traffic anomalies caused by TV advertising. This stuff is fun.
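If you want a feel for the SMA idea before clicking through, here is a minimal sketch. It is not the article's code: the function name, the window size, the threshold, and the sample traffic numbers are all assumptions made up for illustration. Each point is compared against the average of the points just before it, and flagged if it sits too many standard deviations away.

```python
from statistics import mean, stdev


def sma_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that stray far from the trailing moving average.

    `window` and `threshold` are illustrative defaults, not tuned values.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]   # the `window` points before index i
        avg = mean(history)              # simple moving average
        spread = stdev(history)          # sample standard deviation of the window
        if spread == 0:
            # A perfectly flat window: any change at all counts as unusual.
            if values[i] != avg:
                anomalies.append(i)
            continue
        if abs(values[i] - avg) > threshold * spread:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute visit counts with one obvious spike.
traffic = [100, 102, 98, 101, 99, 100, 103, 250, 101, 100]
print(sma_anomalies(traffic))
```

One design note: the window only looks backward, so a spike inflates the average and spread for the next few points and can mask a second anomaly right behind it. That trade-off is worth keeping in mind if you try this on real traffic.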
This post is awesome. If you still haven’t made the jump from Excel to R in your day-to-day, read this. It highlights why the jump is actually quite hard to make, and what the rewards are once you’ve made it.
There are hordes of Excel users out there; I’m fascinated by the problem of getting these users to learn and use more sophisticated tech.
Do you know what stemming and lemmatization are? No? You may not have had to tackle any NLP yet, but there’s no way you’ll be able to stay away from it for long. There’s just too much text out there. This is a solid intro to familiarize you with the key concepts.
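For the impatient, a toy sketch of the difference. Both functions below are deliberately simplistic and hypothetical: the suffix list and the lookup table are invented for this example. Real projects would more likely reach for something like NLTK or spaCy. The point is only that stemming chops suffixes by rule (and may return a non-word), while lemmatization maps a word to its dictionary form.

```python
# Toy rules, invented for illustration. Longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "es", "s")

# A tiny hand-made lookup table standing in for a real dictionary.
TOY_LEMMAS = {
    "was": "be",
    "were": "be",
    "better": "good",
    "mice": "mouse",
    "studies": "study",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the table; fall back to the lowercased word."""
    word = word.lower()
    return TOY_LEMMAS.get(word, word)


for w in ["studies", "jumped", "mice", "were"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Notice that the stemmer turns "studies" into "studi", which is not a word, while the lemmatizer returns "study". The stemmer also has no idea that "mice" and "mouse" are related.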
Quora recently released its first public dataset. It includes 404,351 question pairs with a label column indicating whether they are duplicates or not. In this post, I'd like to investigate this dataset and at least propose a baseline method with deep learning.
This is an impressive deep dive: the post walks through the entire analysis, including plenty of narration on strengths and weaknesses of their approach.
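Before reaching for deep learning on a task like this, it can help to have a dumb baseline to beat. The sketch below is not the post's method: it is a plain word-overlap (Jaccard) score with a threshold, and the function names, the 0.5 cutoff, and the example questions are all assumptions for illustration, not rows from the Quora dataset.

```python
import re


def tokens(question):
    """Lowercase the question and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))


def jaccard(q1, q2):
    """Shared words divided by total distinct words across both questions."""
    a, b = tokens(q1), tokens(q2)
    if not a and not b:
        return 0.0  # two empty questions: treat as no evidence of a match
    return len(a & b) / len(a | b)


def predict_duplicate(q1, q2, threshold=0.5):
    """Call a pair duplicate when word overlap reaches the (arbitrary) threshold."""
    return jaccard(q1, q2) >= threshold


# Made-up question pairs, purely for demonstration.
print(predict_duplicate("How do I learn Python?", "How do I learn python"))
print(predict_duplicate("What is machine learning?", "How do I bake bread?"))
```

A baseline like this ignores word order and meaning entirely, which is exactly why it is useful: any learned model should clearly outperform it on the labeled pairs.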
This post is a bit wordy, but the data is great. If you’re currently evaluating data science programs, take a look at the conclusions.
US Open Data had been a really valuable source of data for citizens and journalists to understand their country. Gone for good? Who knows.
Always good to end on a cheerful note, right?
Data viz of the week
Amazing summary, leaves you with everything you want to know.
Thanks to our sponsors!
Fishtown Analytics works with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123