Data Science Roundup #75: Diagnosing Cancer with Deep Learning, Open Source ETL with Singer, and more!
Has anyone deployed Facebook Prophet yet? We’re going to start using it and would love to hear any feedback this community has. I’ll definitely share all feedback with the group!
Two Posts You Can't Miss
Stitch just released a brand new open source platform for ETL called Singer. With Singer, you have access to a large number of “taps” (data extractors) and “targets” (data loaders) that you can run on your own infrastructure free of charge, or you can run in Stitch’s cloud with zero maintenance burden.
At Fishtown Analytics, we recently built two Singer taps for our clients and plan on adopting the Singer platform for 100% of our custom ETL work. Highly recommended if you’re thinking about moving data from one place to another.
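For readers who have not seen the tap/target split before, here is a minimal sketch of the idea in Python. It assumes the Singer convention that a tap writes newline-delimited JSON messages (SCHEMA, RECORD, STATE) that a target then reads; the "users" stream, its fields, and the run_tap / run_target helpers are hypothetical names made up for illustration, not part of any real tap or of Singer's own libraries.

```python
import io
import json


def run_tap(rows, out):
    """Toy tap: write one SCHEMA message, one RECORD per row, then a STATE."""
    schema = {
        "type": "SCHEMA",
        "stream": "users",  # hypothetical stream name
        "schema": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
        "key_properties": ["id"],
    }
    out.write(json.dumps(schema) + "\n")
    for row in rows:
        message = {"type": "RECORD", "stream": "users", "record": row}
        out.write(json.dumps(message) + "\n")
    if rows:
        # STATE is a bookmark so that a later run can resume where this one stopped.
        state = {"type": "STATE", "value": {"last_id": rows[-1]["id"]}}
        out.write(json.dumps(state) + "\n")


def run_target(lines):
    """Toy target: collect records per stream and keep the last state seen."""
    loaded, state = {}, None
    for line in lines:
        msg = json.loads(line)
        if msg["type"] == "RECORD":
            loaded.setdefault(msg["stream"], []).append(msg["record"])
        elif msg["type"] == "STATE":
            state = msg["value"]
    return loaded, state


# An in-memory buffer stands in for the pipe between a tap and a target.
buffer = io.StringIO()
run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}], buffer)
loaded, state = run_target(buffer.getvalue().splitlines())
```

Because the two sides only share a stream of JSON lines, any extractor can in principle be paired with any loader, which is what makes the pieces reusable.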
There are more than a few teams thinking about how to deploy deep learning advances in cancer diagnosis and treatment, but these recent results from the Google Research team stand out:
…the prediction heatmaps produced by the algorithm had improved so much that the localization score (FROC) for the algorithm reached 89%, which significantly exceeded the score of 73% for a pathologist with no time constraint. We were not the only ones to see promising results, as other groups were getting scores as high as 81% with the same dataset. Even more exciting for us was that our model generalized very well, even to images that were acquired from a different hospital using different scanners.
This Week's Top Posts
These cars learned how to drive by themselves. They got feedback on which actions were good and which were bad, with a reward based on their current speed. Powered by a neural network.
Definitely silly, but really freaking cool. Take a minute and play with it.
As we collect more data from the world around us, surveys become relatively less important as real-time methods plus sophisticated analysis get us there faster (and cheaper). Really interesting results.
Want the authoritative history of deep learning? This is it. Just published.
This paper is a review of the evolutionary history of deep learning models. It covers from the genesis of neural networks when associationism modeling of the brain is studied, to the models that dominate the last decade of research in deep learning like convolutional neural networks, deep belief networks, and recurrent neural networks, and extends to popular recent models like variational autoencoder and generative adversarial nets.
Not all PhDs should pursue a career in data science.
Despite the rather awkward title, this is the best advice on this topic that I’ve read.
Time series modeling sits at the core of critical business operations such as supply and demand forecasting and quick-response algorithms like fraud and anomaly detection. Small errors can be costly, so it’s important to know what to expect of different error sources. The trouble is that the usual approach of cross-validation doesn’t work for time series models. The reason is simple: time series data are autocorrelated so it’s not fair to treat all data points as independent and randomly select subsets for training and testing. In this post I’ll go through alternative strategies for understanding the sources and magnitude of error in time series.
Really, really important.
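To make the autocorrelation point concrete, here is a small Python sketch of one commonly used alternative to random cross-validation: rolling-origin evaluation, where every test window comes strictly after its training window. This is not necessarily the strategy the linked post uses; the function names, the toy series, and the "repeat the last value" forecast are all assumptions chosen for illustration.

```python
def rolling_origin_splits(n_obs, initial_train, horizon):
    """Yield (train_idx, test_idx) pairs where the test window follows the train window."""
    start = initial_train
    while start + horizon <= n_obs:
        yield list(range(0, start)), list(range(start, start + horizon))
        start += horizon  # grow the training window by one test window


def naive_forecast_mae(series, initial_train, horizon):
    """Mean absolute error of a 'repeat the last training value' forecast over all folds."""
    errors = []
    for train_idx, test_idx in rolling_origin_splits(len(series), initial_train, horizon):
        last_value = series[train_idx[-1]]
        errors.extend(abs(series[i] - last_value) for i in test_idx)
    if not errors:
        raise ValueError("series is too short for the requested windows")
    return sum(errors) / len(errors)


series = [10, 12, 13, 15, 14, 16, 18, 17, 19, 21]  # made-up data
splits = list(rolling_origin_splits(len(series), initial_train=4, horizon=2))
mae = naive_forecast_mae(series, initial_train=4, horizon=2)
```

No fold ever trains on observations from its own future, which is the property that randomly shuffled folds break.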
NPS data is very useful when combined with the rest of your data in your warehouse! I’ve recently gotten a chance to play with NPS-based customer segmentation and from what I’ve seen, NPS is extremely predictive of future customer behavior. This article is an awesome resource for playing around with NPS data in SQL.
If you’re looking to load NPS data into your warehouse, check out the Wootric Singer Tap.
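As a quick illustration of NPS-based segmentation, here is a Python sketch of the standard calculation: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 through 6), computed per customer segment. The linked article works in SQL; the segment labels and responses below are invented sample data, and the helper names are hypothetical.

```python
from collections import defaultdict


def nps(scores):
    """Net Promoter Score for a list of 0-10 survey responses."""
    if not scores:
        raise ValueError("need at least one response")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def nps_by_segment(responses):
    """Group (segment, score) pairs by segment and compute NPS for each group."""
    grouped = defaultdict(list)
    for segment, score in responses:
        grouped[segment].append(score)
    return {segment: nps(scores) for segment, scores in grouped.items()}


# Invented sample data: (plan, score) pairs.
responses = [
    ("free", 10), ("free", 6), ("free", 8),
    ("paid", 9), ("paid", 10), ("paid", 7), ("paid", 3),
]
result = nps_by_segment(responses)
```

Scores of 7 and 8 count toward the total number of responses but toward neither promoters nor detractors, so a segment full of 8s scores zero.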
This is a really impressive post. If you’re new to the world of notebook-based analytics, this is the best overview of the subject I’ve seen. The post goes quite deep into the history, usage, and ecosystem of notebooks. Useful.
While R’s base graphics library is almost limitlessly flexible when it comes to creating static graphics and data visualizations, new Web-based technologies like d3 and webgl open up new horizons in high-resolution, rescalable and interactive charts.
The lack of interactive, web-based charts has been a major drawback of R visualization for a while, and the gallery examples look impressive.
Data viz of the week
Such an interesting result! Asking the right question is critical.
Thanks to our sponsors!
Fishtown Analytics works with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123