Pizza. Clustering algorithms. An Overview of Cloud ML Options. Regularization. [DSR #122]

❤️ Want to support us? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

The 5 Clustering Algorithms Data Scientists Need to Know

You’re definitely familiar with k-means, but how about mean-shift and DBSCAN? The best clustering algorithm for your use case depends on the shape of your data. This article does a great job of walking through the different algorithms and ends with a really wonderful visualization summarizing the tradeoffs.

Clustering is a common task; make sure you’re making the right choices when you apply a clustering algorithm.
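
To make the "shape of your data" point concrete, here is a small dependency-free sketch that is not taken from the article. It builds two concentric rings (toy data I made up) and runs a bare-bones DBSCAN and a bare-bones k-means on them. The function names, the eps and min_pts values, and the starting centroids are all assumptions chosen for this toy case; in practice you would more likely reach for scikit-learn's KMeans and DBSCAN.

```python
import math

def ring(radius, n):
    """Return n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 means noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighborhood query (includes the point itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not dense enough; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

def kmeans(points, centroids, iterations=20):
    """Minimal k-means (Lloyd's algorithm) from fixed starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels

inner = ring(1.0, 40)    # small ring
outer = ring(5.0, 100)   # large ring around it
points = inner + outer

db_labels = dbscan(points, eps=0.5, min_pts=3)
km_labels = kmeans(points, centroids=[points[0], points[-1]])

print("DBSCAN labels, inner ring: ", sorted(set(db_labels[:40])))
print("DBSCAN labels, outer ring: ", sorted(set(db_labels[40:])))
print("k-means labels, outer ring:", sorted(set(km_labels[40:])))
```

On this toy data the density-based method keeps each ring together, while k-means, which carves space up around centroids, cuts the outer ring in two. That says nothing about which one is better in general; it only shows why the geometry of the data matters when you pick.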


Are You Setting Your Data Scientists Up to Fail?

There are far too few good data scientists out there, and they command high salaries. They may very well be the key to more-efficient operations, new customer insights, and revenue growth. Invest time into getting them in the right spots and managing them properly.

This article might be a useful one to share within your organization; your managers and executives could likely benefit. Here’s my favorite quote, and a thought that’s counterintuitive for managers who don’t typically work with researchers:

Encourage data scientists to “pull on a thread.” Serendipity lies at the root of many great discoveries, and data scientists, since they see so much data, are uniquely positioned to find these threads. Ask data scientists what else they’re learning, what surprised them, and what just doesn’t look right. Most threads won’t lead anywhere, but help data scientists develop a knack for identifying those with the most potential and the courage to follow up.


What is the Pizza Capital of the US?

This is completely useless but definitely fun. The rankings come from a count of restaurant visits, not a count of restaurants, which makes them more reflective of local preferences.


Comparing Machine Learning as a Service: Amazon, Microsoft Azure, Google Cloud AI

A complete and unbiased comparison of the three most common Cloud Technologies for Machine Learning as a Service.

This is a surprisingly in-depth review of Amazon’s, Google’s, and Microsoft’s offerings in cloud ML. It goes through core ML tasks like classification and regression and continues into specific services in speech, text, image, and video processing.

I don’t have direct experience deploying solutions like Amazon’s Polly in production applications, and I’m somewhat conflicted about this strategy. I’m very comfortable outsourcing compute and storage to a cloud provider, but outsourcing core algorithms to a black-box service worries me as a strategy for anything that’s going to be core to a product. There’s nothing fundamentally preventing you from switching from S3 to GCS, but can you really switch speech-to-text providers in the same way? If not, you subject yourself to serious vendor lock-in in a core part of your application.
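
One partial hedge against that kind of lock-in is to keep vendor calls behind a small interface of your own. The sketch below is purely illustrative: SpeechToText, VendorATranscriber, VendorBTranscriber, and caption_clip are hypothetical names I made up, and the two "vendors" are stand-ins that do not call any real SDK.

```python
from typing import Protocol

class SpeechToText(Protocol):
    """Hypothetical interface the rest of the application depends on."""
    def transcribe(self, audio: bytes) -> str: ...

class VendorATranscriber:
    """Stand-in for a wrapper around one provider's SDK (no real calls)."""
    def transcribe(self, audio: bytes) -> str:
        return f"[vendor-a] {len(audio)} bytes transcribed"

class VendorBTranscriber:
    """Stand-in for a wrapper around a second provider's SDK (no real calls)."""
    def transcribe(self, audio: bytes) -> str:
        return f"[vendor-b] {len(audio)} bytes transcribed"

def caption_clip(stt: SpeechToText, audio: bytes) -> str:
    """Application code sees only the interface, never a vendor class."""
    return stt.transcribe(audio).upper()

print(caption_clip(VendorATranscriber(), b"abc"))
print(caption_clip(VendorBTranscriber(), b"abc"))
```

This only isolates the call site. It does nothing about differences in accuracy, supported languages, or pricing between providers, which is where the real switching cost is likely to sit.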

I’m curious to hear any thoughts you have on this topic.


Introduction to AWS for Data Scientists

In this post, we give an overview of useful AWS services for data scientists — what they are, why they’re useful, and how much they cost.

If you’re practicing in the field and don’t have a working knowledge of AWS (or the other major cloud services), you’re significantly limiting your impact. This article is a good starting point.


Avoid Overfitting with Regularization

File this under: things you should definitely know.

There must be something automatic to tell us which degree will fit the data and tell us which features to penalize to get the best predictions for unseen data. This is regularization. Regularization helps us to select the model complexity to fit the data. It is useful to automatically penalize features that make the model too complex.

If you’re not familiar with regularization, this is a must-read. And if you want to know how to implement it in Python, this is a solid overview.
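
For a feel of what the penalty does, here is a minimal sketch that is not from either article: L2 (ridge) regularization on a single feature with no intercept, where the closed-form slope is sum(x*y) / (sum(x*x) + lambda). The data points and the lambda values are made up for illustration, and ridge_slope is a hypothetical helper name. For real work, scikit-learn's Ridge and Lasso expose the same idea through their alpha parameter.

```python
def ridge_slope(xs, ys, lam):
    """Slope minimizing sum((y - w*x)**2) + lam * w**2 (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Toy data lying roughly on y = 2x (values invented for this sketch).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 is ordinary least squares; larger lam shrinks the slope toward 0.
slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 10.0, 100.0)}
for lam, w in slopes.items():
    print(f"lambda={lam:>5}: slope={w:.3f}")
```

The larger the penalty, the more the coefficient is pulled toward zero. With many features or high-degree polynomial terms, that same pull is what keeps a model from chasing noise in the training data.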


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123