Google Cloud Next. Learning Curves. Plotly Express. XGBoost. Snowpipe. [DSR #182]

❤️ Want to support this project? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

This week's best data science articles

Google Launches an End-to-End AI platform

As expected, Google used the second day of its annual Cloud Next conference to shine a spotlight on its AI tools. The company made a dizzying number of announcements today, but at the core of all of these new tools and services is the company’s plan to democratize AI and machine learning with pre-built models and easier to use services, while also giving more advanced developers the tools to build their own custom models.

So many announcements at Cloud Next. I’m not sure I felt a burning desire for most of these products; my guess is that lots of the demand here is coming from the enterprise where there is a premium placed on packaging up existing open source solutions. Regardless, it’s very important to stay aware of what’s available from the hyperscale cloud providers, highly recommended reading.


A Gentle Introduction to Learning Curves

A Gentle Introduction to Learning Curves

(…) a learning curve is a plot that shows time or experience on the x-axis and learning or improvement on the y-axis.

The above learning curve, for example, shows a model that has been overfit: the model continues to better fit the training dataset while showing no improvements on validation. This comes through very clearly in the plot and demonstrates the power of learning curves as a tool for ML diagnosis.

Simple, powerful.


Introducing Plotly Express

Our main goal with Plotly Express was to make it easier to use for exploration and rapid iteration.

Plotly was already, IMO, the most flexible / powerful charting library out there. If you had used its Python API, though, you know that it was rather arduous. It helped you build the perfect visualization, but it was too verbose to iterate quickly between a host of options. As a practitioner, you likely find yourself doing quite a lot of the latter.

Take a look at the examples in the post. Same powerful charting framework, but now it plugs directly into a tidy dataframe and allows configuration in a mechanism that you’ll be familiar with if you’ve used ggplot2 or Seaborn. Huge, useful improvement.


The Third Wave Data Scientist

The continuing evolution of the definition of a data scientist is a worthy topic, if rather overwrought. This post comes the closest to my personal perspective on the topic. It’s a short read, but I mostly include it here for the graphic (above). Probably useful in some of your slide decks ;)


An Intro to the XGBoost Algorithm

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now.

If you’re not familiar with XGBoost, this is a great intro to a state-of-the-art algorithm. The article itself is an overview and has plenty of links to go deeper.


Setting Up a Data Pipeline Using Snowflake’s Snowpipes

One of Snowflake’s most powerful and user-friendly features is Snowpipe, its mechanism for loading data. It’s surprisingly powerful and surprisingly easy to configure, and once everything is running, it’s surprisingly cheap! Compared to the functionality provided on both Redshift and BQ for data loading, Snowpipe is, in my experience, both easier to use and more powerful.

This post (by good friend of the dbt community Claus Herther) goes through Snowpipe setup from start to finish. If you have a Snowflake account or are evaluating using it, check this out.


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

Tweet Share

If you don't want these updates anymore, please unsubscribe here.

If you were forwarded this newsletter and you like it, you can subscribe here.

Powered by Revue

915 Spring Garden St., Suite 500, Philadelphia, PA 19123