Data Engineering. JupyterLab. Two Great Stitch Fix Posts. Neural Networks + Ethereum? [DSR #125]

The Week's Most Useful Posts

A Beginner’s Guide to Data Engineering — Part II

I recently linked to Part I of this guide, and Part II is even better. It covers designing star schemas, partitioning data (an oft-overlooked topic!), Airflow design patterns, and ends with a top list of best practices.

This is an absolute must-read—probably the single best data eng post I’ve ever read. I can’t add anything; just read it.
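
For readers new to partitioning, here is a minimal sketch of the general idea, not code from the linked guide. It groups rows by a date key and writes each group under its own path, so a rerun for one day only touches that day's partition. The function names, the `events` table, and the `ds` column are hypothetical; the `ds=YYYY-MM-DD` path style follows the common Hive-style naming convention.

```python
from collections import defaultdict
from datetime import date


def partition_path(table: str, ds: date) -> str:
    """Build a Hive-style partition path such as 'events/ds=2018-02-25'."""
    return f"{table}/ds={ds.isoformat()}"


def partition_by_date(rows):
    """Group rows by their 'ds' date so each day can be written on its own."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["ds"]].append(row)
    return dict(partitions)


# Hypothetical sample data: two events on one day, one on the next.
rows = [
    {"ds": date(2018, 2, 24), "user": "a"},
    {"ds": date(2018, 2, 24), "user": "b"},
    {"ds": date(2018, 2, 25), "user": "a"},
]

for ds, day_rows in sorted(partition_by_date(rows).items()):
    print(partition_path("events", ds), len(day_rows))
```

Because each day lives under its own path, a backfill can overwrite a single partition without rewriting the whole table.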


JupyterLab is Ready for Users

We are proud to announce the beta release series of JupyterLab, the next-generation web-based interface for Project Jupyter.

This is a big deal. Jupyter has millions of users, 1.7 million public notebooks on GitHub, and support for over 100 languages. While there are many notebook products today, the sheer size of Jupyter’s community means that this release matters.

I haven’t had a chance to play around with JupyterLab yet but plan to in the near future; would love to hear your thoughts / reactions.


Stitch Fix: Become A Full Stack Data Science Company

Lots of good insights in here on Stitch Fix’s model of hiring and training “full stack data scientists”: data scientists who can create a mathematical model, write production code, and maintain the model in production. While it’s probably not possible to hire a whole team of people who can do everything, it is possible to train new hires across the entire stack.

Also, completely agree on explicit data team representation at the C level:

When data science is your company’s competitive edge, having a CAO represent data science at the executive level is more effective than being represented by an engineering head, like a CTO. The CAO has a deeper and more nuanced understanding of data science(…)


Stitch Fix: What Do Data Scientists Need to Know about Containerization? As Little as Possible.

Bunch of good stuff from Stitch Fix this week. This post announces a new open source project called Flotilla:

Today we’re excited to introduce Flotilla, our latest open source project. Flotilla is a human friendly service for task execution. It allows you to focus on the work you’re doing rather than how to do it. In other words, Flotilla takes the struggle out of defining and running containerized jobs.

Related to the above post, this is the type of tooling that allows data scientists to deploy production code without investing all of their time in devops.


Facebook Research: Announcing Tensor Comprehensions

[Tensor Comprehensions] will allow researchers and programmers to write layers in a notation that is similar to the maths they use in their papers and communicate concisely the intent of their program. They will also be able to take that notation and translate it easily into a fast implementation in a matter of minutes rather than days.

Really cool mechanism to generate fast code from high-level network descriptions.
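
To make "notation that is similar to the maths" concrete, here is a small illustration in plain Python. It is not Tensor Comprehensions syntax, and the `matmul` helper is hypothetical. It just shows a matrix product written so that the code mirrors the index formula C[i][j] = sum over k of A[i][k] * B[k][j].

```python
def matmul(A, B):
    """Matrix product written to mirror C[i][j] = sum_k A[i][k] * B[k][j]."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

A version like this is easy to read but slow. The appeal of the approach described in the post is getting a fast implementation from a description at roughly this level.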


Trustless Machine Learning Contracts: Evaluating and Exchanging Machine Learning Models on the Ethereum Blockchain

Algorithmia just implemented a neural network in Solidity, the Ethereum scripting language. Is it a Good Idea to have a neural network running on-chain? I’m not smart enough to answer that question, but it certainly is novel and noteworthy. This paragraph was amusing:

And with a moment’s notice, 22 thousand machines ran the first neural network on the Ethereum blockchain. What looked like machine code to these everyday miners, was actually a fully functioning neural network. Feb 15th was a good day.

The future? Or too many buzzwords?


Hiring Data Scientists Step 1: Stop Looking for Data Scientists.

We are looking for someone to fill an upcoming gap in our business model. We are not exactly sure what you will be doing, but we are sure that our shareholders will love the idea that we have Data Scientists. You will report to someone that does not understand what you do, and you will often be met with skepticism when you present your solutions to management.

Amusing, poignant. I think the article overstates the point just a bit (technical skills are important!) but it’s a worthwhile and memorable read.


‘Big data’ classes a big hit in California high schools

About 30 high schools in California have started offering data science classes for juniors and seniors, in some cases as an alternative to Algebra 2.

The class has been popular with students. In addition to the class having a low attrition rate, 82 percent of students said they’d recommend it to a friend.

I’m not sure about replacing Algebra 2(!?), but it’s great to see students getting exposure to these topics earlier.



Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
