Stitch >> Talend. Functional Data Engineering. Don't be a Generalist. featexp. [DSR #161]

This week's best data science articles

$1.9 billion big data company Talend is acquiring Stitch for $60M

Wow! This was the surprise of the week for me.

As many of you know, the Data Science Roundup was originally started within Stitch, and when I left to found Fishtown Analytics, Stitch graciously allowed me to continue publishing it. We use the product extensively in our consulting work and have enjoyed watching it grow and improve over the past few years. As such, we have been and remain heavily invested in Stitch’s success.

Many Roundup readers are Stitch users, and in the past few days I’ve repeatedly gotten the question “What does this mean for the future of Stitch?” I figured I’d share my personal thoughts here. Note that I have no insider knowledge: this is just my read as an industry observer.

First, here’s Jake Stein (Stitch CEO) posting to dbt Slack this week:

Stitch is going to be an independent business unit at Talend, and our product will have the same focus as today. We’re going to staff up our team as well, so expect a lot more good stuff from us in the months and years to come.

Yeah, but that’s what he would say, right? Thing is, I actually believe this. I don’t think Talend bought Stitch to kill it; I think Talend (founded in 2005, pre-cloud) needs a play in the cloud ETL space, and they’re buying their way in. Talend has meaningful existing open-source commitments, so I think they’re ideologically aligned with Singer, Stitch’s open-source framework. My guess is that Talend plans to use its existing sales channel to accelerate Stitch’s adoption within the enterprise, but that the commitment to the existing self-service product is real.

So: if you’re an existing Stitch user, I actually do think this is good news. If you’re making your choice of provider now, I don’t think this news should change your calculus.

Congrats team! 🎉🎉


Functional Data Engineering — A Modern Paradigm for Batch Data Processing

I must apologize: this article was originally posted back in January and somehow I missed it. I came across it this past week and think it’s the single best post I’ve ever read on the design of batch data processing pipelines. Of course, it’s written by the original author of Airflow, so he knows a thing or two about the topic.

I have so much to say about this post, but Maxime Beauchemin says all of it better, so just read the post yourself.


Why You Shouldn’t be a Data Science Generalist

I work at a data science mentorship startup, and I’ve found there’s a single piece of advice that I catch myself giving over and over again to aspiring mentees. And it’s really not what I would have expected it to be.

Rather than suggesting a new library or tool, or some resume hack, I find myself recommending that they first think about what kind of data scientist they want to be.

The reason this is crucial is that data science isn’t a single, well-defined field, and companies don’t hire generic, jack-of-all-trades “data scientists”, but rather individuals with very specialized skill sets.

Could not agree more. Great read. The rest of the post describes the types of data scientists and gives concrete detail on how they differ.


My Secret Sauce to be in Top 2% of a Kaggle Competition

Hmm! This is my favorite library find of the past several months. The “secret sauce” referred to in the title of the post is a library called featexp. It provides some super-convenient data exploration functionality that the author demos to great effect.


Facebook Horizon: An Open Source Reinforcement Learning Platform

Today we are open-sourcing Horizon, an end-to-end applied reinforcement learning platform that uses RL to optimize products and services used by billions of people. We developed this platform to bridge the gap between RL’s growing impact in research and its traditionally narrow range of uses in production. We deployed Horizon at Facebook over the past year, improving the platform’s ability to adapt RL’s decision-based approach to large-scale applications. While others have worked on applications for reinforcement learning, Horizon is the first open source RL platform for production.


OpenAI: Spinning Up in Deep RL

At OpenAI, we believe that deep learning generally—and deep reinforcement learning specifically—will play central roles in the development of powerful AI technology. While there are numerous resources available to let people quickly ramp up in deep learning, deep reinforcement learning is more challenging to break into. We’ve designed Spinning Up to help people learn to use these technologies and to develop intuitions about them.


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

