The 2018 Data Landscape. Building Services. Red Flags in Interviews. AutoML. [DSR #145]
Wow—apparently I decided to take a break on a busy week! There’s a lot of great stuff in this issue, and I’ve saved a bunch more for next week. Enjoy :)
❤️ Want to support us? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This Week's Most Useful Posts
Matt Turck, of FirstMark Capital, produces the most authoritative guide to the industry every year. This is his 2018 post, complete with the standard logo zoo. The post is awesome and insightful, but I have a different take on the maturity of data infrastructure & analytics:
As the cycle of replacing older IT technologies with more modern data products continues, it seems that the Big Data market (infrastructure, analytics) is cycling through the early majority of buyers and transitioning into the late majority of the traditional adoption curve.
This is where we spend all of our time and energy operating, and I see almost universally poor adoption of new technologies. Companies have increasingly adopted modern data infrastructure and analytics tech, but there is limited knowledge among practitioners about how to use these technologies effectively. As a result, businesses are getting significantly less value from their investments than the technologies make possible.
My view is that there is an entire generation of tools that need to exist to help practitioners actually use the technology infrastructure that they’ve been given access to over the past decade. And most of these tools either don’t exist yet or are in their nascency. The underlying technology shift has created a new paradigm, and we’re still figuring out how to use it effectively. This is far from over.
Here is our list of 12 signs the company you are interviewing with for a data scientist job should be avoided (and the questions to ask during the interview).
Practical, much-needed advice. Many data scientists spend most of their job search time attempting to get a job, and don’t spend enough time thinking about whether a job is one they should take. You need to find out as early as possible if the organization you’re interviewing with is going to help you grow or condemn you to boredom.
I love this post! Here’s the intro:
In order for data scientists to be effective at a startup, they need to be able to build services that other teams can use, or that products can use directly. For example, instead of just defining a model for predicting user churn, a data scientist should be able to set up an endpoint that provides a real-time prediction for the likelihood of a player to churn.
Ok, yes, I definitely agree with that. But where does the post go from there? Oh, I dunno, how about an entire walkthrough of how to use AWS Lambda to create such a service?
There are so many “How to do X thing in Python” tutorials out there, but it’s rare to see a detailed tutorial on the practical stuff that will differentiate you as a data scientist. Thanks to Ben Weber of Zynga for a great post.
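To make the idea concrete, here is a minimal sketch of what such an endpoint could look like. It is not the code from Ben's post: the feature names, weights, and `predict_churn` helper are hypothetical stand-ins for a trained model. The `(event, context)` signature is the one AWS Lambda uses for Python handlers, and the response shape assumes the function sits behind an API Gateway proxy integration.

```python
import json
import math

# Hypothetical coefficients for illustration only; a real service
# would load a trained model instead of hard-coding weights.
WEIGHTS = {"sessions_last_7d": -0.4, "days_since_last_login": 0.3}
BIAS = -1.0


def predict_churn(features):
    """Return a churn probability from a logistic model over the known features."""
    score = BIAS + sum(
        weight * float(features.get(name, 0.0)) for name, weight in WEIGHTS.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


def lambda_handler(event, context):
    """Lambda-style entry point: parse the JSON body, score it, return JSON."""
    features = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"churn_probability": predict_churn(features)}),
    }
```

You can exercise the handler locally by passing it a dict with a JSON string under `"body"` and `None` for the context, which is a handy way to test the scoring logic before deploying anything.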
This is a post I’ve been hoping someone would write. There was a lot of commotion about automating machine learning after the recent announcement of Google’s AutoML. The press storyline was that products like this would make machine learning expertise unnecessary, which seemed obviously misleading to anyone who had any sense of how machine learning actually works in practice.
This article demystifies the entire area of automated hyperparameter selection and neural architecture search. It goes deep into the various approaches researchers have taken to the problem since 2013, as well as the more recent evolutionary and reinforcement-learning-based methods. It also talks about why these problems are actually a fairly small part of the overall ML process, and why transfer learning is often a superior approach.
If you’ve been following AutoML with any interest, this is the best soup-to-nuts analysis piece that I’ve seen. It’s a useful technology, but don’t buy the marketing :)
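For a sense of what "automated hyperparameter selection" means at its simplest, here is a sketch of random search, a common baseline. This is not what Google's AutoML does, and it is not taken from the article; the `validation_loss` function is a toy stand-in I made up for "train a model and measure how well it did", and the search ranges are arbitrary.

```python
import random


def validation_loss(learning_rate, num_layers):
    """Toy stand-in for training a model and measuring validation loss."""
    return (learning_rate - 0.01) ** 2 * 1e4 + (num_layers - 3) ** 2


def random_search(n_trials, seed=0):
    """Sample hyperparameters at random and keep the best-scoring set."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "num_layers": rng.randint(1, 6),
        }
        loss = validation_loss(**params)
        if best is None or loss < best[0]:
            best = (loss, params)
    return best


best_loss, best_params = random_search(n_trials=50)
```

The loop itself is tiny. The expensive part in real life is whatever sits inside `validation_loss`, which fits the article's point that the search is a fairly small part of the overall ML process.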
This post, by senior Googler Cassie Kozyrkov, focuses on the difference between ML research and applied ML. Her diagnosis is that ML research gets too much focus and there isn’t enough training in applied ML:
Unfortunately, I see a lot of businesses failing to get value from machine learning because they don’t realize that the applied side is a very different discipline from the algorithms research side. Instead, leaders try to start their kitchens by hiring those folks who’ve been building microwave parts their whole lives but have never cooked a thing.
This is one of the most widely shared posts I’ve seen recently, with over 10k Medium claps as of today. Apparently it hit a nerve.
New research from Yoshua Bengio and team. Their summary is excellent:
Many real-world problems require integrating multiple sources of information. Sometimes these problems involve multiple, distinct modalities of information — vision, language, audio, etc. — as is required to understand a scene in a movie or answer a question about an image.
When approaching such problems, it often makes sense to process one source of information in the context of another; for instance, in the right example above, one can extract meaning from the image in the context of the question. In machine learning, we often refer to this context-based processing as conditioning: the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input.
Finding an effective way to condition on or fuse sources of information is an open research problem, and in this article, we concentrate on a specific family of approaches we call feature-wise transformations.
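If "feature-wise transformation" sounds abstract, here is a minimal sketch of one common form as I understand it: a feature-wise affine transformation, where the conditioning input produces a scale and a shift for each feature. The function names, the plain linear maps, and the toy numbers are my own assumptions for illustration, not code from the article.

```python
def linear(weights, bias, vec):
    """Apply a linear map: one output per row of weights, plus a bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def feature_wise_affine(features, conditioning, gamma_w, gamma_b, beta_w, beta_b):
    """Scale and shift each feature using values computed from the conditioning input."""
    gamma = linear(gamma_w, gamma_b, conditioning)  # per-feature scale
    beta = linear(beta_w, beta_b, conditioning)     # per-feature shift
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Two features (e.g. from an image) modulated by a one-dimensional
# conditioning vector (e.g. from a question). All numbers are made up.
modulated = feature_wise_affine(
    features=[1.0, 2.0],
    conditioning=[1.0],
    gamma_w=[[2.0], [0.0]],
    gamma_b=[0.0, 1.0],
    beta_w=[[0.0], [1.0]],
    beta_b=[0.5, 0.0],
)
# modulated == [2.5, 3.0]
```

The auxiliary input never touches the features directly. It only decides how much each feature gets scaled and shifted, which is the "conditioned or modulated" idea from the summary.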
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123