Do I need a Data Engineer? ML Trends for 2019. AI in VR @ Facebook. Counting Solar Panels. [DSR #168]

❤️ Want to support this project? Forward this email to three friends!

🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.

This week's best data science articles

Does my Startup Data Team Need a Data Engineer?

My first post in a long time. Here’s the intro:

I find myself regularly having conversations with analytics leaders who are structuring the role of their team’s data engineers according to an outdated mental model. This mistake can significantly hinder your entire data team, and I’d like to see more companies avoid that outcome.

This post represents my beliefs about when, how, and why you should hire data engineers as a part of your team. It’s based on my experience at Fishtown Analytics working with over 100 VC-backed startups to build their data teams, and on conversations with hundreds of companies in the wider data community.

If you run a data team at a VC-backed startup, this post was written for you.

Would very much appreciate any and all comments from this community!


Machine Learning & AI Main Developments in 2018 and Key Trends for 2019

KDNuggets surveyed experts for their thoughts. My favorite predictions came from Andriy Burkov, who leads the ML team @ Gartner:

1. I expect everybody getting excited about AutoML promise even more than this year. I also expect it to fail (with the exception of some very specific and well-defined use cases, like image recognition, machine translation, and text classification, where handcrafted features aren’t needed or are standard, raw data is close to what the machine expects as the input, and the data is in abundance).

2. Marketing automation: with mature generative adversarial networks and variational autoencoders it is becoming possible to generate thousands of pictures of the same person or paysage with small differences in facial expressions or mood between those images. Based on how consumers react to those pictures, we can generate optimal advertisement campaigns.

The thought of auto-generated images feeding into Facebook campaigns that are automatically A/B tested for reactions with millions of cheap impressions is simultaneously wonderful and horrifying.


DeepSolar: A Machine Learning Framework to Efficiently Construct a Solar Deployment Database in the US

This is cool. The authors built a DL-based solution that takes satellite images and detects solar panels. It’s super-cool to me that you can actually have a database of literally every solar panel installation in the US just by…looking from the sky. Here are some neat conclusions from the research:

(We discovered that) residential solar deployment density peaks at a population density of 1,000 capita/mile², increases with annual household income asymptoting at ∼$150k, and has an inverse correlation with the Gini index representing income inequality.

Modern data techniques create brand new ways to answer previously unanswerable questions.


Data Science vs Engineering: Tension Points

Domino recently had a panel with four excellent guests, and this post does a great job of summarizing it. The intersection of DS / DE continues to be where some of the most interesting stuff is happening, and these folks are on the cutting edge. Overall, the panel seems to believe that these two teams are getting closer together, and in some cases are being combined:

…we’ve ended up bringing in people who could bridge data science and engineering. We’ve called the team “product engineering” that includes people who know how to build machine learning models, know how to do data science, have a bit of product intuition, and know how to put things into production.

This is very good news for practitioners trying to solve real problems.


I Worked With A Data Scientist As A Software Engineer. Here’s My Experience.

I really enjoyed this post from a data science outsider. The author is a mobile engineer working on Android, primarily using Java, and chronicles a year-long stint implementing some ML features within a mobile app.

I enjoyed the total outsider’s perspective—the author is technically quite competent, and it’s neat to see what he finds hard and what he finds easy. One of the more interesting things to me was how important language choice is in data science:

I experienced other challenges too, one of them was frequent: translating the Python solutions to Java. Since Python already has built in support for data science tasks, the code felt more concise in Python. I remember pulling my hair out when I tried to literally translate a command: scaling a 2D array and adding it as a transparent layer to an image. We finally got it to work and everyone was excited.

Industry people all know how much of a default Python has become, but it can create challenges for outsiders who typically view Python as a second-class citizen (justifiably so in many contexts!). There’s no reason why the author would ever use Python in his day job, and his reluctance to learn the language at all was an interesting juxtaposition with how dominant it is in our world.
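To make the conciseness point concrete, here is a rough sketch of the kind of operation the author describes: scaling a 2D array and blending it over an image as a transparent layer. The post doesn't show the original code, so everything here is an assumption on my part: the function name `overlay_heatmap`, the nearest-neighbor upscaling, the red tint, and the `alpha` default are all hypothetical choices for illustration.

```python
import numpy as np

def overlay_heatmap(image, heatmap, alpha=0.4):
    """Blend a small 2D array over an RGB image as a semi-transparent red layer.

    Assumptions (hypothetical, not from the original post):
    - image is an (H, W, 3) float array with values in [0, 1]
    - heatmap is an (h, w) array where H and W are integer multiples of h and w
    """
    H, W, _ = image.shape
    h, w = heatmap.shape

    # Scale the 2D array up to the image size (nearest-neighbor).
    scaled = np.repeat(np.repeat(heatmap, H // h, axis=0), W // w, axis=1)

    # Normalize to [0, 1] so it can be used as a color intensity.
    span = scaled.max() - scaled.min()
    if span > 0:
        norm = (scaled - scaled.min()) / span
    else:
        norm = np.zeros(scaled.shape)

    # Put the normalized values in the red channel of an otherwise black layer.
    layer = np.zeros_like(image)
    layer[..., 0] = norm

    # Alpha-blend the layer on top of the image.
    return (1 - alpha) * image + alpha * layer

image = np.full((4, 4, 3), 0.5)               # flat gray image
heatmap = np.array([[0.0, 1.0], [2.0, 4.0]])  # tiny 2x2 "model output"
blended = overlay_heatmap(image, heatmap)
```

The whole thing is a handful of array expressions, which is roughly what a line-by-line port to Java has to expand into explicit loops.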


Facebook: Open-sourcing DeepFocus, an AI-powered system for more realistic VR images

Wow. The easiest way to understand what DeepFocus does is to watch the 27-second video below. Really neat. What’s even cooler is that the team built this high-resolution, low-latency product using a CNN. This is the first time I’ve seen a demo of a deep learning application that has been integrated into an interactive experience, and it’s the performance that makes that possible.


DeepFocus Demo Video | Facebook Reality Labs

An Intro to Kernel Density Estimation

Kernel density estimation is a really useful statistical tool with an intimidating name. Often shortened to KDE, it’s a technique that let’s you create a smooth curve given a set of data. This can be useful if you want to visualize just the “shape” of some data, as a kind of continuous replacement for the discrete histogram. It can also be used to generate points that look like they came from a certain dataset - this behavior can power simple simulations, where simulated objects are modeled off of real data.

Really excellent visual / interactive explainer. You can absorb the entire concept in a memorable way within 2 minutes. Very worthwhile.
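If you'd rather see the idea in code, here is a minimal sketch of both uses mentioned above: building the smooth curve and generating new points that look like the data. It assumes a Gaussian kernel and a hand-picked bandwidth of 0.5; the function names `gaussian_kde` and `sample_from_kde` and the toy data are mine, not from the linked explainer.

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each point in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample; the estimate is their average.
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

def sample_from_kde(samples, n, bandwidth, rng):
    """Draw n points from the KDE: pick a data point, then add kernel noise."""
    samples = np.asarray(samples, dtype=float)
    centers = rng.choice(samples, size=n, replace=True)
    return centers + rng.normal(0.0, bandwidth, size=n)

data = np.array([1.0, 1.3, 2.1, 4.8, 5.0, 5.4])  # toy data with two clusters
grid = np.linspace(-3.0, 10.0, 1301)
density = gaussian_kde(data, grid, bandwidth=0.5)

rng = np.random.default_rng(0)
new_points = sample_from_kde(data, n=1000, bandwidth=0.5, rng=rng)
```

The bandwidth is the knob that matters: smaller values give a spikier curve that hugs each data point, larger values smooth the two clusters into one hump.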


Thanks to our sponsors!

Fishtown Analytics: Analytics Consulting for Startups

At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.

915 Spring Garden St., Suite 500, Philadelphia, PA 19123