Common-Sense Baselines. Hawaii. Data Engineering. Getting Hired. Predicting Churn. [DSR #120]
I love this post. It asks the question: how would you solve this problem if you knew zero data science? It then posits that this solution is probably good enough to get you a good starting answer. This is your common sense baseline. Build this before doing any advanced modeling.
I couldn’t agree more. This is a must-read.
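To make the idea concrete, here is a minimal sketch of one possible common-sense baseline: always predict the most common outcome, then measure how often that is right. The toy labels, the `logins` field, and the helper names are hypothetical and are not taken from the linked post.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always guesses the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    # The predictor ignores its input on purpose: that is the whole baseline.
    return lambda _features: most_common_label


def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction matches the true label."""
    hits = sum(predict(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


# Hypothetical toy data: 1 = churned, 0 = stayed.
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_rows = [{"logins": 3}, {"logins": 0}, {"logins": 7}, {"logins": 1}]
test_labels = [0, 1, 0, 0]

baseline = majority_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_rows, test_labels)
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Any fancier model then has to beat this number to justify its added complexity.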
The more experienced I become as a data scientist, the more convinced I am that data engineering is one of the most critical and foundational skills in any data scientist’s toolkit.
The author, now @ Airbnb and previously @ Twitter, shares the best introduction to data engineering I’ve seen. My favorite section is on the foundational choice of writing ETL jobs in the JVM (Java / Scala) vs writing them in SQL. His stated preference for SQL is one I share, and it’s why we invest heavily in our open-source product, dbt. dbt is a tool to develop and run DAGs of SQL data transformations.
There are emerging standards for how data engineering should be done and this post is a great intro.
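For readers new to the "DAG of SQL transformations" idea, the sketch below shows only the underlying concept: each model lists the models it selects from, and a topological sort produces a safe run order. It uses Python's standard-library `graphlib` and hypothetical model names. It is not how dbt itself is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: each key maps to the set of models it selects from.
dependencies = {
    "stg_orders": set(),
    "stg_customers": set(),
    "customer_orders": {"stg_orders", "stg_customers"},
    "churn_features": {"customer_orders"},
}

# static_order() yields every model after all of the models it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

The relative order of the two staging models is not significant; what matters is that every model runs after its upstream dependencies.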
By the end of 2019 the scientific stack will stop supporting Python2. As for numpy, after 2018 any new feature releases will support only Python3. To make transition less frustrating, I’ve collected a bunch of Python 3 features that you may find useful.
If you’re still on 2.7, it’s time to make the switch. Get ahead of the hard deadline.
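As a small taste of what the switch buys you, here is a sketch of a few Python 3 features I find handy: f-strings, `pathlib`, keyword-only arguments, extended unpacking, and true division. This is my own selection and may not match the linked post's list; the file path and function name are hypothetical.

```python
from pathlib import Path


def summarize(values, *, precision=2):
    """Summarize a list of numbers; `precision` is keyword-only (note the bare *)."""
    first, *rest = values              # extended iterable unpacking
    mean = sum(values) / len(values)   # "/" is true division in Python 3
    # f-string with a nested format spec (Python 3.6+)
    return f"first={first}, others={len(rest)}, mean={mean:.{precision}f}"


# pathlib builds paths with "/"; this path is hypothetical and is never opened.
data_file = Path("data") / "raw" / "events.csv"

summary = summarize([1, 2, 4], precision=1)
print(summary)
print(data_file.suffix)
```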
So here's my postmortem after hunting for a data science job.
Max Woolf landed a job as a data scientist at BuzzFeed and weighs in on how his search went. 10-tweet stream; highly relevant if somewhat discouraging.
(Aside: very cool that Revue now has this awesome tweet embed. Forward along relevant tweets and I’ll include them!)
Eric Mayefsky, head of data science at Quora, has assessed hundreds of job candidates in his half decade in management at various tech companies. (…) What he’s learned from his experiences on both sides of the table can help other data science leaders navigate the chasm of assumptions between interviewee and interviewer, and make more effective hires.
Choosing a machine learning library to solve predictive use cases is easier said than done. There are many to choose from, and each has its own niche and benefits that are good for specific use cases.
The author knows his stuff; he’s PM at Salesforce Einstein. The post is an excellent overview of how to think through this important decision.
We’ve been thinking a lot about churn prediction modeling recently—it’s a topic that keeps coming up over and over again with our clients. We’ve played around in this problem space before but are planning to go much deeper in the near future.
There were three articles that I came across as a part of this research that I wanted to share. The first, linked from the title, is a deep-learning approach to churn prediction using Keras. The article is detailed and excellent. The second uses a logistic regression + random forest ensemble that it found to be more effective than a decision tree model. The third recommends a random forest approach as well.
Which approach is best and under what conditions? As of today I have no idea. I was a little surprised that deep learning would be employed as a solution, but I’m trying to keep an open mind. I’ll have more to share on this topic soon.
Data viz of the week
Sorry for the slightly NSFW topic, but this was the standout data viz of the past week. Pornhub released its traffic stats (safe link!) for the period immediately surrounding Hawaii’s false alarm emergency alert.
There is plenty of amusement to be had out of this chart and I’ll let you draw your own conclusions on that front. I only want to point out that this chart is absolutely made by the annotations: the times at the bottom, the callouts for high and low points, and the “!” icon on the X axis. These callouts help the reader immediately key in on the narrative.
If you’re an analyst: use annotations to add context and tell your story. If you work at a BI company and your product doesn’t support annotations: add them.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.