Deep Learning in a Spreadsheet. Jobs! Data viz. The Spark Optimizer. [DSR #123]
❤️ Want to support us? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
Looking for a job? Check out these postings.
The community of data analysts, engineers, and scientists around dbt is growing quickly. In our public Slack channel there are more and more awesome companies looking to hire top-tier candidates; I figured I’d post a few current openings here.
We have personal relationships with each of these companies, so if you’re interested in any of the openings, feel free to reach out for an intro.
The week's most useful articles
This is absolutely brilliant.
I want to show you that Deep Convolutional Neural Nets are not nearly as intimidating as they sound. And I’ll prove it by showing you an implementation of one that I made in Google Sheets.
I’m sure you’ve read a dozen deep learning explainers, but actually demonstrating the concepts in a spreadsheet is so much more instructive than writing English sentences. Spend a few minutes with this even if you’ve used CNNs before: the spreadsheet is wonderful at building intuitions.
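To make the spreadsheet idea concrete, here is a minimal sketch of the arithmetic a single convolutional layer performs. It is not taken from the linked spreadsheet; the `image`, `kernel`, and function names are illustrative assumptions. Each output cell is just a sum of elementwise products, which is exactly the kind of formula a spreadsheet cell can hold. (Strictly speaking this is cross-correlation, which is what most deep learning libraries call convolution.)

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # One output cell = sum of elementwise products over the window.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Replace negative activations with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical 4x4 "image": dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical 2x2 filter that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = relu(convolve2d(image, kernel))
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The filter lights up only in the middle column, where the edge is. A real CNN learns the kernel values instead of hard-coding them, but the per-cell arithmetic is the same.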
Elijah Meeks, one of the foremost data viz experts in the business, recently responded at length to a question tweeted at him: “Can you explain to me what a Senior Data Visualization Engineer exactly does?”
Elijah has written about the role before and has gotten pushback: people essentially saying “you’re just a UI developer; you can’t really expect me to believe you spend 100% of your time making pie charts.” This post is his response to that perception.
I’m interested in the specific topic discussed here, but also in the more general trend: roles on the modern data team are still in the process of being figured out. It used to be that everyone was an analyst or a DBA or a statistician. Now we have data engineers, data scientists, data analysts, and visualization engineers. These are not trivial rebrandings of the old titles: they actually are different jobs.
It seems obvious to me that if your organization requires advanced data visualizations, hiring specialists in this area is a must. We’ve recently been thinking about hiring someone like this for the Fishtown team, as we’re getting more requests for this type of work.
Speaking of data viz:
Most of the time, the data you work with is not complete. There is missing data. Available values can be sparse across time and space the farther out you stretch. What do you do when this happens?
This is a wonderful guide on how to respect missing data in your visualizations. Hint: just removing nulls from your result set is rarely a good answer 😀
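As a rough illustration of the hint above (and not a reproduction of the guide's own recommendations), here is one common way to keep gaps explicit instead of silently dropping nulls. The dates, values, and function names are made up for the example.

```python
from datetime import date, timedelta


def fill_gaps(observations, start, end):
    """Return one (day, value) pair per calendar day; missing days get None."""
    series = []
    day = start
    while day <= end:
        series.append((day, observations.get(day)))
        day += timedelta(days=1)
    return series


def segments(series):
    """Split the series into runs of consecutive non-missing points.

    Plotting each run separately draws a visible break at a gap, rather
    than a straight line that implies data you do not have.
    """
    runs, current = [], []
    for day, value in series:
        if value is None:
            if current:
                runs.append(current)
                current = []
        else:
            current.append((day, value))
    if current:
        runs.append(current)
    return runs


# Hypothetical daily metric with two days missing (Feb 3 and Feb 4).
observations = {
    date(2018, 2, 1): 10,
    date(2018, 2, 2): 12,
    date(2018, 2, 5): 9,
}

series = fill_gaps(observations, date(2018, 2, 1), date(2018, 2, 5))
runs = segments(series)
print(len(series), [len(run) for run in runs])  # 5 [2, 1]
```

Had the nulls simply been filtered out, a line chart would connect Feb 2 straight to Feb 5 and hide the fact that two days were never observed.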
Google has been bragging about its TPU chip since May of 2016 and has finally released it into GCP this past week. If you have any active TensorFlow workloads, you should absolutely experiment with it. Currently you have to actually request a quota—they’re managing limited supply vs. large demand—so if you’re interested, fill out the form ASAP.
The fundamental idea of Bayesian inference is to become “less wrong” with more data.
If you’re new to Bayes’ rule, this is a great intro. This is an absolutely fundamental concept that shows up everywhere you look.
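Here is a small worked example of becoming “less wrong” with more data. It is a sketch with assumed numbers, not something from the linked post: two hypotheses about a coin (fair, or biased toward heads with probability 3/4), equal prior belief in each, and a hypothetical sequence of four flips.

```python
from fractions import Fraction


def bayes_update(prior, likelihoods):
    """Apply Bayes' rule: posterior(h) = prior(h) * likelihood(h) / evidence."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: weight / evidence for h, weight in unnormalized.items()}


# Assumed probability of heads under each hypothesis.
p_heads = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

# Start out undecided between the two hypotheses.
belief = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}

# Hypothetical observations: three heads and one tail.
for flip in "HHTH":
    likelihoods = {
        h: (p if flip == "H" else 1 - p) for h, p in p_heads.items()
    }
    belief = bayes_update(belief, likelihoods)

print(belief["fair"], belief["biased"])  # 16/43 27/43
```

Each flip nudges the belief; after four flips the biased hypothesis holds about 63% of the probability. More data would sharpen that further in whichever direction the evidence points.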
The ability to write production-level code is one of the most sought-after skills for a data scientist role.
100% agree. Most posts on this subject don’t actually help data scientists think about how to write better code; instead, they give some easy answer like “use XYZ product” or “wrap your code behind a RESTful API”. These suggestions are not wrong, per se, but they’re certainly inadequate.
The correct answer to “how do you write production code?” is: write better code. Modularize and organize your components. Document well. Focus on readability. Discuss integration requirements with consumers of your algorithms.
This post tackles the subject head-on. While this is not a topic that a single blog post can fully address, this is the best overview that I’ve seen.
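To illustrate the “modularize, document, focus on readability” advice, here is a minimal sketch of what that can look like for a tiny piece of analysis code. The function names and the task (standardizing scores) are my own assumptions, not examples from the linked post.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def validate_scores(scores: Sequence[float]) -> None:
    """Raise ValueError if `scores` cannot be standardized."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    if pstdev(scores) == 0:
        raise ValueError("scores have zero variance")


def standardize(scores: Sequence[float]) -> List[float]:
    """Return z-scores: each value minus the mean, divided by the
    population standard deviation.

    Validation lives in its own function so callers and tests can
    exercise it separately.
    """
    validate_scores(scores)
    center = mean(scores)
    spread = pstdev(scores)
    return [(score - center) / spread for score in scores]


z_scores = standardize([1.0, 2.0, 3.0])
print([round(z, 3) for z in z_scores])  # [-1.225, 0.0, 1.225]
```

Nothing here is clever, which is the point: small single-purpose functions, explicit failure modes, and docstrings that tell a consumer what to expect.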
I really think that understanding query optimizers is a super-important skill, and one that is all too rare. This post delves into the optimizer implementation in Spark and how it can produce novel and dramatic performance improvements. I learned a lot from it.
The vast majority of big data SQL or MPP engines follow the Volcano iterator architecture that is inefficient for analytical workloads. Since Spark 2.0 release, the new Tungsten execution engine in Apache Spark implements whole-stage code generation, a technique inspired by modern compilers to collapse the entire query into a single function. This JIT compiler approach is a far superior architecture than the row-at-a-time processing or code generation model employed by other engines, making Spark one of the most efficient in the market.
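The contrast in that quote can be sketched in a few lines. This is a toy illustration in Python, not Spark's actual implementation (Spark generates JVM code at runtime); the operator classes, the `sales` rows, and the query are all assumptions. The query is roughly "sum of price where price > 10". In the Volcano style, every operator exposes a `next()` method and each row passes through a chain of calls. The fused version is the kind of single tight loop that whole-stage code generation aims to produce.

```python
class Scan:
    """Leaf operator: returns one row per next() call, None when exhausted."""

    def __init__(self, rows):
        self._rows = iter(rows)

    def next(self):
        return next(self._rows, None)


class Filter:
    """Pulls rows from its child until one satisfies the predicate."""

    def __init__(self, child, predicate):
        self._child = child
        self._predicate = predicate

    def next(self):
        while True:
            row = self._child.next()
            if row is None or self._predicate(row):
                return row


class SumColumn:
    """Root operator: drains its child and sums one column."""

    def __init__(self, child, column):
        self._child = child
        self._column = column

    def result(self):
        total = 0
        while True:
            row = self._child.next()
            if row is None:
                return total
            total += row[self._column]


def fused_sum(rows):
    """The same query collapsed into one loop with no per-row operator calls."""
    total = 0
    for row in rows:
        if row["price"] > 10:
            total += row["price"]
    return total


# Hypothetical input rows.
sales = [{"price": 5}, {"price": 12}, {"price": 20}]

plan = SumColumn(Filter(Scan(sales), lambda row: row["price"] > 10), "price")
volcano_total = plan.result()
fused_total = fused_sum(sales)
print(volcano_total, fused_total)  # 32 32
```

Both produce the same answer. The difference is overhead: the iterator version makes several method calls per row and keeps intermediate rows moving between operators, while the fused loop keeps everything in local variables.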
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123