Data Science Roundup #70: The Rise of the Data Engineer, GPUs, and Some Amazing Data Viz
Every week I scan the top news from 140+ blogs and publications and distill 500+ posts down to the best six. Do you like reading the Data Science Roundup? Please share with your network. Your shares are how we grow!
Thanks! 🤣 - Tristan
Referred by a friend? Sign up.
Focus on: Data Engineering
This is a must-read post. Here’s the entire intro:
I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.
I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.
My team was at the forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs on traditional methods.
We were pioneers. We were data engineers!
There is a massive amount of knowledge in this post. Read it.
If you’re a data scientist at a large company, or your datasets primarily come to you in CSVs that you store locally on your hard drive, you may not have much of a relationship with anyone calling themselves a data engineer. Hand you a Jupyter or RStudio notebook and you’re good to go. So why is there such a fuss about this new role?
Data engineers are critical because of the complexity of the data environments that have become commonplace at technology companies. It is the scale and intricacy of an environment like Uber’s or Facebook’s that makes data engineering so essential.
This article goes into depth on exactly what such a complex data environment looks like and does an excellent job discussing how data engineers create solutions in environments like these.
Focus on: Business People!
There are two primary sets of people who read the Data Science Roundup: data science practitioners who skew fairly technical, and business people who need to interact with data scientists but are themselves much less technical. Much of the content I link to attempts to target both audiences, but I wanted to take some space today to specifically focus on the latter.
This article is a real gem. Written by an expert in GPU technology (MapD builds a SQL database that is powered by GPUs), it goes back to basics in explaining the fundamentals of GPU-based computing. I actually hadn’t realized that the primary benefit of GPUs is their massively parallel architecture, which is something of a historical accident:
The reason GPUs exist at all is that engineers recognized in the nineties that rendering polygons to a screen was a fundamentally parallelizable problem - the color of each pixel could be computed independently of its neighbors.
Apparently each GPU chip has thousands of cores, and as such is far superior at handling “embarrassingly parallel” operations. Like analytics.
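To make “embarrassingly parallel” concrete, here’s a hedged CPU-side sketch in Python (the `shade` function and image dimensions are invented for illustration, not anyone’s real rendering code): each pixel’s value depends only on its own coordinates, so the whole frame can be fanned out across workers with zero coordination — exactly the property that lets a GPU throw thousands of cores at it.

```python
from concurrent.futures import ProcessPoolExecutor

WIDTH, HEIGHT = 64, 48

def shade(pixel):
    # Toy "shader": each pixel's color is computed from its own
    # coordinates only -- no pixel reads its neighbors, so all of
    # them can be computed simultaneously. A GPU runs thousands of
    # these in parallel; here we fan out across CPU cores instead.
    x, y = pixel
    return (x * 255 // (WIDTH - 1), y * 255 // (HEIGHT - 1), 128)

if __name__ == "__main__":
    pixels = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
    with ProcessPoolExecutor() as pool:
        frame = list(pool.map(shade, pixels, chunksize=256))
    print(len(frame))  # one independently computed value per pixel
```

The same independence shows up in analytics: summing or filtering a column touches each row without looking at the others, which is why it maps so well onto thousands of GPU cores.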
You’re likely familiar with the core concepts behind predictive analytics—choosing your variable, choosing an algorithm, creating test and training sets, etc.—but unless you’ve actually gone through the process yourself, you may feel a bit shaky on the implementation details. If this describes where you are, this article, and its followup, were written for you. With this post and some basic R or Python skills, you have everything you need to build a simple predictive model.
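As a minimal sketch of that workflow — the dataset below is made up purely for illustration — the steps are: pick your target variable, split the data into training and test sets, fit a model on the training portion, and measure error on the held-out portion. Ordinary least squares on a single feature is implemented by hand here so nothing beyond the standard library is assumed:

```python
import random

# Toy dataset: predict y from x (numbers invented for illustration).
random.seed(0)
data = [(x, 3.0 * x + 5.0 + random.gauss(0, 2)) for x in range(100)]

# 1. Split into training and test sets (80/20).
random.shuffle(data)
train, test = data[:80], data[80:]

# 2. Fit a simple model (one-variable ordinary least squares) on train.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# 3. Evaluate on the held-out test set (mean absolute error).
mae = sum(abs((slope * x + intercept) - y) for x, y in test) / len(test)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={mae:.2f}")
```

Swap in a real dataset and a library model (scikit-learn in Python, `lm()` in R) and the shape of the process — split, fit, evaluate on data the model never saw — stays exactly the same.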
This week's best data science articles
This is one of the very best articles I’ve ever read on the topic of data visualization. It is an absolute must-read. Here’s my favorite section:
As you learn more, you get more choices, which in itself can be a challenge. Resist the temptation to add so many things to your visualization that it obscures the original purpose. That said, don’t use this as an excuse to resist trying new things. You won’t know how far you should go until you’ve gone too far.
Iterate. Practice. Then let the data speak.
I doubt you are an assembler hacker (and I’m not linking to this article because I think you should become one). But the author makes a really fascinating observation: there are elements of deep learning algorithms that are not well-optimized by modern compilers. Because of the massive datasets required by these algorithms, the performance implications of even minor optimizations become very significant.
This feels very much to me like when it was still challenging to accept credit card payments online, circa 1998—a great reminder that we are still at the very beginning of this particular era.
Thanks to our sponsors!
Fishtown Analytics works with venture-funded startups to implement Redshift, BigQuery, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
If you don't want these updates anymore, please unsubscribe here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123