Career Advice. Scaling a Mature Data Pipeline. Deepfake Detection. Presto. Logs. [DSR #199]
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
Erik Bernhardsson previously ran analytics @ Spotify and now runs engineering and product at Better. His blog is fantastic, and this post contains some of the best advice for people in any kind of technical role that I’ve ever seen. I agree with 100% of it.
I found this post fascinating. The author is making the point that in mature data engineering pipelines there is meaningful overhead associated with doing a bunch of things that are not actually executing computation: spinning up environments, disk I/O, etc. The scale of this problem experienced by the Airbnb Payments team was striking to me: 2 hours of overhead to run a DAG that processed almost no data. Yikes.
Regardless of what type of data processing you’re doing, overhead is the enemy. Every bit of overhead introduces friction and should be a consistent focus for optimization. We deal with this every day in the maintenance of dbt: our compilation time is the overhead in the system, as each interactive run today requires a full recompilation of the entire project. This is a problem for the workflow, and is one of the reasons we’ve spent a tremendous amount of time over the past several months building partial parsing: the ability to re-parse only the parts of the DAG that have changed since the last compilation. This cuts the time-to-first-model-build by 90-95%. (If you’re a dbt user, this is going into GA within the next month-ish and it’s going to be a big deal.)
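dbt’s actual partial-parsing implementation is more involved than this, but the core idea of caching a content hash per model and re-parsing only what changed can be sketched in a few lines. Note that the `parse` callback and the dict shapes here are hypothetical illustrations, not dbt internals:

```python
import hashlib

def content_hash(text: str) -> str:
    """Hash used to detect whether a model's source changed between runs."""
    return hashlib.sha256(text.encode()).hexdigest()

def partial_parse(sources, cache, parse):
    """Re-parse only models whose source hash differs from the cached one.

    sources: dict of model name -> source text
    cache:   dict of model name -> (hash, parsed result) from the last run
    parse:   the (expensive) parse function to call on changed sources
    Returns the new cache; unchanged entries are carried over untouched.
    """
    new_cache = {}
    for name, text in sources.items():
        h = content_hash(text)
        cached = cache.get(name)
        if cached is not None and cached[0] == h:
            new_cache[name] = cached            # unchanged: reuse prior parse
        else:
            new_cache[name] = (h, parse(text))  # changed or new: pay the parse cost
    return new_cache
```

On a typical interactive run only a handful of models have changed, so almost all of the parse cost disappears.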
The post presents solutions that the Payments team went through to address the issue, and I think they’re interesting. IMO, though, every solution will be different—the bigger point is that overhead is the silent killer. Measure it, crush it.
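Measuring starts with instrumenting each phase of a pipeline step separately, so that overhead (environment setup, parsing, I/O) is visible next to the compute you actually care about. A minimal sketch, where the phase names and `time.sleep` stand-ins are placeholders for real work:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, timings):
    """Record wall-clock time for a named phase of a pipeline step."""
    start = time.perf_counter()
    yield
    timings[label] = time.perf_counter() - start

timings = {}
with timed("setup", timings):    # environment spin-up, connections, parsing
    time.sleep(0.01)             # placeholder for real setup work
with timed("compute", timings):  # the actual data processing
    time.sleep(0.01)             # placeholder for real compute

# The number to watch: how much of each run is pure overhead?
overhead_ratio = timings["setup"] / sum(timings.values())
```

Once the ratio is on a dashboard, it becomes very hard to ignore a DAG that spends 2 hours on overhead to process almost no data.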
Like any transformative technology, this has created new challenges. So-called “deepfakes”—produced by deep generative models that can manipulate video and audio clips—are one of these. Since their first appearance in late 2017, many open-source deepfake generation methods have emerged, leading to a growing number of synthesized media clips. While many are likely intended to be humorous, others could be harmful to individuals and society.
Google is investing in making sure that 2020 is not the election year in which deepfakes make a big splash. We will see—I’m not sure that detecting a deepfake actually mitigates much of its impact. False rumors didn’t stop spreading on the internet just because Snopes existed.
Uber is honored to join the Presto Foundation, a new initiative hosted by the Linux Foundation, to advance the open source data processing community.
Presto continues to be in the news. I’ve linked to a bunch of posts covering it in recent weeks, as it forms the backbone of SQL data processing inside many at-scale tech companies. I’m interested in this announcement about the foundation because I think the increased governance could lead to further innovation and maturity within the platform. Very good news.
If the first ten years of data science were all about collecting and analyzing everything, the second ten will be about being deliberate and selective about what data we collect and analyze.
I don’t know that I agree with this perspective, but it’s an interesting and contrarian one. My thought process around this gets pretty philosophical pretty quickly, so I’ll spare you most of it. I do think it is quite legitimate for companies to use data as a source of competitive advantage; the problem today is that companies store logs with roughly the same care that companies of yesteryear gave to passwords in plain text. Differential privacy should be a requirement: the value of this data shouldn’t be in its ability to map to individuals, but rather in its ability to generate insights about populations.
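The classic mechanism here is Laplace noise calibrated to a query’s sensitivity. A toy sketch for a population count follows; this is illustrative only, and real deployments track privacy budgets and use vetted libraries rather than hand-rolled samplers:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) via the inverse CDF of a uniform draw."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with noise calibrated for epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one individual
    changes it by at most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon)
```

The released number is still useful for population-level insight, but any single individual’s presence in the logs is statistically masked.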
Anyway…IMO logs aren’t going anywhere, but they will likely look different in the future.
Not useful (unless you’re in art restoration), but fascinating! Neural style transfer is good for something other than making amusing filters :)
I do not think this post will present fundamentally new ideas to most readers of this newsletter. However, I actually do think the image (above) is a useful depiction of the various components of a data science effort. Maybe bookmark this post for the next time you need a diagram like this for a deck?
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123