A Decade in Tech. SciPy. Network Pruning. ML @ Spotify. Coronavirus. [DSR #217]
This week's best data science articles
As 2019 draws to a close, I wanted to jot down some thoughts on some of the most important technological adoptions and innovations in tech this past decade. I also look a bit into the future, enumerate a list of pain points and opportunities that can be addressed in the coming decade.
This post is fantastic. It’s not focused on data, but rather the software engineering space more broadly, covering trends like containerization, CI/CD, streaming, and lots more. The reason this is interesting to me, and why it also should be interesting to you, is that it tells you what the future in data is going to look like. Application developers tend to get all of the new toys first, and then data folks slightly later. Essentially every technology trend mentioned in this post is also operating in the data space but is more nascent in its adoption.
Perhaps the instance of this that is most interesting to me is observability. There is quite a lot of movement in that space within the broader software engineering ecosystem, and data is just starting to see any movement at all. Generally, observability within data products is extremely poor today. Don’t expect that to persist forever—there are several projects pushing on this.
If you’re not familiar,
SciPy provides fundamental algorithms for scientific computing.
It’s one of the most widely used packages in the Python data science ecosystem, right up there with pandas, NumPy, and scikit-learn. As of this writing,
…over 110,000 GitHub repositories and 6,500 packages depend on SciPy.
This is a fascinating journal article, not quite like anything I’ve ever seen before. It reviews the history, architecture, and structure of both the package and its community. I enjoyed it because I like knowing the history and the stories behind the technology that shapes our current environment. So often, the quirks of the way things are today are the result of path dependence and can only be understood in full view of their historical context. For example, I didn’t know that scikits (like scikit-learn) evolved out of core SciPy and were separated out in an effort to keep its scope manageable.
Long, skimmable, unique.
Pruning is something that I’ve seen come up more and more often recently and I find it very interesting. Here’s the first paragraph of the paper that gives a good overview:
We present a filter pruning approach for deep model compression, using a multitask network. Our approach is based on learning a pruner network to prune a pre-trained target network. The pruner is essentially a multitask deep neural network with binary outputs that help identify the filters from each layer of the original network that do not have any significant contribution to the model and can therefore be pruned. The pruner network has the same architecture as the original network except that it has a multitask/multi-output last layer containing binary-valued outputs (one per filter), which indicate which filters have to be pruned. The pruner’s goal is to minimize the number of filters from the original network by assigning zero weights to the corresponding output feature-maps.
The reason this seems like an obviously good idea is its correlate in the human brain: we know that our brains are constantly pruning less useful connections. I’m very interested in the future of this research.
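To make the mechanics in the quoted abstract a little more concrete, here is a minimal sketch of what applying a per-filter binary mask looks like. It assumes the mask already exists; in the paper the mask is learned by the pruner network, which this toy example does not attempt to reproduce. The names `apply_filter_mask` and `pruned_fraction` are hypothetical and exist only for illustration.

```python
from typing import List

# A "filter" is represented here as a flat list of weights.
# Real convolutional filters are multi-dimensional tensors.
Filter = List[float]


def apply_filter_mask(filters: List[Filter], keep_mask: List[int]) -> List[Filter]:
    """Return only the filters whose mask entry is 1.

    keep_mask holds one binary value per filter (1 = keep, 0 = prune),
    mirroring the "one binary output per filter" idea in the abstract.
    The mask is assumed to be given; learning it is out of scope here.
    """
    if len(filters) != len(keep_mask):
        raise ValueError("expected exactly one mask entry per filter")
    return [f for f, keep in zip(filters, keep_mask) if keep == 1]


def pruned_fraction(keep_mask: List[int]) -> float:
    """Fraction of filters removed by the mask."""
    if not keep_mask:
        return 0.0
    return 1.0 - sum(keep_mask) / len(keep_mask)


layer = [[0.9, -1.2], [0.0, 0.01], [0.4, 0.7]]
mask = [1, 0, 1]  # hypothetical pruner output for this layer
compressed = apply_filter_mask(layer, mask)
```

In this toy layer the second filter is dropped, so the layer keeps two of its three filters. The interesting part of the research is how the mask gets chosen, not how it gets applied.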
This is a fantastic overview post of Spotify’s journey with ML. It doesn’t go incredibly deep, but that ends up working nicely. The contrast between Spotify’s ML challenges and those at Netflix, which stems from the inherently different behavioral patterns of music and TV consumption, was particularly interesting.
The BlueDot algorithm scours news reports and airline ticketing data to predict the spread of diseases like those linked to the flu outbreak in China.
Very timely, obviously: in our increasingly connected and urbanized world, outbreaks will only become a bigger challenge. There was a lot of noise about using AI to monitor and predict public health issues earlier in the decade, especially at Google with Flu Trends, which:
…was euthanized after underestimating the severity of the 2013 flu season by 140 percent
I had kind of missed that. I hope that BlueDot is able to do better; Coronavirus has the potential to be a massive global health problem.
A table format for large, slow-moving tabular data
Iceberg came up on a call I had this week and I hadn’t heard about it before. It’s similar to Delta Lake, which I’ve covered before, as well as Snowflake Time Travel. Each of these solutions brings the data warehouse and the data lake closer together by giving certain guarantees on top of files stored within blob storage. This trend is a critical one to understand in the evolution of data warehouse technology.
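One way to picture the kind of guarantee these formats add on top of plain files is a table defined as a sequence of immutable snapshots, each listing the data files visible at that point. The sketch below is a toy illustration of that general idea only. It is not Iceberg's metadata format or API, and `SnapshotTable`, `commit`, and `files` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List, Optional


@dataclass
class SnapshotTable:
    # snapshots[i] is the immutable set of data file paths visible in
    # snapshot i. Snapshot 0 is the empty table.
    snapshots: List[FrozenSet[str]] = field(
        default_factory=lambda: [frozenset()]
    )

    def commit(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        """Create a new snapshot from the latest one and return its id.

        Earlier snapshots are never modified, so readers pinned to an
        older snapshot keep seeing a consistent set of files.
        """
        current = self.snapshots[-1]
        new_files = (current - frozenset(remove)) | frozenset(add)
        self.snapshots.append(new_files)
        return len(self.snapshots) - 1

    def files(self, snapshot_id: Optional[int] = None) -> FrozenSet[str]:
        """List the files in a given snapshot (default: the latest)."""
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]


table = SnapshotTable()
first = table.commit(add=["a.parquet", "b.parquet"])
second = table.commit(add=["c.parquet"], remove=["a.parquet"])
```

Reading `table.files(first)` after the second commit still returns the original two files, which is the "time travel" behavior in miniature. Real table formats layer atomic commits, schema tracking, and partition metadata on top of this basic shape.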
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.