Honesty in Pandemic Modeling. OpenAI Microscope. Analytic Indices? Lossless Image Compression via DL. [DSR #224]
Hi! It’s been a crazy couple of weeks for me—I had a new baby, moved into a new house, and raised venture money for my company (TechCrunch / dbt blog). If you’re ever in the same situation, I’d recommend spacing these life events out a bit more!
Like most of you, I’ve been seeing almost no adult humans IRL. This isolation, the rapid pace of change in my personal life, and the president’s (“sarcastic”) recommendation of injections of disinfectants as a cure for Covid-19 have made me feel like I’m living inside a Picasso. Is this all really happening?!
I have a good set of links lined up for you this week, but I thought I’d also share my mental state beforehand. Talking about anything except for Covid right now seems almost pointless (and your click-through rates reflect this), but I remain undaunted. Let’s talk about data, even if we’re all living in a simulation gone wrong.
- Tristan
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
A Call to Honesty in Pandemic Modeling
Hiding infections in the future is not the same as avoiding them.
OK, agreed. You’ve got me hooked. Where do we go from here?
A keen figure-reader will notice something peculiar in Kristof’s figure. At the tail end of his “Social distancing for 2 months” scenario, there is an intriguing rise in the number of infections (could it be exponential?), right before the figure ends. That’s because of an inevitable feature of realistic models of epidemics; once transmission rates return to normal, the epidemic will proceed largely as it would have without mitigations, unless a significant fraction of the population is immune (either because they have recovered from the infection or because an effective vaccine has been developed), or the infectious agent has been completely eliminated, without risk of reintroduction. In the case of the model presented in Kristof’s article, assumptions about seasonality of the virus combined with the longer mitigation period simply push the epidemic outside the window they consider.
It goes further: the author actually goes into the JavaScript that generates the simulation, edits it, and publishes the updated numbers with a longer horizon. Very much worth reading. I haven’t seen such an effective teardown of an NYT piece before.
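To make that rebound dynamic concrete, here’s a minimal SIR sketch (my own toy, with illustrative parameters—not the article’s model) showing that a temporary reduction in transmission mostly shifts the peak later rather than preventing it:

```python
# Toy SIR model: a two-month mitigation delays the peak; it doesn't erase it.
import numpy as np

def simulate_sir(days, beta_fn, gamma=0.1, n=1_000_000, i0=100):
    s, i, r = n - i0, i0, 0
    infected = []
    for t in range(days):
        new_inf = beta_fn(t) * s * i / n   # new infections this day
        new_rec = gamma * i                # new recoveries this day
        s -= new_inf
        i += new_inf - new_rec
        r += new_rec
        infected.append(i)
    return np.array(infected)

no_mitigation = simulate_sir(400, lambda t: 0.3)
# Distancing cuts transmission for days 30-90, then everything reopens.
two_months = simulate_sir(400, lambda t: 0.12 if 30 <= t < 90 else 0.3)

print(f"peak day, no mitigation: {no_mitigation.argmax()}")
print(f"peak day, 2-month distancing: {two_months.argmax()}")
# The second peak is nearly as tall as the first -- it just arrives later,
# which is exactly the rise hiding at the right edge of Kristof's figure.
```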
OpenAI Microscope
More interpretability news! In my last issue I covered a Distill publication urging more fine-grained examination of individual neurons, and now OpenAI has released a major new tool to help researchers do just that. I haven’t gone deep with it yet, but have clicked around—I plan on spending more time with it in the coming weeks. Curious to hear if you have any thoughts!
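If you want a feel for what sits behind those neuron visualizations, here’s a hedged sketch of the standard feature-visualization recipe (gradient ascent on an input image to maximize one channel’s activation), using torchvision’s GoogLeNet as a stand-in—the model, layer, and channel choices are mine, not what OpenAI ships:

```python
# Sketch: optimize an input image to excite one channel of one layer.
import torch
import torchvision.models as models

model = models.googlenet(pretrained=True).eval()

activations = {}
def hook(module, inputs, output):
    activations["feat"] = output

# Arbitrary (illustrative) layer and channel to visualize.
layer, channel = model.inception4a, 42
layer.register_forward_hook(hook)

img = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)

for _ in range(256):
    optimizer.zero_grad()
    model(img)
    # Maximize the mean activation of the chosen channel.
    loss = -activations["feat"][0, channel].mean()
    loss.backward()
    optimizer.step()
# `img` now roughly shows what this neuron responds to; real feature
# visualization adds regularization and transforms for cleaner results.
```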
Qd-tree: Learning Data Layouts for Big Data Analytics
Indices for analytic databases!?
Technologies such as columnar block-based data organization and compression have become standard practice in most commercial database systems. However, the problem of best assigning records to data blocks on storage is still open. For example, today’s systems usually partition data by arrival time into row groups, or range/hash partition the data based on selected fields. For a given workload, however, such techniques are unable to optimize for the important metric of the number of blocks accessed by a query. This metric directly relates to the I/O cost, and therefore performance, of most analytical queries.
In this paper, we propose a new framework called a query-data routing tree, or qd-tree, to address this problem, and propose two algorithms for their construction based on greedy and deep reinforcement learning techniques. Experiments over benchmark and real workloads show that a qd-tree can provide physical speedups of more than an order of magnitude compared to current blocking schemes, and can reach within 2× of the lower bound for data skipping based on selectivity, while providing complete semantic descriptions of created blocks.
This is from Microsoft Research; I have to imagine that if these results have practical applicability, they will find their way onto the Azure product roadmap within the next couple of years.
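To give a flavor of the core idea, here’s a toy illustration (mine, not the paper’s algorithm): route rows to blocks via a small tree of cut predicates, and let a query skip any block whose predicates contradict its filter:

```python
# Toy qd-tree-style routing: internal nodes are predicates, leaves are blocks.
rows = [{"region": r, "amount": a}
        for r, a in [("EU", 10), ("EU", 500), ("US", 20), ("US", 900)]]

tree = ("amount < 100",
        ("region == 'EU'", "block0", "block1"),
        ("region == 'EU'", "block2", "block3"))

def route(row, node):
    if isinstance(node, str):        # leaf: a block id
        return node
    pred, left, right = node
    return route(row, left) if eval(pred, {}, row) else route(row, right)

blocks = {}
for row in rows:
    blocks.setdefault(route(row, tree), []).append(row)

print(blocks)
# A query like "amount >= 100 AND region == 'US'" only needs block3:
# the same predicates that routed the rows double as block descriptions,
# so block0-block2 are skipped without being read.
```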
This is my favorite post of the past couple of weeks. Nassim Taleb has had several fantastically good ideas, so much so that if you work in data (or really any kind of work involving decision-making) you really must be familiar with them. But he’s also insufferable, and if you’re sensitive to that kind of thing, his books will be a real slog to get through.
Fortunately, this article does a fantastic job of distilling Taleb’s most important insights into bite-sized nuggets. Highly recommended.
Lossless Image Compression through Super-Resolution
This is the official implementation of SReC in PyTorch. SReC frames lossless compression as a super-resolution problem and applies neural networks to compress images. SReC can achieve state-of-the-art compression rates on large datasets with practical runtimes. Training, compression, and decompression are fully supported and open-sourced.
This was a new idea to me, though likely not to others. What really blew me away was the quality of the compression—zoom in on the image in the link to see what I mean.
This approach is quite different from traditional image compression; I’m very curious to see whether it has legs. The link to the paper is in the GitHub README.
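The framing is easy to demo. Here’s a minimal sketch (my toy, not SReC’s network) of lossless compression via a super-resolution predictor: store a downsampled image plus the residual against an upsampling prediction, and reconstruction is exact:

```python
# Lossless "compression as super-resolution": predictor + residual = original.
import numpy as np

def downsample(img):
    return img[::2, ::2]               # keep every other pixel

def predict(low, shape):
    # Nearest-neighbor upsampling stands in for SReC's learned model.
    up = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

img = np.random.randint(0, 256, (64, 64), dtype=np.int32)
low = downsample(img)
residual = img - predict(low, img.shape)   # small values, cheap to entropy-code

restored = predict(low, img.shape) + residual
assert (restored == img).all()             # reconstruction is exact
# A better predictor (SReC's network) shrinks the residuals, and with
# them the entropy-coded size -- that's the whole trick.
```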
7 Reasons To Not Hire a Data Scientist
This is probably preaching to the choir, but it’s a great, succinct resource for convincing someone not to hire a data scientist when it isn’t a good fit.
Antipatterns in open-sourced ML research code
This is one of the best Reddit threads I’ve ever seen. The author gives specific, actionable advice to ML researchers, or anyone else writing research-style code.
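To give a taste of the genre (these fixes are my paraphrase of the kind of advice in the thread, not quotes from it): replace hard-coded paths and unseeded randomness with explicit config and seeds:

```python
# Two classic research-code antipatterns, fixed: configurable paths + seeding.
import argparse
import random

import numpy as np
import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", required=True)   # not "/home/me/data"
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()

    # Seed every RNG you touch so runs are reproducible.
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

    print(f"training on {args.data_dir} with seed {args.seed}")

if __name__ == "__main__":
    main()
```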
Thanks to our sponsors!
dbt: Your Entire Analytics Engineering Workflow
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Stitch: Simple, Powerful ETL Built for Developers
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.