The State of Deep Learning. Predicting Grocery Availability @ Instacart. Speed Matters. [DSR #165]
This week's best data science articles
It’s really hard to keep track of developments in a 🔥 field like deep learning.
Agreed. This post shares the top three posts from companies, top three posts from community members, top areas of interest, and top frameworks, among others. Great read to stay abreast of the developments in DL over the past six months.
How and why Instacart uses machine learning to predict real-time availability of hundreds of millions of grocery items sold across the US and Canada.
This is an excellent walkthrough of a complicated problem and its solution. There’s some great discussion in the post around particularly tricky challenges—high categorical cardinality and sparse data in the long-tail—where the team has come up with clever / useful solutions. 👍👍
Convoys is a simple library that fits a few statistical models useful for modeling time-lagged conversion rates.
Estimating eventual conversion rates for recent cohorts is a critically important problem for anyone who works at a startup. You can’t wait 12 months to see if your acquisition efforts are effective; you need some mechanism for judging them within 7, 14, or 30 days (maybe less!). I had always used the Kaplan-Meier estimator to handle this: it’s a simple, easily explainable way to estimate outcomes that haven’t been observed yet. Turns out there is a better way.
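For readers who haven't used it, here is a minimal sketch of the Kaplan-Meier idea applied to conversions. It is a plain-Python illustration written for this note, not code from any of the linked posts; the function name and the sample data are hypothetical. Each user has a duration (time to conversion, or time observed so far if they haven't converted) and a flag saying whether they converted.

```python
def kaplan_meier_conversion(durations, converted, t):
    """Estimate the fraction of users converted by time t.

    durations: time to conversion, or time observed so far if not converted
    converted: True if the user converted, False if still unconverted (censored)
    """
    # Distinct times at which at least one conversion happened, up to t.
    event_times = sorted(
        {d for d, c in zip(durations, converted) if c and d <= t}
    )
    not_converted = 1.0  # running product: probability of not having converted
    for event_time in event_times:
        # Users still being observed (and unconverted) just before this time.
        at_risk = sum(1 for d in durations if d >= event_time)
        # Users who converted exactly at this time.
        events = sum(
            1 for d, c in zip(durations, converted) if c and d == event_time
        )
        not_converted *= 1.0 - events / at_risk
    return 1.0 - not_converted


# Made-up cohort: two conversions, two users who were only observed briefly.
durations = [1, 1, 2, 3]
converted = [True, False, True, False]
estimate = kaplan_meier_conversion(durations, converted, t=3)
print(f"Estimated conversion by t=3: {estimate:.3f}")
```

The naive rate for this made-up cohort is 2 of 4; the estimator comes out higher because users who were only observed briefly drop out of the denominator once their observation window ends.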
It turns out we can model conversions by essentially thinking of conversions as a logistic regression model multiplied by a distribution over time which is usually a Weibull distribution or something similar. Convoys implements a few of these models with some utility functions for data conversion and plotting.
Highly recommended. I haven’t had the opportunity to play with it yet but plan to on an upcoming project.
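To make the "logistic regression multiplied by a distribution over time" idea concrete, here is a rough sketch using only the Python standard library. This is not the Convoys API: the function names, the parameter names (`logit`, `scale`, `shape`), and the parameter values are assumptions chosen for illustration, and no fitting is performed.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def weibull_cdf(t: float, scale: float, shape: float) -> float:
    """Share of eventual converters who have converted by time t."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def expected_conversion(t: float, logit: float, scale: float, shape: float) -> float:
    """P(converted by t) = P(ever converts) * P(converted by t | ever converts)."""
    return sigmoid(logit) * weibull_cdf(t, scale, shape)


# Hypothetical parameters: half of users eventually convert (logit = 0),
# with a characteristic conversion time of 30 days.
for day in (7, 14, 30, 365):
    rate = expected_conversion(day, logit=0.0, scale=30.0, shape=1.5)
    print(f"day {day:>3}: expected conversion {rate:.3f}")
```

The logistic term sets the eventual conversion rate, and the Weibull term describes how quickly a cohort approaches it. In a real model the logistic score would depend on cohort features and all parameters would be estimated from data.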
Michael Kaminsky is blowing up the current notion of an A/B test:
Traditional A/B testing rests on a fundamentally flawed premise. Most of the time, version A will be better for some subgroups, and version B will be better for others. Choosing either A or B is inherently inferior to choosing a targeted mix of A and B.
This is a fairly basic statement—it seems obviously true upon immediate consideration. It turns out to have rather significant impacts, though, if taken seriously. That is what the rest of the post explores.
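A toy calculation shows why a targeted mix can beat either variant on its own. The segments, conversion rates, and traffic shares below are invented purely for illustration; they do not come from the post.

```python
# Hypothetical conversion rates per segment and variant (made-up numbers).
rates = {
    "new_users": {"A": 0.10, "B": 0.06},
    "returning_users": {"A": 0.04, "B": 0.08},
}
# Hypothetical share of traffic in each segment.
weights = {"new_users": 0.5, "returning_users": 0.5}


def overall_rate(policy):
    """Blended conversion rate when each segment is shown policy[segment]."""
    return sum(weights[seg] * rates[seg][policy[seg]] for seg in rates)


all_a = overall_rate({seg: "A" for seg in rates})
all_b = overall_rate({seg: "B" for seg in rates})
# Targeted policy: show each segment whichever variant converts better for it.
targeted = overall_rate({seg: max(rates[seg], key=rates[seg].get) for seg in rates})

print(f"everyone gets A: {all_a:.3f}")
print(f"everyone gets B: {all_b:.3f}")
print(f"targeted mix:    {targeted:.3f}")
```

In this made-up setup a standard A/B test would call the two variants a tie, while the targeted policy outperforms both. The hard part in practice is that segment-level rates have to be estimated from data, with all the noise that implies.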
The obvious benefit to working quickly is that you’ll finish more stuff per unit time. But there’s more to it than that. If you work quickly, the cost of doing something new will seem lower in your mind. So you’ll be inclined to do more.
I strongly believe this to be true. This post is a classic—it’s actually from 2015. I came across it recently and it reminded me of a Twitter conversation that I just had with some fantastic analysts.
The crux of the conversation for me was that, as an analyst, it actually does matter that you’re good at writing code. The better your technical chops, the faster you can get to answers, and therefore the more answers you can discover. It becomes a virtuous, non-linear cycle of knowing stuff.
Click through to the tweet below to see the full thread.
@bennstancil @oldjacket @ianblu1 @Nonnormalizable @jakeklamka @davidjwallace As such, the better you are at writing code, the lower friction you will have in the iterative process of asking, answering, asking, answering... And the lower friction = more iterations = more refined viewpoints. You can literally just know more stuff.
Did you know that there are currently thousands of satellites orbiting the Earth? I certainly did not, and would have guessed a few hundred at most. Today, high school and college students design, fabricate, and launch nano-, pico-, and even femto-satellites such as CubeSats, PocketQubes, and SunCubes. On the commercial side, organizations of any size can now launch satellites for Earth observation, communication, media distribution, and so forth.
All of these satellites collect a lot of data, and that’s where things get even more interesting. While it is now relatively cheap to get a satellite into Low Earth Orbit (LEO) or Medium Earth Orbit (MEO) and only slightly more expensive to achieve a more distant Geostationary Orbit, getting that data back to Earth is still more difficult than it should be. Large-scale satellite operators often build and run their own ground stations at a cost of up to one million dollars or more each…
Your current project probably doesn’t have you collecting data from orbiting satellites, but how freaking cool is it that your next one could? AWS now offers extra-planetary data connectivity. The future is here.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to build analytics teams. Whether you’re looking to get analytics off the ground after your Series A or need support scaling, let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
915 Spring Garden St., Suite 500, Philadelphia, PA 19123