Data Science Reality != Expectations. Proxy Metrics @ Netflix. Dolt. An Introduction to Circuits. [DSR #223]
I’m switching to my summer schedule (one issue every other week) a bit early this year in light of the demands on my personal life right this minute. Thanks for your patience…good stuff still coming your way, just a little bit less frequently!
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
Seven common ways a data science role may not meet your expectations.
This is likely not the first time you’ve read an article like this. “Companies suck at data science and the job is much less rewarding than you’re hoping.” This is both true and (at this point in the industry) not that interesting: it should be common knowledge by now that most companies suck at data science. It’s unclear why we would expect otherwise, given that the field (a) is extremely new, (b) has largely inexperienced practitioners, (c) requires a high degree of technical acumen, and (d) is much more experimental than most companies are comfortable with. Of course companies suck at it, and thus of course most data science jobs are disappointing if you came in with high expectations.
So why am I linking to this? What’s interesting to consider is: given all of the above, what does one do? The author doesn’t provide a ton of guidance on his suggested response for practitioners, so I wanted to try. Hopefully you find this helpful.
Heavily weight team strength vs. everything else when considering offers. Likewise for the company’s overall posture towards data. Working at Etsy or Stitch Fix is a fundamentally different experience than working at 50-person-startup-X as data hire #1.
Be open to less “exciting” roles at the start of your career in favor of, again, team strength. It’s better to join Spotify as a data analyst than company-X as a data scientist. Seriously. You will just learn so much.
You may not find that you’re getting offers with the most compelling teams. Rather than taking data scientist positions at companies that will give you a shitty experience, consider taking line-of-business roles that you’re qualified for at data-forward companies and bringing data in as a skillset rather than your primary job function. This is how many people get into data, and while it takes a bit longer IMO it’s a great path.
Overall, the correct response is to change your career planning strategy, not your long-term goals. Good luck.
Former VP Product @ Netflix talks about a topic that I’ve not seen written about before but one that is close to my heart. I used to run a marketing team, and we did a lot of experimentation. We often got pressure to optimize for revenue in our experiments—to track the impact of our changes all the way through the funnel, validating that each experiment actually resulted in new revenue. There is a logic to this: our most important task as a marketing team was to drive new revenue after all!
But proving an effect on the new revenue number was incredibly hard. It required a massive sample size and a very large experimental effect. We ended up picking different metrics to optimize for with different experiments. The author calls these “proxy metrics”: they act as stand-ins for the “north-star” metric but are easier to experiment on.
Proxy metrics are a stand-in for your North Star product metric. First, you seek a correlation between your high-level metric and the proxy metric. Later you work to prove causation.
This article puts words around this struggle I used to have far better than I ever did. 👍👍
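To make the sample-size problem concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two conversion rates. The baseline rates and the 5% relative lift are hypothetical numbers chosen for illustration, not figures from the article, and `sample_size_per_arm` is a name made up for this example.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect a relative lift in a
    conversion rate with a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical funnel: 1% of visitors end up paying, 20% complete a signup.
n_revenue = sample_size_per_arm(baseline_rate=0.01, relative_lift=0.05)
n_proxy = sample_size_per_arm(baseline_rate=0.20, relative_lift=0.05)

print(f"users per arm, revenue metric: {n_revenue:,}")
print(f"users per arm, proxy metric:   {n_proxy:,}")
```

Under these assumptions the rare, end-of-funnel metric needs many times more traffic than the proxy to detect the same relative lift, which is the practical reason to experiment on a proxy first and validate its link to revenue separately.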
Dolt is a relational database, i.e. it has tables, and you can execute SQL queries against those tables. It also has version control primitives that operate at the level of the table cell. Thus Dolt is a database that supports fine-grained, value-wise version control, where all changes to data and schema are stored in a commit log.
This is interesting–another data version control solution! DVC and Pachyderm have been operating in this space for years, but this is the first solution I’ve seen to essentially turn the log of a database into a git commit stream. I can’t tell whether that feels like a…good idea…or not?? I’ll be interested to follow how Dolt evolves.
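To illustrate what "version control at the level of the table cell" might mean, here is a toy sketch of a cell-wise diff between two snapshots of a table. It is not Dolt's implementation or API; the table layout (a dict keyed by primary key), the sample rows, and the `cell_diff` helper are all assumptions made up for this example.

```python
def cell_diff(old, new):
    """Return {(primary_key, column): (old_value, new_value)} for every cell
    that differs between two table snapshots.

    Each snapshot maps primary key -> {column: value}. A missing row or
    column is reported as None, so this toy cannot tell "absent" from NULL.
    """
    changes = {}
    for key in old.keys() | new.keys():
        old_row = old.get(key, {})
        new_row = new.get(key, {})
        for column in old_row.keys() | new_row.keys():
            before = old_row.get(column)
            after = new_row.get(column)
            if before != after:
                changes[(key, column)] = (before, after)
    return changes


# Two hypothetical "commits" of the same table.
v1 = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Grace", "city": "New York"},
}
v2 = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Grace", "city": "Arlington"},  # one cell edited
    3: {"name": "Linus", "city": "Helsinki"},   # one row added
}

for (key, column), (before, after) in sorted(cell_diff(v1, v2).items()):
    print(f"row {key}, column {column}: {before!r} -> {after!r}")
```

A commit log in this style would store a sequence of such diffs rather than full copies of the table, which is roughly the git-like behavior the description above is pointing at.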
Most work on interpretability aims to give simple explanations of an entire neural network’s behavior. But what if we instead take an approach inspired by neuroscience or cellular biology, an approach of zooming in? What if we treated individual neurons, even individual weights, as being worthy of serious investigation? What if we were willing to spend thousands of hours tracing through every neuron and its connections? What kind of picture of neural networks would emerge?
In contrast to the typical picture of neural networks as a black box, we’ve been surprised how approachable the network is on this scale. Not only do neurons seem understandable (even ones that initially seemed inscrutable), but the “circuits” of connections between them seem to be meaningful algorithms corresponding to facts about the world. You can watch a circle detector be assembled from curves. You can see a dog head be assembled from eyes, snout, fur and tongue. You can observe how a car is composed from wheels and windows. You can even find circuits implementing simple logic: cases where the network implements AND, OR or XOR over high-level visual features.
I really like this line of thinking.
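For a feel of how a handful of neurons can implement logic over features, here is a minimal sketch of an XOR "circuit" built from two ReLU units. The weights are hand-picked for illustration, not taken from the article or from any trained network, and the inputs stand in for hypothetical feature detectors whose activations are 0 or 1.

```python
def relu(x):
    """Rectified linear unit: passes positive values, clamps negatives to 0."""
    return max(0.0, x)


def xor_circuit(a, b):
    """Compute XOR of two binary feature activations with two hidden units."""
    h_any = relu(a + b)       # active when at least one feature is present
    h_both = relu(a + b - 1)  # active only when both features are present
    # Output neuron: excited by "any", strongly inhibited by "both".
    return h_any - 2 * h_both


for a in (0, 1):
    for b in (0, 1):
        print(f"a={a} b={b} -> {xor_circuit(a, b)}")
```

Reading the weights directly tells you what each unit does, which is the kind of inspection the authors describe doing at scale on real vision models.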
The authors have built (and open-sourced) a model to predict unemployment claims on a state-by-state basis using some fairly straightforward Google Trends searches. It seems to be quite effective.
Part of the art of data science is choosing the right problem to work on. This problem, using the right dataset, seems to be surprisingly tractable. Impressive work.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.