Analytics is a Mess. Good DS, Bad DS. Analytics Engineering vs. Data Engineering. Understanding the Optimizer. [DSR #252]
This week's best data science articles
There’s so much to like in this post. The meat of the argument can be summed up as follows:
There is no correct win rate waiting to be unearthed; one version isn’t true while another is false. Each version is equally accurate because they are tautological: They measure precisely what they say they measure, no more and no less. Our job as analysts isn’t to do the math right so that we can figure out which answer is in the back of the book; it’s to determine which version, out of a subjective set of options, helps us best run a business.
This gets so effectively at the thing that I’ve found to be the single hardest thing to mentor other data analysts on: business context! What is interesting? What should make you curious to dig in deeper? What change would a given piece of information lead you to make in the world?
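To make the win-rate point concrete, here is a minimal sketch. The deals and the three definitions are made up for illustration and are not taken from Benn's post. Each function measures exactly what its name says, and each returns a different number from the same data.

```python
from dataclasses import dataclass


@dataclass
class Deal:
    status: str    # "won", "lost", or "open"
    amount: float  # deal size in dollars


# Hypothetical pipeline, invented for this example.
deals = [
    Deal("won", 100.0),
    Deal("won", 50.0),
    Deal("lost", 200.0),
    Deal("lost", 50.0),
    Deal("open", 100.0),
]


def win_rate_closed(deals):
    """Won deals divided by closed deals (open deals are ignored)."""
    closed = [d for d in deals if d.status in ("won", "lost")]
    return sum(d.status == "won" for d in closed) / len(closed)


def win_rate_all(deals):
    """Won deals divided by every deal ever opened."""
    return sum(d.status == "won" for d in deals) / len(deals)


def win_rate_dollars(deals):
    """Won dollars divided by closed dollars."""
    closed = [d for d in deals if d.status in ("won", "lost")]
    won = sum(d.amount for d in closed if d.status == "won")
    return won / sum(d.amount for d in closed)


for fn in (win_rate_closed, win_rate_all, win_rate_dollars):
    print(f"{fn.__name__}: {fn(deals):.3f}")
```

None of the three is wrong. Which one to report depends on the decision it is meant to inform, which is exactly the business-context question.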
Another thing I’ve been thinking about recently is explicitly creating separate workspaces for curated (production) and messy (experimental, not-yet-production) work. I think one of the challenges Benn is describing isn’t just that analytics is messy…it’s that teams often co-mingle the messy with the clean parts of the process.
That ends up coming down to environment management. Creating end-to-end workflows that facilitate environment management is harder than it should be today. How could we make that easier as an ecosystem?
Good DS starts simple, ships, and then iterates. Bad DS starts with the most advanced technique they know.
This is just one of many fantastic nuggets in this post. If you are, or know, someone who is starting out their career in data science, please share this with them. These insights are the ones rarely focused on and yet far more determinative of success than the specific programming languages and statistical techniques in your tool belt.
Preach! Some selected spicy quotes:
It turns out analytics engineering is a goddamn superpower.
No data analyst, anywhere, has ever ever come close to performing all of the analysis they think could be impactful at their organization.
Software is commonly said to be eating the world — analytics engineering will be embedded in the world.
Analytics engineering is fundamentally a discipline that’s about making sense of the world around us.
Could not agree more! Also: I feel like I’m attending a rally. Jason, are you running for office? 🔥🔥
Data engineers love building complex Rube Goldberg machines which can be easily replaced with simpler systems that run at a fraction of the original cost.
This tweet is a very succinct summary of one of the most important posts that I’ve ever read: Engineers Shouldn’t Write ETL. If you find yourself saying, “hey, data engineers are valuable!” you’re not wrong–it’s that the org structure that they typically operate in leads to very poor outcomes and rampant mediocrity. Read the above post to understand why that’s the case. It’s just as true today as the day it was penned in 2016.
Heh…wow. This is a post that I had queued up to read for the past couple of months and am only now getting back to. It’s one of the deeper blog posts on the link between the SQL you wrote and the explain plan your database ran.
This is one of the biggest areas that I see new analytics engineers struggle with and is probably the deepest that an AE has to go on a purely technical / CS fundamentals continuum. In fact, you can skip this knowledge and just kinda cross your fingers that the optimizer will give you good results for a little while…but if you want to truly feel confident traversing any dataset, this is knowledge you need.
The post focuses on how different database engines optimize correlated subqueries. Here’s just the tip of the iceberg to give you a taste:
The easiest way to execute this is to run the subquery once for each row in the outer query, but this is potentially very inefficient. Databases rely on being able to collect, reorder and batch operations to reduce interpreter overhead and optimize memory access patterns. Running the same query many many times in a nested loop reduces that optimization freedom.
The author is a true expert in the field and is quite good at making somewhat arcane concepts accessible IMO.
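If you haven't run into a correlated subquery before, here is a small sketch using Python's built-in sqlite3 module. The tables and data are hypothetical and are not from the post. It runs a correlated subquery and a hand-written join-and-aggregate rewrite that returns the same rows, then prints each query's plan. The plan text varies by SQLite version, and other engines may plan these queries very differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.0);
""")

# Correlated: the inner query refers to c.id from the outer row, so the
# naive execution strategy runs it once per customer.
correlated = """
    SELECT c.name,
           (SELECT COUNT(*) FROM orders o WHERE o.customer_id = c.id) AS n_orders
    FROM customers c
    ORDER BY c.name
"""

# Manual rewrite: aggregate orders once, then join the result back.
decorrelated = """
    SELECT c.name, COALESCE(agg.n_orders, 0) AS n_orders
    FROM customers c
    LEFT JOIN (
        SELECT customer_id, COUNT(*) AS n_orders
        FROM orders
        GROUP BY customer_id
    ) agg ON agg.customer_id = c.id
    ORDER BY c.name
"""


def explain(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


rows_correlated = conn.execute(correlated).fetchall()
rows_decorrelated = conn.execute(decorrelated).fetchall()

for label, sql in (("correlated", correlated), ("decorrelated", decorrelated)):
    print(label)
    for step in explain(sql):
        print("  ", step)
```

Both queries return the same answer. What differs is how much freedom the engine has to batch the work, and the explain plan is where you get to see what it chose.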
Popular Dev Tools aren't just solving a problem. They solve core emotional needs
* HuggingFace makes you feel smart
* Unity makes you feel like a kid again
* Github makes you feel seen
* Fastai makes you feel like you belong
* VSCode makes you feel like a tinkerer
This caught my eye for two reasons:
Oooook…software developers like to get married to each other in the Bay Area! The rate at which this match happens relative to the prediction is really something.
The visualization type—categorical actual vs. predicted—is one that is really under-utilized in most BI and EDA. What datasets do you use every day that might warrant this treatment?
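For a rough idea of how you might build the numbers behind a chart like this, here is a minimal sketch. The pairs are invented, and the "predicted" count assumes the two sides pair up independently. That is my assumption for illustration, not necessarily how the original visualization computed its predictions.

```python
from collections import Counter

# Hypothetical (category_a, category_b) pairs, invented for this example.
pairs = [
    ("dev", "dev"), ("dev", "dev"), ("dev", "dev"),
    ("dev", "nurse"),
    ("nurse", "dev"),
    ("nurse", "nurse"), ("nurse", "nurse"),
    ("teacher", "nurse"),
]


def actual_vs_predicted(pairs):
    """Map each (a, b) combination to (actual count, predicted count).

    The predicted count assumes independence between the two sides:
    count(a) * count(b) / total number of pairs.
    """
    n = len(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    actual = Counter(pairs)
    table = {}
    for a in left:
        for b in right:
            predicted = left[a] * right[b] / n
            table[(a, b)] = (actual[(a, b)], predicted)
    return table


table = actual_vs_predicted(pairs)
for (a, b), (act, pred) in sorted(table.items()):
    print(f"{a:8s} {b:8s} actual={act} predicted={pred:.2f}")
```

A combination whose actual count sits well above its predicted count is happening more often than chance alone would suggest, and those are the cells worth getting curious about.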
How does using dbt make you feel? Reply here.
Thanks to our sponsor!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.