❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
Many data pipelines are miserable to monitor and troubleshoot. This used to be true of applications as well, but the state of the art in application development has since advanced in both process and tooling:
Observability is a fast-growing concept in the Ops community that caught fire in recent years, led by major monitoring/logging companies and thought leaders like Datadog, Splunk, New Relic, and Sumo Logic. Observability allows engineers to understand if a system works like it is supposed to work, based on a deep understanding of its internal state and context of where it operates.
So…let’s just apply these tools to data engineering, right? Wrong.
Using these general-purpose tools, Data Engineering teams can gain insight into high-level job (or DAG) statuses and summary database performance but will lack visibility into the right level of information they need to manage their pipelines. The reason standard tools don’t cut it is because data pipelines behave very differently than software applications and infrastructure.
This is just the setup: the post itself contains tons of detailed suggestions for data eng observability.
Wow! I linked to Part 1 in this series two weeks ago, and Parts 2 and 3 are already out. And they're so good: these posts collectively are the single best thinking I've seen to date on how data career paths should be structured.
I certainly think there’s a lot of work left to do in this area, but hopefully these three posts will both convince data teams that they need career ladders and will give them enough of a structure to start with. From there, the industry will run lots of experiments.
This is a surprisingly interesting article on a long-running religious war: leading or trailing commas in SQL? The conclusion is that there is a consistently observable decrease in error rate when using leading commas! This doesn't totally surprise me. But I, like the author of this post (the inimitable Benn Stancil of Mode), find myself unrepentant. Benn's comments in the final paragraph are amusingly extreme, but I do agree: I refuse to use leading commas simply because I find them ugly, and I do not want to stare at code all day that I find ugly. Error rates be damned.
This is somewhat of a silly topic, but it’s a surprisingly good analysis from one of the most authoritative sources possible.
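For anyone who hasn't encountered the debate, the two styles look like this (a hypothetical query; the column and table names are mine, not from the article):

```sql
-- Trailing commas: the conventional style
SELECT
    order_id,
    customer_id,
    order_total
FROM orders;

-- Leading commas: every column after the first starts with a comma,
-- so commenting out or reordering any non-first line never leaves
-- a dangling comma behind (the claimed source of the lower error rate)
SELECT
    order_id
    , customer_id
    , order_total
FROM orders;
```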
About 15 months ago, I left my full-time job as a machine learning team lead with the goal of doing independent / freelance data science consulting. Since then, I’ve gotten a lot of questions about what that means and entails. I have not found too much information about this type of work, other than Greg Reda’s fantastic post. I hope this blog post answers some of those questions for anybody interested in becoming or hiring a data science consultant.
Like the author says, there just aren’t many posts about this. Almost four years into starting Fishtown Analytics, I believe more than ever that there is an incredible amount of demand for data talent, and more willingness than ever to source that talent externally. If you’ve ever considered going out on your own, I very much recommend it, and this post is a great resource.
If you’re a data scientist, you’ve surely encountered the question, “How big should this A/B test be?” The standard answer is to do a power analysis, typically aiming for 80% power at α=5%. But if you think about it, this advice is pretty weird. Why is 80% power the best choice for your business? And doesn’t a 5% significance cutoff seem pretty arbitrary?
I like this post a lot: 80% power and 5% significance were originally chosen primarily for academic work, and as statisticians move into business domains they generally don't revisit these baseline numbers. But power and significance are statements about priorities. How much does time matter? How much does certainty matter? It turns out that the answers to these questions are quite different for organizations operating in different contexts.
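To make the standard recipe concrete, here's a minimal sketch of the usual sample-size calculation for a two-proportion A/B test at 80% power and α=5%, using only the Python standard library (the baseline and lift numbers are hypothetical, not from the post):

```python
from math import ceil
from statistics import NormalDist

def required_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided
    two-proportion z-test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = z.inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical example: detect a lift from a 10% to a 12% conversion rate
print(required_n_per_group(0.10, 0.12))  # roughly 3,800 users per group
```

Notice how the answer moves when the inputs move: loosening power to 70% or α to 10% shrinks the required sample considerably, which is exactly the kind of trade-off the post argues teams should be making deliberately rather than inheriting from academic convention.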
Really fantastic, thoughtful post.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
If you don't want these updates anymore, please unsubscribe here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123