Prefect. Airflow @ Lyft. Brooklin @ Linkedin. Data Warehouse Organizing Principles. Sampling Algos. [DSR #193]
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
I’ve been following Prefect with interest since its open source release in March of this year, as have many members of the data community. Airflow is so prevalent as an orchestrator, and Prefect so explicitly aims to build a next-generation version of it, that many have taken notice (although with plenty of very natural skepticism!).
This post is the best answer yet to the questions “How is Prefect different than Airflow?” and “Why might I want to use Prefect instead of Airflow?”
Data engineering, and specifically in this case batch orchestration, is certainly not a solved problem. Airflow was a massive step forward for the entire industry, but it’s now been the dominant orchestration tool for probably 4 years. If nothing else, this post represents smart people who have thought hard about what the next frontiers in batch orchestration will be.
Also re: Prefect—check out one of their cofounders, Jeremiah Lowin, on the Data Engineering Podcast last month. It was a good interview; Jeremiah is a dynamic speaker.
Sticking with Airflow for a second, this is a stellar post where the Lyft data eng team talks about their production Airflow deployment (500+ DAGs!). They discuss:
monitoring & SLAs
customizations they’ve made
production performance and reliability
Even as this post outlines a very sophisticated Airflow environment, it’s hard to miss the areas of the product where duct tape had to be applied. The monitoring system, in particular, felt somewhat rudimentary relative to its criticality—there is clearly a lot of scope for a managed service to add value here.
This project comes out of LinkedIn, birthplace of Kafka, and is a very close cousin. My read of this release post is that LinkedIn is primarily using this as a layer on top of Kafka that allows it to replicate Kafka streams to multiple (often quite disparate) environments. This functionality was originally provided by Kafka MirrorMaker, and this post does a good job of explaining the utility:
Kafka supports internal replication to support data availability within a cluster. However, enterprises require that the data availability and durability guarantees span entire cluster and site failures.
If you need your Kafka streams to be redundant across multiple datacenters, you’re a serious data engineering organization. Very cool that this seemingly quite mature project has been released into the wild.
3 * 5
I don’t know how it happened that three of my favorite posts this week were all listicles of five. 🤷
The natural state of the universe is chaos: entropy tends to increase in closed systems. So too with data warehouses: unless action is taken to maintain order in your data warehouse, it will inevitably spiral into a hard-to-navigate, hard-to-operate collection of objects that you’re too afraid to delete.
This is true. We (Fishtown Analytics) are often asked to work inside of an existing warehouse. Sometimes that experience is great, sometimes…less so. Keeping your warehouse organized is often the difference between a productive and an unproductive data team.
Sampling is an important topic in data science and we really don’t talk about it as much as we should.
100% agree. This post is short and sweet, and covers a topic I’ve never seen written about compellingly before. Highly recommended.
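The post’s specific algorithms aren’t reproduced here, but as one classic example of the genre, reservoir sampling (Algorithm R) draws a uniform random sample from a stream of unknown length in a single pass—useful whenever your data doesn’t fit in memory. A minimal sketch (function name and structure are my own illustration, not from the linked post):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream
    of unknown length, in one pass (Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1),
            # which keeps every item's inclusion probability equal.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each item in the stream ends up in the sample with probability exactly k/n, no matter how long the stream turns out to be.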
This is a great post—it’s in a vein I think is very important. Data scientists & analysts write a lot of code, but often aren’t ever taught how to write good code. It turns out that applying fundamental software design principles to any code is incredibly useful.
There is no single post that will take you from zero-to-sixty on this topic in ten minutes, but this post acts as a nice teaser: the principles in it guide decades of software development and contain a tremendous amount of collected wisdom.
Data Viz of the Week
Wow. Parks, rec, & leisure?! Would not have predicted that.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
If you don't want these updates anymore, please unsubscribe here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123