Airflow on Kubernetes. Procella @ Google. Analytics Engineering. Type-2 SCDs. Drug Discovery. [DSR #197]
❤️ Want to support this project? Forward this email to three friends!
🚀 Forwarded this from a friend? Sign up to the Data Science Roundup here.
This week's best data science articles
TL;DR: only use Kubernetes Operators
Running jobs with heterogeneous dependencies as part of a single DAG feels like it shouldn’t be that hard, but as soon as you have two requirements.txt files things can get bad quickly.
The engineering team at Bluecore didn’t love their original Airflow experience and developed an opinionated solution involving Docker and Kubernetes. They haven’t looked back—the results have been nothing but positive.
Is this the right approach for everyone? I actually don’t know. Airflow came to market prior to the rise of Docker and Kubernetes, but at this point I have a hard time imagining wanting to run a huge Airflow installation without the infrastructure they provide.
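The core of the pattern is simple: give every task its own Docker image and run it with Airflow’s KubernetesPodOperator, so no two jobs ever share a Python environment. A minimal sketch of that idea is below — the import path varies by Airflow version, and the namespace, image names, and scripts are hypothetical stand-ins, not Bluecore’s actual setup:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Each task runs in its own container image, so each job ships its own
# requirements.txt -- nothing is installed on the Airflow workers themselves.
with DAG(
    dag_id="heterogeneous_deps",
    start_date=datetime(2019, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = KubernetesPodOperator(
        task_id="extract",
        name="extract",
        namespace="data-jobs",                         # hypothetical namespace
        image="registry.example.com/extract:latest",   # hypothetical image
        cmds=["python", "extract.py"],
    )
    score = KubernetesPodOperator(
        task_id="score",
        name="score",
        namespace="data-jobs",
        image="registry.example.com/score:latest",     # different deps, different image
        cmds=["python", "score.py"],
    )
    extract >> score  # dependency between two fully isolated environments
```

Airflow only orchestrates here; the actual work (and its dependency hell) lives inside the images.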
This repository is a curated collection of good blog posts and books for Analytics Engineers. It can also be very useful for Data Analysts and Data Scientists.
The job title “Analytics Engineer” is taking off. I just attended a fantastic meetup in NYC with 100+ attendees focused on this role that included super-smart panelists and great audience-generated conversation. There was energy in the room, and the issues under discussion weren’t just technical—how to do x thing with data—they were organizational.
The rise of the analytics engineer is a response not only to a shift in tools and technologies but also to a shift in the way organizations think about building a culture of data. Enabling self-service was the biggest focus of the people at this meetup, a topic that included not only building clean data models but also great documentation, strong stakeholder partnerships, and training programs. Every one of these things is critical for an organization to truly be data-empowered.
This movement towards analytics engineering is the most exciting thing I’ve seen in data for a long time, because it’s focused on organizational outcomes, not individual capabilities.
More to come on this in a future blog post I’m sure. In the meantime, this recently-published reading list is a great place to get started if you’re new to the space. Contains all of the canonical stuff to get you up to speed.
Ummmm: holy shit. The folks @ Google are really living in your and my future. Stick with me for two quotes:
Large organizations… are dealing with exploding data volume and increasing demand for data driven applications. Broadly, these can be categorized as: reporting and dashboarding, embedded statistics in pages, time-series monitoring, and ad-hoc analysis. Typically, organizations build specialized infrastructure for each of these use cases. This, however, creates silos of data and processing, and results in a complex, expensive, and harder to maintain infrastructure.
The big hairy audacious goal of Procella was to “implement a superset of capabilities required to address all of the four use cases… with high scale and performance, in a single product”.
The post itself is a summary of a recent paper out of Google describing their system, Procella. It’s a SQL-based system where you can have your cake and eat it too—it’s blazing fast for each analytical use case listed above. This is an incredibly difficult achievement: rather than accepting the tradeoffs inherent in serving such different workloads, the team found ways to achieve optimal performance for each. The post goes pretty deep on the innovations required, including a new file format and an adaptive optimizer (which is insanely cool).
The past decade has seen a ton of innovation in SQL serving systems, but we’re not done yet. The stuff we’re going to see delivered over the next decade is going to be awesome.
As we abstract higher and higher, we are finding insights through patterns / themes and learning by separating concerns; conversely, as we make our way down to the data points, we are understanding the precision and nuance in our data through learning by example.
This is a fantastic post about data visualization design and about the usefulness of being able to traverse up and down the various levels of abstraction naturally present in data. This feels intuitive when it comes to zooming in and out of Google Maps, but there are very explicit design features that enable the experience to be a good one.
Most modern data warehouse stacks don’t track changes in mutable data. If you load a table into Snowflake via Stitch or Fivetran and a record in that table changes in the source, the corresponding change happens in Snowflake, and that record’s prior state is destroyed completely.
Sometimes, this is 🤷, but sometimes this is a big problem. Critical tables that are used to build financial metrics, in particular, are sensitive to this problem, because finance teams tend to frown upon historicals changing once they’ve been reported to the board.
This post goes into how a Kimball modeling technique, Type-2 Slowly Changing Dimensions, solves this problem, and how to easily implement Type-2 SCDs in dbt.
This is an important and often-overlooked topic, and the post is a perfect entry point if you want to get yourself up to speed.
Author / investor / scientist Nathan Benaich outlines how drug discovery is being completely reworked by computational approaches. The whole post was great, but the part that resonated most was at the very end—the author compares drug companies to game studios, where the core competitive advantage is the process / technology used to repeatably bring new games (or drugs) to market. Computational biology is an upgrade to that machine.
We’re beginning to see what the deployment phase looks like for this round of AI.
Thanks to our sponsors!
Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
Powered by Revue
915 Spring Garden St., Suite 500, Philadelphia, PA 19123