#BlackLivesMatter ✊🏿✊🏽✊🏻. Proficiency vs. Creativity. Serving ML Models in Production. Joins in Druid. Designing Metrics. [DSR #227]

What a hard two weeks. The murder of George Floyd was heartbreaking, and the courage displayed by protestors around the country has been inspiring. My friend Josh Laurito said it well in his newsletter:

Black Lives Matter, and I believe that anyone in America with a platform or an audience of any kind who recognizes that should affirm the same, no matter how apolitical their topic or audience is.

I agree.

There are a number of actions that I’ve personally taken this week, but supporting my friend and cofounder Drew Banin was probably the biggest:



To be silent is to be complicit. https://t.co/m3n6B9cF29

2:47 PM - 4 Jun 2020

Immediately after Drew’s tweet, Connor McArthur and I followed suit, each donating $1,900 and matching another $1,900. We each have some room left on our matches; please respond on the Twitter thread with a receipt and you’ll get a “like” back once one of us has matched.

I’m often a voice who has strong opinions, who has answers. I don’t have answers here. My statement is one of solidarity and support: to the Black Community, to the protestors, to allies who align themselves publicly with the cause. I’m here, I’m listening, I’m with you.


Lots of love.

- Tristan

This week's best data science articles

Proficiency v. Creativity

Your data team has to produce solid data. The pipelines have to run, the logic in your transformations has to be sound, and the report has to show accurate revenue. Those fundamentals are hard to argue with. But if that’s all you’re doing, your team is probably bored and your organization definitely isn’t getting as much value as it could out of its data.

Open-ended creative work is a huge part of the appeal of working in this field - identifying opportunities to improve processes, appeal to new customers, or build better products adds value for the organization, but it is also just incredibly personally satisfying. One of the fundamental challenges of managing a data team is balancing the need for rigor and reliability with the team’s desire to spend most of their time creating new knowledge. How do we manage those sometimes conflicting priorities?

This is an exploration of a topic that I’ve never thought deeply about, although it is absolutely a tension I’ve experienced. Another fantastic post by Caitlin Moorman.


Personal Statement from Rediet Abebe

Rediet Abebe recently joined UC Berkeley’s EECS Department as an Assistant Professor. At Berkeley, she will be the first Black female computer science professor on faculty. In this excerpt from her application’s personal statement that she recently shared, she describes her undergraduate experience as a low-income international student from Ethiopia and as one of the only Black women in the Mathematics department.

If creativity is a critical ingredient in the data science process, diversity and inclusion are at least as important to the work we do as to any other field.

H/T to Amplify’s excellent data science newsletter on this post, and thanks to Professor Abebe for sharing her story.


Introduction to JOINs in Apache Druid

In Apache Druid 0.18/Imply 3.3, we added support for SQL Joins in Druid. This capability, which has long been demanded from Druid by the community, opens the door to a large number of possibilities in the future. In this blog I want to highlight some of the motivations behind us undertaking the effort and give you, the reader, an understanding of how it can be useful and where we’re going with it.

This is a very interesting development. Druid is a powerful tool for in-memory analytics and has incredible response times in many contexts; to date, the lack of joins has been one of the big challenges associated with using it for many workloads.

Druid still doesn’t support arbitrary joins from any table to any table (although that appears to be on the way); for details on the functionality, please do dig into the article.

This is an interesting thread to follow for anyone in the industry, because the core data processing technologies define so much of what is possible both up- and downstream of them. They define the “laws of physics” for the data ecosystem at any one point in time. And Druid, ClickHouse, and Pinot are mounting an interesting challenge to the now-status-quo of BigQuery and Snowflake. What fundamental changes in the ecosystem could we see if Druid becomes a primary destination for SQL workloads in the coming years?


How to Serve Models

There are many ways to serve ml(machine learning) models, but these are the most common 3 patterns I observed over the years:

1) Materialize/Compute predictions offline and serve through a database,

2) Use model within the main application, model serving/deployment can be done with main application deployment,

3) Use model separately in a microservice architecture where you send input and get output

Yep yep yep. This is the clearest post I’ve read on this topic; extremely helpful if you’re thinking about how to design a production ML system right now. The author runs the search engineering team at Jet.com, and his recommendations are those of an experienced practitioner: he doesn’t push the reader straight to the most architecturally “pure” approach (#3), very much recognizing the overhead of running the microservices architecture that it requires.
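To make the three patterns concrete, here is a minimal Python sketch. It is not taken from the linked post: `toy_model`, `materialize_predictions`, `App`, `model_service`, and `call_service` are all hypothetical names, the "model" is a trivial function, and the service boundary in pattern 3 is simulated with JSON strings rather than a real HTTP call.

```python
import json
from typing import Callable, Dict

Features = Dict[str, float]


def toy_model(features: Features) -> float:
    """Hypothetical stand-in for a trained model: features in, score out."""
    return 2.0 * features.get("clicks", 0.0) + 1.0


# Pattern 1: compute predictions offline and serve them from a lookup table.
# A plain dict stands in for the database here.
def materialize_predictions(
    model: Callable[[Features], float], rows: Dict[str, Features]
) -> Dict[str, float]:
    return {key: model(features) for key, features in rows.items()}


# Pattern 2: the model lives inside the main application process and is
# deployed together with it.
class App:
    def __init__(self, model: Callable[[Features], float]) -> None:
        self.model = model

    def handle_request(self, features: Features) -> Dict[str, float]:
        return {"score": self.model(features)}


# Pattern 3: the model sits behind a service boundary. The caller only sees
# serialized input and output; in a real system this would be a network call.
def model_service(request_body: str) -> str:
    features = json.loads(request_body)
    return json.dumps({"score": toy_model(features)})


def call_service(features: Features) -> float:
    response_body = model_service(json.dumps(features))
    return json.loads(response_body)["score"]
```

The trade-off shows up even at this scale: pattern 1 has no model code at request time but can only answer for keys computed in advance, pattern 2 is a single deployable unit, and pattern 3 adds a serialization and deployment boundary in exchange for independence from the main application.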


Tracking State with Type 2 Dimensions

Application databases are generally designed to only track current state. (…) But, as analysts, we not only care about the current state (how many users are using feature “X” as of today), but also the historical state. (…) To accomplish these use cases we need a data model that tracks historical state.

In this post, I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).

This post does an excellent job of tackling what can be a very dry subject. I want to add two things:

  1. Capturing historical state in the warehouse is a superpower most data folks don’t know that they have! It has saved my ass multiple times. If you’re not familiar with dbt’s snapshots feature, check it out.

  2. It’s neat to see Shopify using dbt, and it’s also neat to see how much simpler it is to accomplish this use case in dbt than in the corresponding PySpark code!
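For readers who haven’t seen a type 2 dimension before, here is a rough Python sketch of the core idea: keep every version of a record along with the date range in which it was valid. This is neither the post’s PySpark code nor how dbt snapshots are implemented; `DimRow`, `apply_snapshot`, and the `user_id` / `plan` columns are hypothetical, and the sketch assumes one tracked attribute and ignores rows deleted from the source.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DimRow:
    user_id: int
    plan: str
    valid_from: date
    valid_to: Optional[date]  # None marks the currently valid version


def apply_snapshot(
    history: List[DimRow], current_state: Dict[int, str], as_of: date
) -> List[DimRow]:
    """Merge the application's current state into the historical table."""
    result: List[DimRow] = []
    seen = set()
    for row in history:
        if row.valid_to is not None:
            # Already-closed versions are never modified.
            result.append(row)
            continue
        seen.add(row.user_id)
        new_plan = current_state.get(row.user_id)
        if new_plan is not None and new_plan != row.plan:
            # The attribute changed: close the old version, open a new one.
            result.append(replace(row, valid_to=as_of))
            result.append(DimRow(row.user_id, new_plan, as_of, None))
        else:
            # Unchanged (or missing from the source): leave the row open.
            result.append(row)
    # Keys that have no open version yet get their first version.
    for user_id, plan in current_state.items():
        if user_id not in seen:
            result.append(DimRow(user_id, plan, as_of, None))
    return result
```

Running `apply_snapshot` on each load means a question like “which plan was this user on in March?” becomes a range filter on `valid_from` and `valid_to`, even though the application database only ever stored the latest value.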


Designing and evaluating metrics

Such a wonderful definition:

A metric is simultaneously 1) a designed artifact, 2) a lens through which we observe phenomena, and 3) a way we set and monitor goals.

This post is an excellent discussion of how to design a metric, but also, importantly, of how to think about the lifecycle of a metric. This was a new thought to me, but it immediately resonated. Metrics don’t last forever: they are born of a need, and eventually are retired once they have achieved their goal.



Thanks to our sponsors!

dbt: Your Entire Analytics Engineering Workflow

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analyzing it. dbt allows anyone comfortable with SQL to own that workflow.


Stitch: Simple, Powerful ETL Built for Developers

Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.


By Tristan Handy

The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.
