Discover more from The Analytics Engineering Roundup
Decision Speed and One-Way Doors.
Why not all decisions are the same, and why any measure of effective analytics needs to recognize that.
New podcast episode! Julien Le Dem has contributed to Apache Arrow, Apache Iceberg, Apache Parquet, and Marquez. He’s currently leading OpenLineage, an open framework for data lineage collection and analysis. My favorite part of this conversation was Julien’s stories about the early Hadoop days. Oh, to be a fly on the wall…
Get it here. And enjoy the issue!
On Optimizing for Decision Speed
WATCH OUT—this thread contains cheat codes for your career!
Your job as a data practitioner is to facilitate the making of strategic decisions at your organization. Given that, it’s likely important for your career that you not only know how to surface relevant data but that you can shepherd the overall decision-making process forwards. Success == decisions made, not data delivered.
Benn, in conversation with Boris Jabes, picks up this thread and pushes it further, proposing that the single most important metric for measuring an analyst’s success is the time between question and decision. Regardless of what it takes to get there, the analyst who consistently produces the shortest cycle time is the most successful.
There are a lot of good reasons to like this right off the bat, and Benn argues the point well so I’ll spare you (just read the post). What’s really interesting, though, are the potentially counterintuitive reasons this model holds up against challenges. For instance: how could you possibly optimize exclusively on speed and not on quality? Won’t this just produce consistently low-quality but fast decisions? There are other failure modes one could imagine, and Benn’s post does a good job of showing how speed is actually a better single indicator of success than you might at first imagine.
The one place where, IMHO, this model falls down is in “one-way doors.” From the hallowed Amazon leadership principles:
Another tool we use at Amazon to assist in making high-quality, high velocity decisions is a mental model we call one-way and two-way doors. A one-way door decision is one that has significant and often irrevocable consequences—building a fulfillment or data center is an example of a decision that requires a lot of capital expenditure, planning, resources, and thus requires deep and careful analysis. A two-way door decision, on the other hand, is one that has limited and reversible consequences: A/B testing a feature on a site detail page or a mobile app is a basic but elegant example of a reversible decision.
When you step back and look at the decisions you make, you may find that the most of them are two-way door decisions. When we see a two-way door decision, and have enough evidence and reason to believe it could provide a benefit for customers, we simply walk through it. You want to encourage your leaders and employees to act with only about 70% of the data they wish they had—waiting for 90% or more means you are likely moving too slow. And with the ability to easily reverse two-way door decisions, you lower the cost of failure and are able to learn valuable lessons that you can apply in your next innovation.
There are few organizations better at the meta-process of decision making than Amazon, and this distinction, trained into all Amazon managers in their onboarding, is just so unbelievably important. Step one for any decision-making process should be: is this a one-way or two-way door? If it’s a two-way door, speed is everything. If it’s a one-way door, quality is everything.
Add to this the fact that many phenomena are actually power-law distributed, and you realize that the impact of a single irreversible, incorrect decision can actually outweigh that of 100 correct but less important decisions. Because of this, business, strategy, and analytics are not always repeated games: you can only bet the farm once if you lose.
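To make the power-law point concrete, here is a small back-of-the-envelope sketch. The numbers are purely illustrative: it assumes decision impact falls off with rank as 1 / rank², and that exponent is my assumption for the sake of the arithmetic, not something measured.

```python
# Illustrative only: assume a decision's impact falls off as a power law in
# its rank, impact(k) = 1 / k**ALPHA. ALPHA = 2 is an assumed exponent.
ALPHA = 2.0


def impact(rank: int, alpha: float = ALPHA) -> float:
    """Relative impact of the decision at a given rank (1 = most important)."""
    return 1.0 / rank ** alpha


top_decision = impact(1)                               # the single biggest decision
next_hundred = sum(impact(k) for k in range(2, 102))   # ranks 2 through 101

print(f"top decision:      {top_decision:.3f}")
print(f"next 100 combined: {next_hundred:.3f}")
```

Under that assumed distribution, the next hundred decisions combined add up to roughly 0.64 of the top one. Get the big one wrong and a hundred smaller wins do not make up for it.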
I hate to be the downer that introduces nuance and complexity into a pleasingly straightforward model, but any model of analytical quality that I could buy into would need to take into account the type of decision being made (one-way vs. two-way door) and its potential impact on the business (where it likely sits in the power law distribution) prior to choosing whether to optimize for speed or correctness.
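As a rough sketch, the triage described above might look something like this. Every name here is hypothetical, and so is the rule that a low-impact one-way door can still be handled quickly; treat the 0.95 cutoff as a placeholder, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    SPEED = "speed"
    CORRECTNESS = "correctness"


@dataclass
class Decision:
    name: str
    reversible: bool          # True = two-way door, False = one-way door
    impact_percentile: float  # 0.0-1.0: rough guess at where it sits in the impact distribution


def what_to_optimize(decision: Decision, high_impact_cutoff: float = 0.95) -> Priority:
    """Pick what to optimize for before starting the analysis.

    Assumed rule: two-way doors always get speed; one-way doors get
    correctness once their estimated impact reaches the cutoff.
    """
    if decision.reversible:
        return Priority.SPEED
    if decision.impact_percentile >= high_impact_cutoff:
        return Priority.CORRECTNESS
    return Priority.SPEED


ab_test = Decision("A/B test a detail-page feature", reversible=True, impact_percentile=0.40)
data_center = Decision("Build a data center", reversible=False, impact_percentile=0.99)

print(what_to_optimize(ab_test).value)
print(what_to_optimize(data_center).value)
```

The two examples are the ones from the Amazon quote: the A/B test comes out as a speed decision and the data center as a correctness decision.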
All that said…more decisions than you would think are, in fact, two-way doors. You should probably be prioritizing speed a lot more frequently than you are today.
One final point from Benn’s post that bears repeating:
This reputational dimension also provides a useful nudge for analysts wondering where they can take their careers next. The second fastest way for us to influence a decision is for people to take our recommendations on their face, no questions asked. But the fastest way to drive a decision is to make it yourself. That, I think, is how analysts go from being advisors to executives: Build such a reputation for making convincing arguments that people simply hand the decision off to you.
+1000! This is not career advice that you will typically hear, but it is very real. I would have no idea how to do the job I’m in today if I hadn’t been helping other people make decisions for two decades now.
From elsewhere on the internet…
👷 Max Beauchemin (of Airflow and Superset fame) reprises his foundational posts that truly defined the field of data engineering with a recent update. I am so appreciative that this post exists in the world—Max, we needed an update, and we needed it from you! Everything in it is (IMO) dead on, even the part that throws some shade at the current dbt paradigm for expressing transformation logic:
On the other hand, and to make a comparison that most data engineers may not fully grasp, it feels like what early PHP was to web development. Early PHP would essentially be expressed as PHP code snippets living inside HTML files, and led to suboptimal patterns. It was clearly a great way to inject some more dynamisity into static web pages, but the pattern was breaking down as pages and apps became more dynamic. Similarly, in SQL+jinja, things increasingly break down as there’s more need for complex logic or more abstracted forms. It falls short on some of the promises of a richer programming model. If you take the dataframe-centric approach, you have much more “proper” objects, and programmatic abstractions and semantics around datasets, columns, and transformations.
I don’t have a strong opinion on the correct abstraction being the dataframe, but I do agree that, when you push the “mutating a big string of SQL” approach really hard, things get challenging. We’ve done the PHP thing…time to move towards React.
Just to say it again…this post is fantastic. Must-read.
🕵🏽 From Alex Viana, VP of Data @ HealthJoy:
However, it’s not just tooling that distinguishes Analytics Engineering from a traditional analyst; it’s the type of work they do, and even more than that, the mindset that these tools enable. I think this difference can be seen in the amount of time Analytics Engineers spend thinking about data modeling. Our team is constantly meeting with both engineering and business stakeholders to craft end-to-end transformations that take into account everything from the underlying architecture of the source data, to the reusability of the transformed models, to exporting the data to other operational systems (sometimes called “reverse ETL”). Accountability for the data we are producing and how it’s used is at the center of everything we do.
Yes!! Analytics engineering is about shifting the locus of attention from individual question/answer/decide cycles to creating systems that enable more humans to go through this cycle faster and with higher decision quality.
😬 Data Exchange for Redshift just launched. Think of it just like Snowflake’s Data Marketplace, but for Redshift, and…launched multiple years later. Shade aside, this will be useful for Redshift customers.
🔥 Li Haoyi reflects on four years as a staff software engineer at Databricks. While this isn’t really a post about data, I found it fascinating because of how it gives outside viewers a sense of what it feels like to work at one of the epic data companies of our age:
2017 Databricks had a ton of fires. Internal systems that were just broken, tons of missing processes, huge gaps of missing expertise as the business grew fast and engineering struggled to keep the pace. Code quality was poor, system architecture was nonsensical, infrastructure constantly falling down. It was clear even after a casual conversation that everyone was in way over their heads. I found out soon after joining that the team I had joined had just imploded from 3-4 people down a single individual as the others quit.
Don’t worry, it seems like things have gotten a lot better since then :P Fascinating read just for general ecosystem awareness.
🚀 Also in Databricks news, a decisive new benchmark:
These results were corroborated by research from Barcelona Supercomputing Center, which frequently runs TPC-DS on popular data warehouses. Their latest research benchmarked Databricks and Snowflake, and found that Databricks was 2.7x faster and 12x better in terms of price performance. This result validated the thesis that data warehouses such as Snowflake become prohibitively expensive as data size increases in production.
I don’t want to claim any specific expertise here, but do want to indicate that this does check all my boxes for “appears-to-be-credible.” This article actually sparked a great conversation in our internal dbt Labs Slack about where price-to-performance ratio sits on the list of buying criteria. So I ask: how much of a price-to-performance increase would your company need to see in order to justify replatforming?
📝 Luke Singham at Monzo writes about their data stack. The most interesting thing to me is the `indirect_ref()` function they implemented on their fork of dbt. Very, very cool. Contracts between models are an area that I’ve been interested in for a while.
⁉️ Did you know that `create assertion` is a part of the SQL spec, and has been since 1992?? I did not, but Nicholas Chammas did. Unfortunately, not a single database supports it.
The most fascinating part of this article is the thinking on incrementally calculating assertions. It actually gets to be quite pricey to run dbt tests on large datasets, but effective incremental logic should make it doable. Food for thought…
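For a feel of what “incremental” could mean here, below is a minimal sketch using Python’s built-in `sqlite3`. It is not how the article or dbt does it. The `orders` table, the rule, and the `check_new_rows` helper are all made up for illustration, and the approach assumes an append-only table with a rule that can be checked one row at a time.

```python
import sqlite3

# In standard SQL the rule might be declared roughly like this (shown only as
# a comment, since databases do not support it):
#   CREATE ASSERTION no_negative_amounts
#     CHECK (NOT EXISTS (SELECT * FROM orders WHERE amount < 0));

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, 25.5), (3, 7.0)],
)


def check_new_rows(conn, last_checked_id):
    """Check only rows added since the last run.

    Returns (ids violating the rule, new high-water mark).
    Assumes ids only ever increase and existing rows are never updated.
    """
    violations = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM orders WHERE id > ? AND amount < 0 ORDER BY id",
            (last_checked_id,),
        )
    ]
    (new_watermark,) = conn.execute(
        "SELECT COALESCE(MAX(id), ?) FROM orders WHERE id > ?",
        (last_checked_id, last_checked_id),
    ).fetchone()
    return violations, new_watermark


# First run scans everything (watermark starts at 0).
violations, watermark = check_new_rows(conn, 0)

# New data arrives; the second run only looks at ids above the watermark.
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(4, -3.0), (5, 12.0)],
)
new_violations, new_watermark = check_new_rows(conn, watermark)
print(new_violations, new_watermark)
```

The catch is in the assumptions: once rows can be updated or deleted, or the rule spans multiple rows or tables (uniqueness, referential integrity), a simple high-water mark is no longer enough.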