Down With Experimentation Maximalism!
Experimentation is a tool—one that I think our community overuses.
This isn’t some kind of screed attempting to get you to give up your t-tests or your feature flags. Nor is it a suggestion that experimentation programs aren’t valuable in the right context and with the right expectations.
But many of the most technically adept data professionals in the world have spent their careers inside Big Tech, and a tremendous amount of the work they’ve done has been focused on experimentation. And these practices, this mindset, and the associated tooling bleed out from FAMGA companies into the data industry as a whole.
We don’t often even perceive this process as it is happening—by now it is simply one of the natural forces shaping our ecosystem. You don’t wake up thinking about plate tectonics; similarly, you probably don’t spend a lot of time thinking about how much of our modern data tooling and mindset was originally born of Big Tech.
In order to better “see” the mindset that I’m talking about, I want to name it. And the best name I could come up with is Experimentation Maximalism (EM). For the purposes of this newsletter, I’m defining experimentation maximalism as:
the belief that all changes to the product experience must demonstrate quantifiable improvement on a pre-defined metric using a pre-defined experimental method
I don’t have any particular problem with EM as a data professional. Experimentation is a tremendously enjoyable problem space to work in—in fact, it’s a problem space purpose-built for quantitative skills! In the messy world that we are trying to reason about post-hoc, causality is nearly impossible to demonstrate with rigor. But in a carefully designed experiment, we can construct a world where quantitative reasoning can be used to make statements about causal relationships. So: as a data professional, it’s very compelling to work in an environment where my skills are so directly relevant, so clearly valuable.
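Part of that appeal is how compact the core analysis is. As a rough illustration of the kind of calculation an experimentation program centers on, here is a minimal sketch: a two-sample t-test on a simulated A/B experiment, with the metric, effect size, and sample sizes all invented for the example.

```python
# Minimal illustration: because assignment to A/B is random, a simple difference-in-means
# test supports a causal read of the result. All numbers here are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control   = rng.normal(loc=10.0, scale=2.0, size=5000)  # metric under variant A
treatment = rng.normal(loc=10.1, scale=2.0, size=5000)  # metric under variant B (~1% lift)

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
print(f"lift = {treatment.mean() - control.mean():.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

The point isn’t the specific test; it’s that random assignment does the heavy lifting that lets a few lines of quantitative reasoning speak to causality.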
My problem with experimentation maximalism is, rather, as a product creator. And, while I’ve had this view for a long time, I just recently read a post by Sean Taylor that voiced it better than I ever could have. Let’s start here:
In August 2020, Garrett van Ryzin decided to leave Lyft and I took over the research team he created called Marketplace Labs. I met with him regularly in his last few days to understand his strategy for the team. I asked him why he hadn’t spent any energy on statistical problems at all and he had a great answer: “when a project we work on succeeds, we don’t need statistics to know it.”
I haven’t run experiments on the scale of the Lyft network, but my experience is identical. I might go even a bit further: many product experiences that I’ve released into the wild aren’t measurable at all by near-term metrics.
How is that possible? Why would I spend time on building a product experience that I don’t expect to impact metrics that I care about?
There are many parts of the dbt experience that illustrate this perfectly. Probably the biggest three are sources, exposures, and snapshots. Each of these parts of the product is roadmap-driven. Each was built as a first step down a path guided by a product vision, not by user research or a near-term desire to influence metrics. None of these features received widespread usage at launch, none of them has (yet) made a shred of commercial impact, and they’re still only shadows of our vision of what they will become. This is all exactly what we anticipated, and it’s 100% fine—the time we’ve invested in these areas of the product has been very well spent.
The assertion that the post quoted above—titled “Locally Optimal”—makes is that structured experimentation programs are good for exploiting current product innovations (finding a local optimum), but are not helpful for identifying new ones.
Sure, incrementally better decisions add up to a lot of value over time, but maybe we’re just stuck in a local optimum and getting many small changes right will never get us to where we want to go. A modification of the famous Henry Ford quote kind of works here: you can’t A/B test your way from selling horses to selling cars. And a corollary: if you’re testing a horse against a car, you definitely don’t need an A/B test.
This is one of my favorite paragraphs I’ve ever quoted. I can’t even begin to tell you how much it resonates with me, or how artfully memorable I find the framing. It’s just really hard to disagree with.
And yet, we are constantly training early-career data professionals on practices that emanate from Big Tech and promote experimentation maximalism, without recognizing that those practices come from a completely different context. Which is to say: many of these early-career folks are working at companies that haven’t yet built their money-printing machines. They are pre-product-market-fit. They don’t need to find a local optimum; they need to help create some global improvement that’s worth optimizing in the first place.
My last company, RJMetrics, was a perfect example of this. We lost the race to become the dominant BI tool of the 2010s because we lacked PMF in a post-Redshift world. Yet we spent so much time on experimentation along the way. Looking back, none of that experimentation work improved our terminal enterprise value—what we should’ve been doing instead was diagnosing our lack of PMF and doing the soul-searching (and analysis) that ultimately led to the launch of Stitch. If we had gotten to that decision a year earlier, it could’ve led to a multi-billion-dollar differential in outcomes.
Sean suggests that this problem—the diagnosing-causes-of-failure problem—is fundamentally harder than A/B testing. It forces us to…
(…) imagine a cause that could possibly generate the effect that we want. I tend to think of this as a high-dimensional search problem, very similar to what folks doing drug discovery are trying to solve – there are so many possible chemicals that we can synthesize, which ones are likely to be good treatments?
The post also hypothesizes that there is, in fact, more information created in the process:
At best, a successfully analyzed A/B test may reveal 1-bit of information, when we go from 50/50 on which variant is better to being sure that one of them is. In the root cause analysis, we may consider dozens of potential explanations for a problem we observe, so if we do successfully debug we’re creating a few more bits than an A/B test.
I don’t have strong instincts on those two statements, but I find them interesting and worthy of further consideration.
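For a rough sense of the arithmetic behind the bits claim (the numbers here are mine, not Sean’s): resolving a 50/50 question yields one bit, while identifying the true explanation out of, say, 24 equally plausible candidates yields several.

```python
# Back-of-envelope version of the "bits of information" framing. The count of 24
# candidate explanations is an arbitrary stand-in for "dozens."
import math

ab_test_bits = math.log2(2)       # 50/50 on two variants, resolved -> 1.0 bit
root_cause_bits = math.log2(24)   # one true cause out of 24 equally likely candidates -> ~4.6 bits

print(f"A/B test: {ab_test_bits:.1f} bit, root-cause analysis: {root_cause_bits:.1f} bits")
```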
If there is a single takeaway from this entire train of thought, it’s this: understand the context that your organization is operating in.
If the organization you’re working in is pre-PMF, you should be spending a lot of time generating hypotheses about what’s not yet working. This will likely look more like descriptive statistics, a lot of non-linear thinking, and collaborative problem-solving. You should resist the urge to rely on structured experimentation and instead be searching for insights that will unlock fundamentally new experiences for your customers.
If you’re working in a post-PMF org, expect to spend more time inside a structured experimentation program that is designed to optimize a process that is already working. There will be more guardrails, but you’ll also likely get to use more sophisticated methods.
These two modes of supporting an organization are extremely different. They involve different analytical methods, different workflows, different patterns of thought, and different collaborative styles. When you’re interviewing for jobs, you should know which one you’re getting into.
As always, oversimplification is both useful and terrible. The above mental model is, of course, an oversimplification, and I’m sure you have already poked (or could poke) holes in it. I’m not suggesting that there are zero valid use cases for experimentation in a pre-PMF organization. However, I do think that experimentation has become a shibboleth within the data community, as if a structured experimentation program is always the gold standard for how innovation is done.
Sometimes I think that this has more to do with the fact that experimentation is a safe way for nuanced thinkers (like data people!) to make decisions. It relies on a process and on data to make what otherwise would have been a decision made on insight and instinct. It is hard to make a truly bad decision using experimentation, and all outcomes from a well-run experiment have at least the patina of reasonableness. This is very good for the careers of decision makers! Just as no one got fired for buying IBM, no one gets fired for running an experiment.
But fundamentally new, fundamentally great products don’t require experiments to demonstrate their greatness. And the insights from which they stem don’t arise from a process of iteratively optimizing a particular KPI. Let’s not kid ourselves that they do.
Elsewhere on the internet…
brought to you this week by Erica Louie
Building more effective data teams using the JTBD framework by Emilie Schario
Coming off of Snowflake Summit, a conference with two large rooms full of enterprise vendors pitching upgraded tooling, new features, or their entry into a new piece of the Modern Data Stack, I was craving conversations around what Emilie covers in this article. As data teams rapidly mature and grow, our data problems grow in parallel and become increasingly complex. These are often solved via internal tools that become external SaaS products or new players in the data market. However, as we get caught up in the newest tools, we forget the purpose of a data team: to drive business growth through fast and reliable insights.
While tooling allows us to make these decisions faster, we shouldn’t over-index on technology. Rather, Emilie pitches a framework for data teams to follow in this article: the “Jobs to be done” (JTBD) framework, whose jobs include:
data activation – making operational data available to the teams that need it
metrics management – the business needs shared definitions and a baseline of key metrics
proactive insight discovery – team members outside the data team are limited in the questions they can ask by their knowledge of what data exists and what it can answer
driving experimentation – creating measurable impact for the business through A/B experiments that move key business metrics in the right direction
interfacing with the data – empowering team members across the business with the information and conclusions they need to be unblocked
Emilie argues that every stakeholder question will fall under a job category in the JTBD framework, which should map directly to some business impact. I love this model because conversations and playbooks around how a company used XYZ combination of tools to improve their data product experience won’t apply to everyone. But conversations around business problems, the thinking process and solutions, and ultimately the impact on the business tell a much more interesting and more universally relatable story.
Why Are We Still Struggling To Answer How Many Active Customers We Have? by SeattleDataGuy
This article touches on something that hits home for our internal data team – it’s hard to answer some of our most basic questions about the business. Similar to SeattleDataGuy’s reasoning, a majority of this is because the underlying components that comprise (or help define) these metrics are themselves constantly shifting. The catalysts of these shifts range from institutional knowledge of the data or processes leaving with turnover (or staying undocumented in people’s heads) to platform migrations and changeovers. He mostly discusses software-developer-specific reasons, migrations (e.g. ERPs and CRMs), and an overall lack of understanding of metrics and their definition process. One item I think he overlooks: sometimes metric definitions just change as you improve your understanding of the data, refine the questions you should really be asking, and/or expand your data tracking.
Randy Au also comments on SeattleDataGuy’s post in We take our units of analysis for granted for this exact reason, and I wholeheartedly agree! To quote Randy:
We don't know what we're doing, we might be asking the wrong research question, our unit of analysis is likely wrong in some way.
In the beginning, all business units feel intuitive and simple. Then they suddenly aren’t.
For example, in the beginning you have companies, and maybe one company is one singular Salesforce Account. But then a year later, your Sales team lands a deal with a company that has various subsidiaries. Then you’re introduced to organizations vs. companies. So when someone asks, “What is the average number of Cloud seats per company?”, the answer shifts from a simple number to a follow-up question, “How do you define a company?”, before you can answer at all.
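To make that shift concrete, here’s a toy sketch of the same metric under the two definitions; the data, table, and column names are all made up:

```python
# Toy example: the same "average Cloud seats per company" question under two unit
# definitions. Data and column names are invented for illustration.
import pandas as pd

accounts = pd.DataFrame({
    "account_id":    ["a1", "a2", "a3"],
    "parent_org_id": ["org1", "org1", "org2"],  # a1 and a2 are subsidiaries of one org
    "seats":         [10, 40, 25],
})

# Definition 1: a "company" is a single Salesforce Account
seats_per_account = accounts["seats"].mean()                              # 25.0

# Definition 2: a "company" is the parent organization
seats_per_org = accounts.groupby("parent_org_id")["seats"].sum().mean()   # 37.5

print(seats_per_account, seats_per_org)
```

Same data, same question, two defensible answers, which is exactly why the follow-up question matters.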
We’re in the process of defining these business units, or “entities,” on the internal data team at dbt Labs for this specific reason. And rather than starting with what we believe are the core business units, we brainstormed the most frequently asked questions about the business and the inputs to our North Star metrics.
Data Product in Changing Environments: Rethinking and Updating Investments by Eric Weber
And finally, Eric Weber takes the above conversations and zooms out a bit more to consider the data consumer when building your data product. Eric first recaps his original post on the subject, then elaborates and reflects on his points after putting them into practice. When data teams begin building data products without considering the personas they’re meant for, those products begin to fail. The foundation of his argument, along with the additional points he elaborates on, can be summed up as follows:
When building data assets, keep a specific “persona” in mind and critically identify who at the company belongs within that persona and will instinctively find value in the asset you’re building.
It’s okay to prioritize personas. I am admittedly awful at this – I want to serve everyone all at once! But that isn’t realistic. Setting clear expectations that you need to prioritize other, more high-leverage areas of the business, and specifically calling out why, is key to the success of the team and the company.
Plan for what you’d drop – this will always be a difficult conversation. So often data teams want to think at scale and continue supporting as much of the business as possible. But this comes at the cost of constantly hiring and becoming bogged down with more requests. So it’s an equally important exercise to consider what the data team would need to drop during difficult times (e.g. lack of headcount, needing to scale back, etc.). This could be a specific product or a specific persona. However, my suggestion would be to ensure the persona being “dropped” is given enough documentation and/or resources to support themselves asynchronously, so your team can continue moving forward with other, more high-leverage tasks.
Follow your company’s prioritization of an area of investment or business function, and think about how your product investment should reflect (or not) those shifts. This is absolutely true, and it’s also an easy segue into direct impact on the company. When data leaders understand exactly what’s going on in the minds of the Executive team, they can effectively support those initiatives or future strategies by ensuring there’s a focus on creating, improving, or digging into the data products that center on those areas of the business.