Ep 36: Minimum Viable Experimentation (w/ Sean Taylor + Vijaye Raji)
How do you build a product experimentation program that doesn't run out of gas?
Product experimentation is full of potholes for companies of any size, given the number of pieces (tooling, culture, process, persistence) that need to come together to be successful.
Sean Taylor (currently Motif Analytics, formerly Facebook + Lyft) and Vijaye Raji (currently Statsig, formerly Facebook + Microsoft) have navigated these failure modes, and are here to help you (hopefully) do the same.
This convo with Tristan + Julia is light on tooling + heavy on process: how to watch out for spillover effects in experiments, avoiding bias, how to run an experiment review, and why experiment throughput is a better indicator of success than individual experiment results.
Key points from Sean and Vijaye in this episode:
Sean, you were talking about being from social science, being in an academic profession and then moving over to Facebook and working in experimentation. I'm just curious: how many of your peers came from an actual PhD background? How many of them shared this path that you were on?
Yeah, I remember when I first arrived at Facebook to start full-time, it was right after the emotional contagion controversy. That experiment was published as a paper in a very well-regarded scientific journal, and it was a very large experiment on many Facebook users. It made their news feeds less positive, which people think of as a harm. When you run experiments, you have to consider the risk of harm as a social scientist.
At the time, we were very excited about Facebook's potential as this large laboratory for studying humans. I think that vision gradually became less realistic, because when you're running a company, the mission isn't to learn things about the world for social scientists, it's to make the company more successful and make the product safer and better for users.
Gradually I started to pivot from social science questions to business questions, which is where I've landed. We don't subject users to the risk of harm unless we're going to learn something that will make the product so much better for them that it's worth it.
But you have to admit, it was a very interesting time to be there.
Large companies like Microsoft, Facebook, and Lyft have budgets where it actually can pay off to run big experiments on large user bases. What does experimentation actually look like at these large organizations?
Yeah. What we're talking about here is product experimentation, and a lot of times when we talk to people, they bring up statistical power as one of the things they're worried about: the idea that you need large numbers of people or samples in order to run experiments successfully.

What we've generally found is that statistical power is a combination of a few different factors, and one of the biggest is your minimum detectable effect, which is how big of a change you're looking for. When you're a small company, early in your journey, you're looking for larger changes, and that actually has a bigger impact on your statistical power.
I want to take this opportunity to address that, because you don't need large sample sizes to run effective experiments on a product.

If you're looking for 0.1% or 0.01% improvements in a particular metric, you definitely need hundreds of thousands or maybe even millions of samples. But most of the companies we work with are looking for 5 or 10% wins, for which you don't need that many samples.
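The relationship between minimum detectable effect and sample size can be sketched with the standard two-proportion sample-size formula. This is a rough illustration only; the 10% baseline conversion rate and the z-values for a two-sided 0.05 significance level at 80% power are my assumptions, not numbers from the episode:

```python
import math

Z_ALPHA = 1.96  # two-sided significance level of 0.05
Z_BETA = 0.84   # 80% power

def samples_per_arm(baseline: float, relative_lift: float) -> int:
    """Approximate samples needed per arm to detect a relative lift
    in a baseline conversion rate (normal-approximation formula)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    delta = p2 - p1
    return math.ceil((Z_ALPHA + Z_BETA) ** 2 * variance / delta ** 2)

# Detecting a 10% relative lift on a 10% baseline takes thousands of
# users per arm; detecting a 0.1% relative lift takes over a hundred
# million, which is exactly the point about MDE driving power.
big_win = samples_per_arm(0.10, 0.10)
tiny_win = samples_per_arm(0.10, 0.001)
```

The key design point is that sample size scales with the inverse square of the detectable effect, so shrinking the effect you chase by 100x inflates the required sample by roughly 10,000x.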
That's an important element we've been trying to address whenever we talk to people with that concern. In terms of experimentation itself, I worked at Facebook and I've seen it firsthand: a lot of this comes down to the culture.
In fact, some of these effects are self-fulfilling, because the tools themselves shape the culture. I'll explain. When you have the right tooling, when you have the ability to safeguard the user experience, it actually empowers people, engineers, and product managers to come up with new ideas.
And these ideas don't have to be totally vetted in conference rooms. You don't have to have these debates; sometimes it's much easier to just build it, put it out there, and measure the impact of your change than to debate it. So what happens then is people are willing to be creative.
They build out their ideas, and we never block anyone from taking a very small sample size and trying it out. You roll it out very slowly and see if there's any impact on the metrics you care about. If your new change ends up affecting some core metric negatively, you shut down the experiment.
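The guardrail logic described here, roll out to a small sample and kill the experiment if a core metric regresses significantly, can be sketched with a simple two-proportion z-test. This is a hypothetical illustration of the general technique, not Statsig's or Facebook's actual implementation, and the 1.96 threshold is an assumed default:

```python
import math

def should_shut_down(ctrl_conv: int, ctrl_n: int,
                     test_conv: int, test_n: int,
                     z_crit: float = 1.96) -> bool:
    """Two-proportion z-test on a guardrail metric. Returns True when
    the treatment's conversion rate is significantly WORSE than
    control's, i.e. the experiment should be stopped."""
    p_ctrl = ctrl_conv / ctrl_n
    p_test = test_conv / test_n
    pooled = (ctrl_conv + test_conv) / (ctrl_n + test_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ctrl_n + 1 / test_n))
    z = (p_test - p_ctrl) / se
    return z < -z_crit  # only one-sided harm trips the guardrail
```

A clear regression (say 10% conversion dropping to 7% over 10,000 users per arm) trips the check, while ordinary noise around the baseline does not, which is what lets teams try small ideas without lengthy pre-approval.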
It does a few things. It makes distributed decision-making much easier: teams trust each other's experiment results because they all trust the tool and its outcomes, so decision-making happens everywhere and everybody trusts it. Secondly, it enables teams to be autonomous and move fast.
They also have a creative outlet to think about something new, try it out, and validate it, without going through a large process. One of the other things I've noticed is with data scientists: it's hard enough to get good data scientists into companies, and they don't like spending their time debugging or diagnosing past experiments. When the tooling can do most of that heavy lifting, they're focused on more creative work and forward-looking product directions. Those are some of the downstream effects of measuring every product change you make.
I think you also have to have a culture of being okay with shutting down work you've spent several months building when the numbers aren't showing the results you were hoping for. That's a big cultural acceptance from a company: “Hey, we're okay with spending many months of many engineers' time and it not paying off.” Would you agree?
Yeah. The interesting bit is that personal attachment to the code you wrote is pretty powerful. When the data tells you, it's so much easier to say, okay, I'm going to drop this line of thinking and restart with something else, versus somebody else telling you.
And I've noticed that as one of the downstream cultural effects: people don't get attached to the code or product they build and actually rely on the data, which is fascinating, which is really cool.
It is one of the dirty little secrets of experimentation that a very small fraction of experiments are successful.
You think of this platonic ideal: I tested my thing, we got this great result, and now I get credit for making this big impact. But I think it's more like one in ten, one in six experiments that are shippable according to whatever launch criteria you have. At least that's my experience at Facebook.
Which means that most engineers who spend all this time building a change end up in an experiment review meeting where they're told we're not going to ship what they did, and for a very good reason: it either made the experience worse or we didn't see benefits from it.
And that can be quite frustrating for people. This idealized attitude of “Oh, it's fine, I'm going to do what the experimentation platform says” isn't exactly true, because people get frustrated; they put in a lot of time and effort. In particular, designers I think often have good reasons for changing things that don't necessarily improve metrics, and may even make metrics worse.
But there's a philosophy to it that isn't encoded in the metrics entirely. And so you get into these debates a lot about whether the experiment truly reflects what you're really trying to do or not.
What is the minimal set of ingredients needed for a successful experiment from start to finish? What does that look like?
I think there's often a good social process to have: a pre-review where you describe to your team, and any other people affected, what you're planning to test; then you run the experiment; and then a post-review where you go over what you learned from the test and what decision is to be made.
Often the decision is to run another test. Those are meetings to have a lot of people and a lot of stakeholders in, because they want to understand what other people are doing. Lyft marketplace experiments, for example, could conflict with one another in ways that create problems. Having those meetings matters: you picture the process as fully distributed, with everybody making independent changes, but that's not really true.
Experiments can interact with one another in adverse ways. And then in the review meeting, you're trying to come to a decision consensus as an organization. That's actually a very political process, because the people who built the new changes have skin in the game.
People have their pet projects, or they're excited, or they want to continue a line of work, and that meeting can be really challenging to get through with a successful decision.
Do promo packets include successful experiments but not failed experiments? Are there behavioral rails that make people want to skew towards calling things a success?
Yeah, I think there's definitely an element of that, and we've seen it too. Look, we're all people and there's generally attachment; it's hard to be totally clinical about it. But Sean, you mentioned one in ten experiments actually succeed, and in order to find that one successful experiment, you do have to experiment a lot and try out various different things.
At the end of the day, the tool is only going to give you data and inferences; it's up to the product team to make the call on whether to ship or not. And that's based on some side effects you may not have expected, which you have to rationalize.
You want to understand what that means. Sometimes we've shipped things even when the numbers say otherwise, because we believe there's a bigger mountain on the other side. Once you cross this, there's a global maximum we're shooting for, and that's okay.
That's entirely the product team's decision. When that happens, you do need larger buy-in, and the experiment review is much more involved, with more stakeholders. The times I see things go very friction-free and smoothly are when there are no debates like that.
When small changes have a big impact, that gets entered into a promo packet and people get credit for it. But in general, my experience has been that 80% of the code, or 80% of product features, don't work right off the bat.
And then there's also an element of: someone feels so strongly about a feature that they're going to iterate on it until they exhaust all of the possibilities and then give up, or maybe they'll find the winning variant in one of those iterations. All of those outcomes are possible.
It's rarely black or white.
Yeah, I like that perspective that you always learn something from an experiment, even the failed ones. Often it's, oh, there was just a bug in the implementation and we can fix that. So an experiment that isn't a success still contributes to your company being more successful a lot of the time.
But I do think that rewarding people only for the wins, the ones with positive metric movement, is a little bit pathological. Really, what we should be rewarding is experimentation throughput. I made this argument in one of my performance reviews one time: I ran 20 experiments this half and only one of them worked, but I ran 20 experiments.
And actually, that's a sign of a good process: we were able to keep the queue of experiments filled with ideas. It means that, okay, we got a little unlucky this half, but we have a good process. Indexing too much on the outcome and not the process is one of the pathologies of looking only at metric wins, because you're learning a lot by running experiments, and you can also get on unlucky streaks where you just don't happen to have a good idea for a while, or maybe the users don't like what you're trying.
We just couldn't function as a product and engineering organization if only one in 20 experiments got through. My question for you is: what do you think is different? Is it scale, or are we a different type of software?
An extremely important point to make is that when I'm running 20 experiments a half, it's usually configuration-based experiments, not new engineering work. A lot of Facebook experiments, and people don't know this, are not about moving a button around or changing the UI.
It's actually just numbers in a configuration that determine the behavior of the system. When you have algorithmic systems like this, the question is how much weight to put on certain factors in the ranking algorithm. It's a very large design space that you can explore through experimentation, and it's basically free to generate a new experimental condition.
Because you just have to change a config file. That's a very different style of experimentation than most people are familiar with, but it's probably one of the primary ways a large algorithmic company works. Lyft and Facebook are basically algorithms at the end of the day. The UI matters and you can make improvements there, but how you rank the news feed on Facebook, or how you rank search on Google, is the product in some ways. The configuration file is the design space, not the UI. So you can run a lot of experiments very quickly in that setting.
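The config-file-as-design-space idea can be sketched in a few lines: each experiment arm is just an override on a base config, and users are deterministically bucketed by hashing. Everything here (the weight names, the arm names, the hashing scheme) is a made-up illustration, not how Facebook or Lyft actually implement it:

```python
import hashlib

# Hypothetical base config: ranking weights the system reads at runtime.
BASE_CONFIG = {"recency_weight": 1.0, "affinity_weight": 2.0}

# Each arm is just a dict of overrides, so a new experimental
# condition costs nothing beyond editing this mapping.
ARMS = {
    "control": {},
    "more_recency": {"recency_weight": 1.5},
    "more_affinity": {"affinity_weight": 3.0},
}

def assign_arm(user_id: str, experiment: str) -> str:
    """Deterministically bucket a user into an arm by hashing the
    (experiment, user) pair, so assignment is stable across requests."""
    names = sorted(ARMS)
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return names[int(digest, 16) % len(names)]

def config_for(user_id: str, experiment: str) -> dict:
    """Resolve the config a given user should see: base plus overrides."""
    cfg = dict(BASE_CONFIG)
    cfg.update(ARMS[assign_arm(user_id, experiment)])
    return cfg
```

The design choice worth noting is that the product code only ever reads the resolved config, so shipping the winning variant is just promoting its overrides into the base config, with no new engineering work.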