Recommended Treatment
The effect of a treatment on an outcome depends on three things: the kind of treatment, the kind of person receiving it, and the context in which it is delivered. A ranking change and a button color change work through different mechanisms. A heavy user and a new user will not respond in the same way. And the same treatment can matter differently depending on timing, platform state, or what else is happening at the same moment.
Call these three inputs $x_{\text{trt}}$, $x_{\text{person}}$, and $x_{\text{context}}$. Then the treatment effect is
$$ \delta(x_{\text{trt}}, x_{\text{person}}, x_{\text{context}}). $$
The average treatment effect, subgroup-specific effects, and the eventual decision about whether a treatment is worth implementing are all derived from this more basic function.
This makes experimentation look a lot like recommendation. In a recommender system, the core question is how much person $j$ will like item $i$ in context $c$. Here, the question is how much person $j$ will respond to treatment $i$ in context $c$. Replace item with treatment and rating with effect, and the structure is the same. The experimentation program is trying to predict a treatment-effect surface and act on it.
That prediction enters twice.
Before the experiment, the program has a backlog of candidate treatments and a traffic budget. It has to decide which experiments to run, and sometimes on whom. At that point it only has $(x_{\text{trt}}, x_{\text{person}}, x_{\text{context}})$, so the relevant object is the prior predictive distribution of $\delta$.
After the experiment, the program observes a noisy estimate $\hat{\delta}$ with standard error $\sigma$. The relevant object is now the posterior distribution of $\delta$ given the prior and the data. This is where shrinkage enters: the posterior mean pulls $\hat{\delta}$ toward the model's pre-test prediction. Regression adjustment for variance reduction matters because it reduces $\sigma^2$, making the data more informative relative to the prior. The final decision then follows from the posterior: implement the treatment when the posterior expected value clears the relevant threshold.
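The posterior update described above is, in the simplest conjugate case, a precision-weighted average of the prior prediction and the noisy estimate. A minimal sketch, assuming a normal prior and normal sampling noise (the function name and the illustrative numbers are invented):

```python
import numpy as np

def posterior_effect(delta_hat, sigma, prior_mean, prior_sd):
    """Normal-normal conjugate update: shrink the noisy estimate
    delta_hat (with standard error sigma) toward the prior mean."""
    prior_var, data_var = prior_sd**2, sigma**2
    # Weight on the data grows as sigma^2 shrinks, which is why
    # variance reduction makes the data more informative.
    w = prior_var / (prior_var + data_var)
    post_mean = w * delta_hat + (1 - w) * prior_mean
    post_sd = np.sqrt(prior_var * data_var / (prior_var + data_var))
    return post_mean, post_sd

# Hypothetical experiment: raw lift 2.0 with SE 1.0, prior centered at 0.5.
mean, sd = posterior_effect(2.0, 1.0, prior_mean=0.5, prior_sd=1.0)
# Decision rule: implement when the posterior mean clears the threshold.
implement = mean > 0.0
```

With equal prior and sampling variances, the posterior mean sits halfway between the raw estimate and the prior prediction, which is the shrinkage the text describes.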
How well the program performs depends on how much of $\delta(x_{\text{trt}}, x_{\text{person}}, x_{\text{context}})$ the model captures. Existing approaches differ mainly in which parts of this object they model, usually for sensible reasons given the data they assume.
Start with the weakest common setting: experiment-level summaries only. You observe point estimates and standard errors, but no metadata on treatments, people, or context. Classical A/B testing analyzes each experiment in isolation. The estimand is the average treatment effect, tested against zero, with no attempt to borrow strength across experiments.
Once a program accumulates many such summaries, it can do better. Azevedo, Deng, Montiel Olea, and Weyl (2020) fit a common distribution $F$ over experiment-level average treatment effects and shrink noisy estimates toward the population mean. This is the James-Stein idea in an experimentation setting. It improves mean squared error by pooling information across experiments. The exchangeability assumption is natural here: if there is no metadata that distinguishes experiments, there is nothing else to condition on.
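A simplified version of this pooling can be sketched with a normal effect distribution fit by method of moments, rather than the full $F$ the paper estimates (function name and numbers are invented for illustration):

```python
import numpy as np

def fit_and_shrink(delta_hats, sigmas):
    """Empirical-Bayes shrinkage toward a common effect distribution
    N(mu, tau^2), estimated from the portfolio of experiment summaries.
    A moment-based stand-in for fitting a full distribution F."""
    delta_hats, sigmas = np.asarray(delta_hats), np.asarray(sigmas)
    mu = delta_hats.mean()
    # Observed variance of estimates = true effect variance + sampling noise,
    # so subtract the average noise variance (floored at zero).
    tau2 = max(delta_hats.var() - (sigmas**2).mean(), 0.0)
    w = tau2 / (tau2 + sigmas**2)  # per-experiment shrinkage weight
    return mu + w * (delta_hats - mu)

# Four hypothetical experiment summaries, all with SE 1.0.
shrunk = fit_and_shrink([2.0, 0.0, -2.0, 4.0], [1.0, 1.0, 1.0, 1.0])
```

Every estimate is pulled toward the portfolio mean, with noisier estimates (relative to the fitted effect dispersion) pulled harder.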
Now add treatment metadata: surface, mechanism, team, size of change, and related features. Most experimentation programs already log this information. They just do not usually use it for prediction. Once treatment metadata enters, full exchangeability becomes much less plausible. A ranking change and a button color change should not be treated as draws from the same effect distribution.
A natural next step is a mixture model with latent treatment types:
$$ \delta_i \mid z_i = k \sim F_k(\mu_k, \tau_k^2), \qquad z_i \mid x_{\text{trt},i} \sim \text{Categorical}(\pi(x_{\text{trt},i})). $$
The posterior mean for experiment $i$ becomes
$$ E[\delta_i \mid \hat\delta_i, x_{\text{trt},i}] = \sum_k w_{ik} \, E[\delta_i \mid \hat\delta_i, z_i = k], $$
where $w_{ik}$ is the posterior probability that experiment $i$ belongs to type $k$. Each type has its own center and dispersion, so shrinkage becomes type-specific. Experiments in a tight, predictable class get shrunk more aggressively. Experiments in a diffuse class keep more of their raw estimate. The fully pooled model is the special case $K = 1$.
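The type-weighted posterior mean can be computed directly, assuming normal components so each within-type update is conjugate (the function name is invented; a real implementation would learn $\pi(x_{\text{trt}})$ from metadata rather than take it as an argument):

```python
import numpy as np
from scipy.stats import norm

def mixture_posterior_mean(delta_hat, sigma, pi, mus, taus):
    """Posterior mean of delta under a K-component normal mixture prior.
    pi: prior type probabilities pi_k(x_trt); mus, taus: per-type
    center and dispersion."""
    pi, mus, taus = map(np.asarray, (pi, mus, taus))
    # Marginal likelihood of delta_hat under each type: effect dispersion
    # and sampling noise add in variance.
    marg = norm.pdf(delta_hat, loc=mus, scale=np.sqrt(taus**2 + sigma**2))
    w = pi * marg / np.sum(pi * marg)  # posterior type weights w_ik
    # Within-type normal-normal posterior means: tighter types (small tau)
    # shrink the raw estimate more aggressively toward their center.
    shrink = taus**2 / (taus**2 + sigma**2)
    cond_means = mus + shrink * (delta_hat - mus)
    return float(np.dot(w, cond_means))

# K = 1 recovers ordinary fully pooled shrinkage.
pooled = mixture_posterior_mean(2.0, 1.0, pi=[1.0], mus=[0.5], taus=[1.0])
# Two hypothetical types: a tight "color tweak" class and a diffuse "ranking" class.
mixed = mixture_posterior_mean(2.0, 1.0, pi=[0.5, 0.5], mus=[0.0, 1.0], taus=[0.1, 2.0])
```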
This richer prior also improves pre-experiment allocation. A ranking change and a color tweak now have different predictive distributions before any data comes in — different expected effects, different uncertainty — so the program can distinguish which candidates are more promising and which are more informative to test. The mixture also provides a full predictive distribution, not just a point estimate, which matters when decisions involve asymmetric losses or option value.
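Before any data arrives, the mixture's prior predictive for a candidate follows from the law of total variance. A small sketch under the same invented normal-component assumption:

```python
import numpy as np

def prior_predictive(pi, mus, taus):
    """Prior predictive mean and variance of delta under the mixture,
    given only the type probabilities pi(x_trt) implied by metadata.
    Used to rank candidate experiments before any data is collected."""
    pi, mus, taus = map(np.asarray, (pi, mus, taus))
    mean = float(np.dot(pi, mus))
    # Law of total variance: average within-type spread
    # plus between-type spread of the centers.
    var = float(np.dot(pi, taus**2) + np.dot(pi, (mus - mean)**2))
    return mean, var

# Hypothetical candidate split evenly between two types.
m, v = prior_predictive(pi=[0.5, 0.5], mus=[0.0, 2.0], taus=[1.0, 1.0])
```

Two candidates with the same predictive mean can differ in predictive variance, and that difference is exactly what separates "promising" from "informative to test."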
Now move to the richest setting: person-level outcomes observed across many experiments. At that point the problem changes qualitatively. The target is no longer an experiment-level effect. It is the full surface
$$ \delta(x_{\text{trt}}, x_{\text{person}}, x_{\text{context}}) $$
for treatment-person-context triples. And the allocation question is no longer just which experiments to run. It is which treatment to show to which person.
The heterogeneous-treatment-effects literature already operates on the person dimension. Methods such as causal forests, doubly robust learners, and large-scale personalization systems estimate how treatment effects vary with person-level covariates, then route people toward the arm with the highest predicted gain. This is the right formulation for the data they typically have: rich person-level outcomes within a single experiment.
But they usually do not share information across experiments in a structured way. Each experiment gets modeled largely on its own. That too is understandable. If the system does not have a usable representation of treatment features across a large corpus of experiments, there is little basis for transfer.
The more ambitious version would model treatments and people jointly across the entire portfolio. The system would take a backlog of candidate experiments and a population of users, predict person-treatment effects jointly, and allocate users across experiments rather than only across arms within a single experiment. In that world, the program is no longer evaluating one idea at a time on random slices of traffic. It is continuously recommending treatments to people under uncertainty, while learning from the results.
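A toy sketch of that portfolio view, under invented assumptions: each candidate experiment has a predictive mean and uncertainty for its effect, and the traffic budget is split by an exploration-adjusted score (a UCB-style bonus stands in for a proper value-of-information calculation):

```python
import numpy as np

def allocate(pred_means, pred_sds, n_users, explore=1.0):
    """Toy traffic allocation across a backlog of candidate experiments.
    Scores each candidate by predicted effect plus an exploration bonus,
    then splits the user budget proportionally to the scores.
    pred_means/pred_sds come from the prior predictive of delta."""
    scores = np.asarray(pred_means) + explore * np.asarray(pred_sds)
    scores = np.clip(scores, 1e-9, None)  # keep shares nonnegative
    share = scores / scores.sum()
    return np.floor(share * n_users).astype(int)

# Two hypothetical candidates with equal means and equal uncertainty
# split the budget evenly.
plan = allocate([1.0, 1.0], [1.0, 1.0], n_users=100)
```

The real system described in the text would score person-treatment pairs rather than whole experiments, but the structure is the same: predict, score, allocate, then update the model from what comes back.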