A Program of Experiments With Risk, Context, and Organizational Constraints
An experimentation program is a portfolio problem: at each time $t$, given what we currently know and today's constraints (backlog, traffic budget, test and implementation costs), we must choose a slate of ideas to test, decide how much traffic to allocate, and determine which experiments to continue, stop, or ship, with the aim of raising a chosen business metric over a horizon, net of costs. Crucially, a portfolio only makes sense relative to a risk preference: maximizing expected gain while tolerating a meaningful chance of ruin is rarely acceptable in practice.
Formally, if the policy $\pi$ chooses slates and allocations each period, the risk‑neutral objective is the expected cumulative net outcome:
$$J(\pi) = \mathbb{E}_{\pi}[G(\pi)],$$
where $G(\pi)$ is "total shipped lift minus testing and implementation costs" across the horizon under $\pi$. The per‑period traffic budget is enforced operationally (we allocate at most the available traffic each $t$).
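As a concrete, deliberately simplified illustration of the per‑period decision, the sketch below builds one period's slate greedily by expected net value per unit of traffic, subject to the traffic budget. The `Idea` fields, the numbers, and the greedy rule are illustrative assumptions rather than the policy class $\pi$ itself.

```python
from dataclasses import dataclass

@dataclass
class Idea:
    name: str
    expected_lift: float   # prior mean of shipped lift if the idea wins
    test_cost: float       # cost of running the experiment
    build_cost: float      # cost of implementing a winner
    traffic_needed: float  # fraction of the period's traffic the test consumes

def choose_slate(backlog, traffic_budget):
    """One period's slate: rank ideas by expected net value per unit of
    traffic and add them until the traffic budget is exhausted."""
    def net_value(idea):
        return idea.expected_lift - idea.test_cost - idea.build_cost

    ranked = sorted(backlog, key=lambda i: net_value(i) / i.traffic_needed,
                    reverse=True)
    slate, used = [], 0.0
    for idea in ranked:
        if net_value(idea) > 0 and used + idea.traffic_needed <= traffic_budget:
            slate.append(idea)
            used += idea.traffic_needed
    return slate

backlog = [Idea("ranking tweak", 3.0, 0.4, 1.0, 0.30),
           Idea("color change",  0.2, 0.1, 0.1, 0.10),
           Idea("pricing test",  5.0, 0.8, 2.0, 0.50)]
print([i.name for i in choose_slate(backlog, traffic_budget=0.8)])
# ['ranking tweak', 'pricing test']
```

Summing each period's realized lift net of costs over the horizon and averaging across simulated runs then gives a Monte Carlo estimate of $J(\pi)$ for this particular greedy policy.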
That baseline is expectation‑only. To make the program risk‑aware, track a buffer $W_0$ (cash/KPI headroom), define terminal headroom $W_T = W_0 + G(\pi)$, and impose a simple survival/tail requirement. Under a Normal approximation for $G(\pi)$ with mean $\mu_\pi$ and standard deviation $\sigma_\pi$, bounding the probability of ruin by $\varepsilon$ reduces to:
$$W_0 + \mu_\pi \geq z_{1-\varepsilon}\sigma_\pi,$$
where:
- $W_0$: starting buffer (how much pain you can absorb),
- $\mu_\pi$: expected net gain over the horizon under policy $\pi$,
- $\sigma_\pi$: standard deviation of that gain (portfolio volatility),
- $z_{1-\varepsilon}$: the standard Normal quantile. If Normal tails are suspect, an assumption‑light alternative is to cap CVaR, the expected loss in the worst $\varepsilon$ fraction of outcomes, in place of the Normal‑based rule (a numeric check of both rules follows this list).
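Both rules are easy to check once the portfolio can be simulated. Below is a minimal sketch, assuming Monte Carlo draws of $G(\pi)$ are available and capping CVaR at zero terminal headroom; the distribution, buffer, and $\varepsilon$ are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def passes_tail_checks(g_draws, w0, eps=0.05):
    """Tail checks on Monte Carlo draws of the horizon net outcome G(pi).

    Normal rule: W0 + mu >= z_{1-eps} * sigma, i.e. P(ruin) <= eps under a
    Normal approximation. CVaR rule: mean terminal headroom over the worst
    eps fraction of draws must stay non-negative (no Normality assumed).
    """
    mu, sigma = g_draws.mean(), g_draws.std(ddof=1)
    normal_ok = w0 + mu >= norm.ppf(1 - eps) * sigma

    headroom = w0 + g_draws
    k = max(1, int(np.ceil(eps * len(g_draws))))
    worst = np.sort(headroom)[:k]          # the eps-tail of outcomes
    cvar_ok = worst.mean() >= 0.0          # cap CVaR at zero headroom
    return bool(normal_ok), bool(cvar_ok)

rng = np.random.default_rng(0)
g_draws = rng.normal(loc=2.0, scale=2.0, size=100_000)  # stand-in for simulated G(pi)
print(passes_tail_checks(g_draws, w0=4.0, eps=0.05))     # (True, True)
```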
A standard modeling simplification is to assume a prior over idea effects. In its simplest form, each idea's effect is drawn from a common distribution $F$. A heavy‑tailed $F$ with modest launch costs favors small screens that cheaply find rare big wins; tightly bounded effects or high launch costs favor larger, confirmatory tests, as the simulation below illustrates. The return calculus (expected lift minus costs) is the same, but it now runs subject to the survival/tail requirement above, which prevents "great on average, dangerous in the tails" portfolios.
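The sketch below spends the same testing budget either on many cheap, noisy screens or on a few expensive, precise confirmatory tests, shipping whenever the estimate exceeds the launch cost; the priors, costs, and shipping rule are illustrative assumptions. With a heavy‑tailed $F$ and cheap launches, the screens come out well ahead; with tightly bounded effects and costly launches, the ordering flips (and both designs can come out net‑negative, a hint that such a family deserves little testing budget at all).

```python
import numpy as np

rng = np.random.default_rng(1)

def net_per_idea(effects, noise_sd, test_cost, launch_cost):
    """Expected net value of testing one idea drawn from the effect prior:
    observe effect + noise, ship only if the estimate exceeds the launch
    cost, then collect true lift minus launch and test costs."""
    estimates = effects + rng.normal(0.0, noise_sd, size=effects.shape)
    ship = estimates > launch_cost
    return (effects * ship - launch_cost * ship).mean() - test_cost

def program_value(effects, budget, test_cost, noise_sd, launch_cost):
    """Spend the whole testing budget on identical tests of this design."""
    n_tests = budget / test_cost
    return n_tests * net_per_idea(effects, noise_sd, test_cost, launch_cost)

n = 500_000
heavy   = 0.5 * rng.standard_t(df=3, size=n)                 # rare large wins and losses
bounded = np.clip(rng.normal(0.0, 0.3, size=n), -0.5, 0.5)   # tightly bounded effects

for name, prior, launch in [("heavy-tailed, cheap launch", heavy,   0.05),
                            ("bounded, costly launch",     bounded, 0.30)]:
    screens = program_value(prior, budget=10.0, test_cost=0.1,
                            noise_sd=1.0, launch_cost=launch)   # many small screens
    confirm = program_value(prior, budget=10.0, test_cost=1.0,
                            noise_sd=0.1, launch_cost=launch)   # few precise tests
    print(f"{name:27s}  screens: {screens:7.2f}   confirmatory: {confirm:7.2f}")
```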
Three concrete additions make the stylized model match day‑to‑day practice:
- Near‑term performance floor. Programs are asked to "show results" on a cadence. Treat that as a constraint on period‑$t$ net contribution, or equivalently, reserve some traffic for exploitation. A light formalism is:
$$\mathbb{E}_{\pi}[R_t] \geq \rho_t\quad\text{for each }t,$$
with $R_t$ the period's net contribution and $\rho_t$ the floor. With this in place, "when to stop exploring" becomes a budget question: explore while its marginal value exceeds the opportunity cost implied by $\rho_t$ (see the allocation sketch after this list).
- Contextual, time‑adaptive priors (with family‑level bounds). Ideas arrive with metadata—surface, mechanism family, novelty, seasonality, build cost, team, interference risk—and domain theory often bounds some families tightly (e.g., color tweaks) while others plausibly allow larger moves (e.g., ranking/pricing). Move from a static $F$ to a conditional, time‑adaptive $F(\Delta \mid x, t)$: shrink noisy estimates toward context‑specific means, allow heavier tails where diagnostics call for them, and state explicit bounds where appropriate (a shrinkage sketch follows the list).
- Endogenous candidate generation. Publishing priors and scoring rules changes what gets proposed next. Treat proposal mix as drifting and keep a small, explicit scouting budget for under‑represented families so the backlog doesn't collapse to yesterday's winners.
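A back‑of‑the‑envelope version of the performance floor: if exploit traffic returns $r_{\text{exploit}}$ per unit this period and exploration returns only $r_{\text{explore}}$ in the short run, the largest exploration share $e$ consistent with the floor solves $(1-e)\,r_{\text{exploit}} + e\,r_{\text{explore}} \ge \rho_t$. The sketch below is that one‑line calculation; the rates and the floor are hypothetical numbers.

```python
def max_explore_fraction(r_exploit, r_explore, rho_t):
    """Largest exploration share e with (1-e)*r_exploit + e*r_explore >= rho_t.
    If exploring is at least as good as exploiting this period, explore freely."""
    if r_exploit <= r_explore:
        return 1.0
    e = (r_exploit - rho_t) / (r_exploit - r_explore)
    return min(max(e, 0.0), 1.0)

# Exploit traffic earns 1.0 per period, exploration 0.5 in the short run,
# and the program must show at least 0.75 this period: half the traffic can explore.
print(max_explore_fraction(r_exploit=1.0, r_explore=0.5, rho_t=0.75))  # 0.5
```

When the floor binds, it also prices exploration: each extra unit of exploration costs $r_{\text{exploit}} - r_{\text{explore}}$ of this period's contribution, which is the opportunity cost referred to above.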
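And a minimal sketch of the shrinkage step behind the contextual prior, assuming a conjugate Normal–Normal model; the family means and spreads are hypothetical, and a production version would estimate them hierarchically and allow heavier‑tailed families where diagnostics call for it.

```python
import math

def shrink_to_family(estimate, se, family_mean, family_sd):
    """Precision-weighted (Normal-Normal) shrinkage of a noisy experiment
    estimate toward its context/family prior mean."""
    w = family_sd**2 / (family_sd**2 + se**2)     # weight on the raw estimate
    post_mean = w * estimate + (1 - w) * family_mean
    post_sd = math.sqrt(w) * se                   # = sqrt(1 / (1/se^2 + 1/family_sd^2))
    return post_mean, post_sd

# The same raw result (+2.0% lift, standard error 1.5%) arriving in two families:
print(shrink_to_family(2.0, 1.5, family_mean=0.1, family_sd=0.3))  # tight family (e.g., color tweaks)
print(shrink_to_family(2.0, 1.5, family_mean=0.5, family_sd=2.0))  # loose family (e.g., ranking)
```

The same +2.0% estimate shrinks to roughly +0.2% in the tightly bounded family but stays near +1.5% in the loose one, which is the family‑level behavior described above.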