No Logbook, No Problem: Recovering LATE When You Can't Observe Compliance

Suppose you run a randomized experiment with imperfect compliance. You assigned treatment, measured outcomes, and now you want the LATE. The Wald estimator says divide the ITT by the compliance rate, $\pi = E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]$, and you're done. Except you can't observe $D_i$. Maybe the delivery script ran overnight and the logs got overwritten. Maybe the platform doesn't surface unit-level attribution. The ITT is there for the taking, but it's attenuated by $\pi$, and without knowing $\pi$ you can't get the LATE.

This looks like a dead end, but it isn't, as long as your experiment has a particular temporal structure. Many experiments do. The treatment is applied over a known delivery window that is separate from the outcome measurement window, and each unit has a pre-treatment history that characterizes its baseline behavior. When the treatment, successfully delivered, produces a large deterministic shift during the delivery window, compliers and never-takers look different in a way that a statistical test can pick up.

The test itself is simple. For each treated unit, compare its mean outcome during the delivery window to its pre-treatment mean, using the pre-treatment standard deviation as the reference:

$$t_i = \frac{\bar{y}_{i,\text{treat}} - \bar{y}_{i,\text{pre}}}{\hat{\sigma}_i \sqrt{1/T_{\text{pre}} + 1/T_{\text{treat}}}}$$

Classify unit $i$ as a complier ($\hat{D}_i = 1$) if the one-sided p-value of $t_i$ satisfies $p_i < \alpha$, and as a never-taker otherwise. Control units get $\hat{D}_i = 0$ by construction (one-sided noncompliance). The inferred compliance rate is $\hat{\pi} = \bar{\hat{D}}_{Z=1}$, the share of treated units that pass the test.
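As a concrete sketch, the per-unit test can be vectorized in a few lines. The function name and signature are mine, not from the post; it assumes outcomes arrive as unit-by-day matrices:

```python
import numpy as np
from scipy import stats

def classify_compliers(y_pre, y_treat, alpha=0.05):
    """Classify each treated unit as complier (1) or never-taker (0).

    y_pre:   (n_units, T_pre) pre-treatment daily outcomes
    y_treat: (n_units, T_treat) delivery-window daily outcomes
    Illustrative sketch: names and layout are assumptions, not the post's code.
    """
    T_pre, T_treat = y_pre.shape[1], y_treat.shape[1]
    sigma = y_pre.std(axis=1, ddof=1)              # pre-treatment SD as reference
    se = sigma * np.sqrt(1 / T_pre + 1 / T_treat)
    t = (y_treat.mean(axis=1) - y_pre.mean(axis=1)) / se
    p = 1 - stats.t.cdf(t, df=T_pre - 1)           # one-sided: shift is upward
    return (p < alpha).astype(int)
```

With a large mechanical dose, compliers produce enormous $t$-statistics and classification is essentially deterministic; never-takers pass at rate $\alpha$ by construction.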

The trouble with $\hat{\pi}$ is that it overestimates true compliance, because some never-takers will pass the test by chance. But the overestimation is well-characterized: the false positive rate among never-takers is $\alpha$, the test size, by construction. So the inferred compliance rate decomposes in expectation as $E[\hat{\pi}] = \pi \cdot \text{TPR} + (1-\pi) \cdot \alpha$, where TPR is the probability that a real complier passes the test. When the mechanical dose is large relative to organic noise, TPR goes to 1 and the expression collapses to $E[\hat{\pi}] = \pi + (1-\pi)\alpha$. This inverts to give a corrected compliance rate:

$$\pi_c = \frac{\hat{\pi} - \alpha}{1 - \alpha}$$

and a corrected LATE:

$$\widehat{\text{LATE}}_c = \frac{(1-\alpha) \cdot \text{ITT}}{\hat{\pi} - \alpha}$$

The correction subtracts the expected number of false positives from the inferred complier count and rescales. In simulations calibrated to a real experiment (details below), $\hat{\pi}$ comes in at 0.761, the correction brings it to 0.748, and the true $\pi$ is 0.750.
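In code, the correction and the corrected LATE are one-liners; this is a direct transcription of the two formulas above (the ITT value in the test is hypothetical):

```python
def corrected_late(itt, pi_hat, alpha=0.05):
    """Subtract the expected false-positive mass alpha from the inferred
    compliance rate pi_hat, rescale, and apply the Wald-style correction."""
    pi_c = (pi_hat - alpha) / (1 - alpha)          # corrected compliance rate
    late_c = (1 - alpha) * itt / (pi_hat - alpha)  # equivalently itt / pi_c
    return late_c, pi_c
```

Plugging in the numbers from the simulation, $\hat{\pi} = 0.761$ at $\alpha = 0.05$ gives $\pi_c \approx 0.748$, matching the figure quoted above.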

A corrected point estimate is only half the job. The corrected LATE is a ratio of two estimated quantities, the ITT and $\hat{\pi}$, and we need a standard error. The delta method applies directly. Writing $f(\text{ITT}, \hat{\pi}) = (1-\alpha) \cdot \text{ITT} / (\hat{\pi} - \alpha)$, the partial derivatives are:

$$\frac{\partial f}{\partial \text{ITT}} = \frac{1-\alpha}{\hat{\pi} - \alpha}, \qquad \frac{\partial f}{\partial \hat{\pi}} = -\frac{(1-\alpha) \cdot \text{ITT}}{(\hat{\pi} - \alpha)^2}$$

What makes this clean is that $\text{Var}(\text{ITT})$ and $\text{Var}(\hat{\pi})$ are independent: the ITT comes from post-treatment outcome data and $\hat{\pi}$ comes from delivery-window data, two non-overlapping temporal windows. The variance of the corrected LATE is just the sum of the two squared-gradient terms, each multiplied by the corresponding variance. The 95% CI is $\widehat{\text{LATE}}_c \pm 1.96 \cdot \hat{\text{se}}$.
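A minimal implementation of the delta-method standard error, assuming the two variances have been estimated separately (the function name is mine):

```python
import numpy as np

def delta_method_se(itt, var_itt, pi_hat, var_pi, alpha=0.05):
    """Delta-method SE for the corrected LATE. Because ITT and pi_hat come
    from non-overlapping temporal windows, the covariance term drops out and
    the variance is the sum of the two squared-gradient contributions."""
    d_itt = (1 - alpha) / (pi_hat - alpha)                # df/dITT
    d_pi = -(1 - alpha) * itt / (pi_hat - alpha) ** 2     # df/dpi_hat
    return np.sqrt(d_itt ** 2 * var_itt + d_pi ** 2 * var_pi)
```

Note that $\partial f / \partial \hat{\pi}$ scales with $1/(\hat{\pi} - \alpha)^2$, which is why compliance uncertainty gets amplified at low $\pi$: the same $\text{Var}(\hat{\pi})$ contributes far more to the CI width.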

This independence is worth pausing on, because it is the same temporal structure doing double duty. The three-window design, pre-treatment, delivery, post-treatment, is what makes the compliance classification possible in the first place (you need a baseline to test against and a delivery window to test on). And it is also what makes the inference tractable (the classification and the outcome use different data). The connection to the broader "test-and-select" literature is direct: Hazard and Löwe (2023) propose selecting subgroups with nonzero first stages and using cross-fitting across sample units for valid inference; Chernozhukov et al. (2018) provide the general DML framework for two-step procedures where nuisance parameters are estimated before the target. In our setting, the temporal structure of the experiment provides a stronger form of the same separation, not a random partition of units but a structural partition of the data-generating process.

So much for the theory. Does the CI actually cover?

I simulate 1,000 experiments at each of 12 parameter configurations, using a setup calibrated to a real experiment: 500 software packages (250 treated, 250 control), 30 days of pre-treatment data, a 6-day delivery window with 17 mechanical downloads per day, and a 30-day post-treatment window with a social proof effect of $\tau$ additional organic downloads per day for compliers. The true LATE is $\tau \times 30$.
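A single draw of this simulation can be sketched as follows. The mechanical dose, effect size, compliance rate, and window lengths come from the setup above; the organic baseline level and noise are my choices, since the post does not state them:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, T_pre, T_del, T_post = 500, 30, 6, 30
mech, tau, pi, alpha = 17.0, 3.0, 0.75, 0.05
base, sigma = 20.0, 3.0   # organic daily level and noise: assumed, not from the post

z = np.repeat([1, 0], n // 2)                     # 250 treated, 250 control
d = z * (rng.random(n) < pi).astype(int)          # true complier status
y_pre  = rng.normal(base, sigma, (n, T_pre))
y_del  = rng.normal(base, sigma, (n, T_del))  + d[:, None] * mech
y_post = rng.normal(base, sigma, (n, T_post)) + d[:, None] * tau

# ITT on total post-period outcomes; true LATE = tau * T_post = 90
post_tot = y_post.sum(axis=1)
itt = post_tot[z == 1].mean() - post_tot[z == 0].mean()

# classify treated units from the delivery window
se_i = y_pre.std(axis=1, ddof=1) * np.sqrt(1 / T_pre + 1 / T_del)
t_i = (y_del.mean(axis=1) - y_pre.mean(axis=1)) / se_i
p_i = 1 - stats.t.cdf(t_i, df=T_pre - 1)
pi_hat = (p_i[z == 1] < alpha).mean()

late_c = (1 - alpha) * itt / (pi_hat - alpha)     # corrected LATE, near 90
```

Repeating this 1,000 times per configuration and checking whether the delta-method CI covers 90 reproduces the exercise behind the table below.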

At the baseline ($\pi = 0.75$, $\tau = 3$, true LATE $= 90$), the corrected LATE averages 89.4, the delta method 95% CI has coverage of 0.974, and the mean CI width is 42.6. The results across scenarios:

| Scenario | LATE | Bias | Coverage | Width |
|---|---:|---:|---:|---:|
| Baseline ($\pi$=.75, $\tau$=3) | 89.4 | -0.6 | 0.974 | 42.6 |
| Low compliance ($\pi$=.50) | 89.5 | -0.5 | 0.972 | 65.9 |
| High compliance ($\pi$=.90) | 89.9 | -0.1 | 0.959 | 33.7 |
| Small effect ($\tau$=1) | 29.6 | -0.4 | 0.956 | 39.0 |
| Large effect ($\tau$=10) | 300.1 | 0.1 | 1.000 | 73.1 |
| Null effect ($\tau$=0) | -0.3 | -0.3 | 0.947 | 38.4 |
| Small sample (N=100) | 89.7 | -0.3 | 0.974 | 96.9 |
| Large sample (N=2000) | 89.7 | -0.3 | 0.969 | 21.3 |
| High noise ($\sigma$=5) | 89.6 | -0.4 | 0.969 | 39.5 |
| Weak signal (mech=5/day) | 90.0 | 0.0 | 0.967 | 42.8 |
| Short pre-period (T=7) | 89.7 | -0.3 | 0.963 | 42.6 |
| Tighter test ($\alpha$=.01) | 90.0 | 0.0 | 0.974 | 42.7 |

Coverage ranges from 0.947 to 0.974, right where a slightly conservative 95% CI should land. It holds at the null ($\tau = 0$, coverage 0.947), which matters: the CI correctly includes zero when there is no effect. It holds at $\pi = 0.50$ (coverage 0.972), though the CI is wider because compliance uncertainty enters the denominator and gets amplified. Width scales with $1/\sqrt{N}$ as expected. Bootstrap percentile CIs, computed at baseline with 300 draws per simulation, give coverage of 0.934 with a tighter width of 38.0. The slight undercoverage is the usual finite-sample behavior of the percentile bootstrap.
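The percentile bootstrap mentioned above can be sketched like this, resampling units within each arm so that both the ITT and $\hat{\pi}$ are recomputed on each draw (the function name and data layout are mine):

```python
import numpy as np

def boot_ci(post_tot, z, d_hat, alpha_test=0.05, B=300, seed=0):
    """Percentile bootstrap 95% CI for the corrected LATE.

    post_tot: (n,) total post-period outcome per unit
    z:        (n,) assignment indicator
    d_hat:    (n,) inferred compliance from the delivery-window test
    Sketch under my assumed data layout, not the post's code.
    """
    rng = np.random.default_rng(seed)
    t_idx, c_idx = np.where(z == 1)[0], np.where(z == 0)[0]
    draws = []
    for _ in range(B):
        ti = rng.choice(t_idx, size=t_idx.size, replace=True)
        ci = rng.choice(c_idx, size=c_idx.size, replace=True)
        itt = post_tot[ti].mean() - post_tot[ci].mean()
        pi_hat = d_hat[ti].mean()
        draws.append((1 - alpha_test) * itt / (pi_hat - alpha_test))
    return np.percentile(draws, [2.5, 97.5])
```

Resampling within arms keeps the treated/control split fixed, matching the experimental design; resampling the pooled sample would add design noise the experiment never had.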

The method breaks down in one specific way: when the mechanical dose is small relative to organic noise, the $t$-test has low power, compliers get misclassified as never-takers (TPR $< 1$), and the correction formula, which assumes TPR $= 1$, overshoots. In a high-noise simulation ($\sigma = 5$) with only 5 mechanical downloads per day, sensitivity drops to 0.823, and the LATE estimate is biased upward by about 20%. At 10 mechanical downloads per day with the same noise, the method recovers fully. The practical diagnostic is whether $\hat{\pi}$ is lower than you would expect from what you know about your delivery process. If it is, the signal is too weak and you should report the ITT.
