No Logbook, No Problem: Recovering LATE When You Can't Observe Compliance
Suppose you run a randomized experiment with imperfect compliance. You assigned treatment, measured outcomes, and now you want the LATE. The Wald estimator says divide the ITT by the compliance rate, $\pi = E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]$, and you're done. Except you can't observe $D_i$. Maybe the delivery script ran overnight and the logs got overwritten. Maybe the platform doesn't surface unit-level attribution. The ITT is there for the taking, but it's attenuated by $\pi$, and without knowing $\pi$ you can't get the LATE.
This looks like a dead end, but it isn't, as long as your experiment has a particular temporal structure. Many experiments do. The treatment is applied over a known delivery window that is separate from the outcome measurement window, and each unit has a pre-treatment history that characterizes its baseline behavior. When the treatment, successfully delivered, produces a large deterministic shift during the delivery window, compliers and never-takers look different in a way that a statistical test can pick up.
The test itself is simple. For each treated unit, compare its mean outcome during the delivery window to its pre-treatment mean, using the pre-treatment standard deviation as the reference:
$$t_i = \frac{\bar{y}_{i,\text{treat}} - \bar{y}_{i,\text{pre}}}{\hat{\sigma}_i \sqrt{1/T_{\text{pre}} + 1/T_{\text{treat}}}}$$
Classify unit $i$ as a complier ($\hat{D}_i = 1$) if the one-sided $p$-value of $t_i$ satisfies $p_i < \alpha$, and as a never-taker otherwise. Control units get $\hat{D}_i = 0$ by construction (one-sided noncompliance). The inferred compliance rate is $\hat{\pi} = \bar{\hat{D}}_{Z=1}$, the share of treated units that pass the test.
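In code, the per-unit test is a few lines. A minimal sketch, assuming daily download counts as the outcome; the function name is mine, and I use a normal approximation for the one-sided $p$-value (a $t$ reference with $T_{\text{pre}} - 1$ degrees of freedom would be slightly more exact):

```python
import math
import numpy as np

def classify_complier(y_pre, y_treat, alpha=0.05):
    """Per-unit compliance test: did the delivery-window mean shift up
    relative to the pre-treatment baseline? Uses the pre-treatment sd as
    the reference and a normal approximation for the one-sided p-value."""
    t_pre, t_treat = len(y_pre), len(y_treat)
    sigma = np.std(y_pre, ddof=1)                      # pre-treatment sd
    t_i = (np.mean(y_treat) - np.mean(y_pre)) / (
        sigma * math.sqrt(1 / t_pre + 1 / t_treat))
    p_i = 0.5 * math.erfc(t_i / math.sqrt(2))          # one-sided p-value
    return int(p_i < alpha)                            # 1 = inferred complier
```

A unit getting a 17-download-per-day mechanical dose on top of single-digit organic noise produces a $t_i$ in the double digits, so the classification is essentially deterministic for true compliers.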
The trouble with $\hat{\pi}$ is that it overestimates true compliance, because some never-takers will pass the test by chance. But the overestimation is well characterized: the false positive rate among never-takers is $\alpha$, the test size, by construction. So the expected inferred compliance rate decomposes as $E[\hat{\pi}] = \pi \cdot \text{TPR} + (1-\pi) \cdot \alpha$, where TPR is the probability that a true complier passes the test. When the mechanical dose is large relative to organic noise, TPR goes to 1 and the expression collapses to $E[\hat{\pi}] = \pi + (1-\pi)\alpha$. Inverting gives a corrected compliance rate:
$$\pi_c = \frac{\hat{\pi} - \alpha}{1 - \alpha}$$
and a corrected LATE:
$$\widehat{\text{LATE}}_c = \frac{(1-\alpha) \cdot \text{ITT}}{\hat{\pi} - \alpha}$$
The correction amounts to subtracting the expected number of false positives from the inferred complier count and rescaling. In simulations calibrated to a real experiment (details below), $\hat{\pi}$ comes in at 0.761, the correction brings it to 0.748, and the true $\pi$ is 0.750.
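Both corrections are one-liners. A sketch using the simulated numbers above ($\hat{\pi} = 0.761$, $\alpha = 0.05$); the function names are illustrative:

```python
def correct_compliance(pi_hat, alpha=0.05):
    """Invert E[pi_hat] = pi + (1 - pi) * alpha for pi."""
    return (pi_hat - alpha) / (1 - alpha)

def corrected_late(itt, pi_hat, alpha=0.05):
    """Wald estimator with the false-positive-corrected first stage."""
    return itt / correct_compliance(pi_hat, alpha)

print(round(correct_compliance(0.761), 3))   # 0.748, against a true pi of 0.750
```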
A corrected point estimate is only half the job. The corrected LATE is a ratio of two estimated quantities, the ITT and $\hat{\pi}$, and we need a standard error. The delta method applies directly. Writing $f(\text{ITT}, \hat{\pi}) = (1-\alpha) \cdot \text{ITT} / (\hat{\pi} - \alpha)$, the partial derivatives are:
$$\frac{\partial f}{\partial \text{ITT}} = \frac{1-\alpha}{\hat{\pi} - \alpha}, \qquad \frac{\partial f}{\partial \hat{\pi}} = -\frac{(1-\alpha) \cdot \text{ITT}}{(\hat{\pi} - \alpha)^2}$$
What makes this clean is that $\text{Var}(\text{ITT})$ and $\text{Var}(\hat{\pi})$ are independent: the ITT comes from post-treatment outcome data and $\hat{\pi}$ comes from delivery-window data, two non-overlapping temporal windows. The variance of the corrected LATE is just the sum of the two squared-gradient terms, each multiplied by the corresponding variance. The 95% CI is $\widehat{\text{LATE}}_c \pm 1.96 \cdot \hat{\text{se}}$.
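Because the two variance terms just add, the standard error is a few lines of code. A sketch; the signature is mine, and the independence it relies on is exactly the non-overlapping-windows argument above:

```python
import math

def delta_method_ci(itt, var_itt, pi_hat, var_pi, alpha=0.05, z=1.96):
    """Delta-method 95% CI for (1 - alpha) * ITT / (pi_hat - alpha),
    treating ITT and pi_hat as independent (non-overlapping windows)."""
    denom = pi_hat - alpha
    late = (1 - alpha) * itt / denom
    d_itt = (1 - alpha) / denom                       # df / dITT
    d_pi = -(1 - alpha) * itt / denom ** 2            # df / dpi_hat
    se = math.sqrt(d_itt ** 2 * var_itt + d_pi ** 2 * var_pi)
    return late, (late - z * se, late + z * se)
```

With ITT $= 67.5$ and $\hat{\pi} = 0.7625$ (the expected values at $\pi = 0.75$, $\tau = 3$), the point estimate comes out at exactly 90.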
This independence is worth pausing on, because it is the same temporal structure doing double duty. The three-window design, pre-treatment, delivery, post-treatment, is what makes the compliance classification possible in the first place (you need a baseline to test against and a delivery window to test on). And it is also what makes the inference tractable (the classification and the outcome use different data). The connection to the broader "test-and-select" literature is direct: Hazard and Löwe (2023) propose selecting subgroups with nonzero first stages and using cross-fitting across sample units for valid inference; Chernozhukov et al. (2018) provide the general DML framework for two-step procedures where nuisance parameters are estimated before the target. In our setting, the temporal structure of the experiment provides a stronger form of the same separation, not a random partition of units but a structural partition of the data-generating process.
So much for the theory. Does the CI actually cover?
I simulate 1,000 experiments at each of 12 parameter configurations, using a setup calibrated to a real experiment: 500 software packages (250 treated, 250 control), 30 days of pre-treatment data, a 6-day delivery window with 17 mechanical downloads per day, and a 30-day post-treatment window with a social proof effect of $\tau$ additional organic downloads per day for compliers. The true LATE is $\tau \times 30$.
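A condensed version of one simulated experiment, to make the setup concrete. The organic baseline mean and noise level below are placeholder values, not the calibrated ones; the function name is mine:

```python
import numpy as np

def simulate_once(n=500, pi=0.75, tau=3.0, t_pre=30, t_del=6, t_post=30,
                  mech=17.0, mu=20.0, sigma=3.0, alpha=0.05, seed=0):
    """One experiment: three windows of daily downloads, per-unit
    compliance classification, and the corrected LATE."""
    rng = np.random.default_rng(seed)
    treated = np.arange(n) < n // 2
    complier = rng.random(n) < pi                    # latent compliance type
    # Daily downloads: organic noise everywhere, plus the mechanical dose in
    # the delivery window and tau/day post-treatment for treated compliers.
    pre = rng.normal(mu, sigma, (n, t_pre))
    deliv = rng.normal(mu, sigma, (n, t_del)) + mech * (treated & complier)[:, None]
    post = rng.normal(mu, sigma, (n, t_post)) + tau * (treated & complier)[:, None]

    # Per-unit one-sided test against the pre-treatment baseline
    # (1.645 is the one-sided normal critical value for alpha = 0.05).
    sd = pre.std(axis=1, ddof=1)
    t = (deliv.mean(1) - pre.mean(1)) / (sd * np.sqrt(1 / t_pre + 1 / t_del))
    d_hat = treated & (t > 1.645)

    itt = post.sum(1)[treated].mean() - post.sum(1)[~treated].mean()
    pi_hat = d_hat[treated].mean()
    return (1 - alpha) * itt / (pi_hat - alpha)      # corrected LATE
```

One draw at the defaults should land near the true LATE of $\tau \times 30 = 90$, up to sampling noise.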
At the baseline ($\pi = 0.75$, $\tau = 3$, true LATE $= 90$), the corrected LATE averages 89.4, the delta method 95% CI has coverage of 0.974, and the mean CI width is 42.6. The results across scenarios:
| Scenario | LATE | Bias | Coverage | Width |
|---|---|---|---|---|
| Baseline ($\pi$=.75, $\tau$=3) | 89.4 | -0.6 | 0.974 | 42.6 |
| Low compliance ($\pi$=.50) | 89.5 | -0.5 | 0.972 | 65.9 |
| High compliance ($\pi$=.90) | 89.9 | -0.1 | 0.959 | 33.7 |
| Small effect ($\tau$=1) | 29.6 | -0.4 | 0.956 | 39.0 |
| Large effect ($\tau$=10) | 300.1 | 0.1 | 1.000 | 73.1 |
| Null effect ($\tau$=0) | -0.3 | -0.3 | 0.947 | 38.4 |
| Small sample (N=100) | 89.7 | -0.3 | 0.974 | 96.9 |
| Large sample (N=2000) | 89.7 | -0.3 | 0.969 | 21.3 |
| High noise ($\sigma$=5) | 89.6 | -0.4 | 0.969 | 39.5 |
| Weak signal (mech=5/day) | 90.0 | 0.0 | 0.967 | 42.8 |
| Short pre-period (T=7) | 89.7 | -0.3 | 0.963 | 42.6 |
| Tighter test ($\alpha$=.01) | 90.0 | 0.0 | 0.974 | 42.7 |
Coverage ranges from 0.947 to 0.974, right where a slightly conservative 95% CI should land. It holds at the null ($\tau = 0$, coverage 0.947), which matters: the CI correctly includes zero when there is no effect. It holds at $\pi = 0.50$ (coverage 0.972), though the CI is wider because compliance uncertainty enters the denominator and gets amplified. Width scales with $1/\sqrt{N}$ as expected. Bootstrap percentile CIs, computed at baseline with 300 draws per simulation, give coverage of 0.934 with a tighter width of 38.0. The slight undercoverage is the usual finite-sample behavior of the percentile bootstrap.
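The percentile bootstrap is also simple to sketch: resample treated and control units independently and recompute both pieces of the ratio each draw. Argument names here are illustrative; `d_hat_treated` holds the per-unit classifications from the compliance test:

```python
import numpy as np

def bootstrap_ci(d_hat_treated, y_post_treated, y_post_control,
                 alpha=0.05, n_boot=300, seed=0):
    """Percentile bootstrap for the corrected LATE: resample units within
    each arm, recompute ITT and pi_hat, and take the 2.5/97.5 percentiles."""
    rng = np.random.default_rng(seed)
    n_t, n_c = len(y_post_treated), len(y_post_control)
    ests = []
    for _ in range(n_boot):
        it = rng.integers(0, n_t, n_t)               # resample treated units
        ic = rng.integers(0, n_c, n_c)               # resample control units
        itt = y_post_treated[it].mean() - y_post_control[ic].mean()
        pi_hat = d_hat_treated[it].mean()
        ests.append((1 - alpha) * itt / (pi_hat - alpha))
    return np.percentile(ests, [2.5, 97.5])
```

Resampling whole units keeps each unit's outcome and classification together, so the within-draw correlation between ITT and $\hat{\pi}$ is preserved automatically.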
The method breaks down in one specific way: when the mechanical dose is small relative to organic noise, the $t$-test has low power, compliers get misclassified as never-takers (TPR $< 1$), and the correction formula, which assumes TPR $= 1$, overshoots. In a high-noise simulation ($\sigma = 5$) with only 5 mechanical downloads per day, sensitivity drops to 0.823 and the LATE estimate is biased upward by about 20%. At 10 mechanical downloads per day with the same noise, the method recovers fully. The practical diagnostic is whether $\hat{\pi}$ is lower than you would expect from what you know about your delivery process. If it is, the signal is too weak and you should report the ITT.