No Logbook, No Problem: Recovering LATE When You Can't Observe Compliance
Suppose you run a randomized experiment with imperfect compliance. You assigned treatment, measured outcomes, and now you want the LATE. The Wald estimator says divide the ITT by the compliance rate, $\pi = E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]$, and you're done. Except you can't observe $D_i$. Maybe the delivery script ran overnight and the logs got overwritten. Maybe the platform doesn't surface unit-level attribution. The ITT is there for the taking, but it's attenuated by $\pi$, and without knowing $\pi$ you can't get the LATE.
This looks like a dead end, but it isn't, as long as your experiment has a particular temporal structure. Many experiments do. The treatment is applied over a known delivery window that is separate from the outcome measurement window, and each unit has a pre-treatment history that characterizes its baseline behavior. When the treatment, successfully delivered, produces a large deterministic shift during the delivery window, compliers and never-takers look different in a way that a statistical test can pick up.
The test itself is simple. For each treated unit, compare its mean outcome during the delivery window to its pre-treatment mean, using the pre-treatment standard deviation as the reference:
$$t_i = \frac{\bar{y}_{i,\text{treat}} - \bar{y}_{i,\text{pre}}}{\hat{\sigma}_i \sqrt{1/T_{\text{pre}} + 1/T_{\text{treat}}}}$$
Classify unit $i$ as a complier ($\hat{D}_i = 1$) if the one-sided $p$-value of $t_i$ satisfies $p_i < \alpha$, and as a never-taker otherwise. Control units get $\hat{D}_i = 0$ by construction (one-sided noncompliance). The inferred compliance rate is $\hat{\pi} = \bar{\hat{D}}_{Z=1}$, the share of treated units that pass the test.
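In code, the per-unit test is a few lines. A minimal sketch, assuming daily download counts as the outcome; the function name is mine, and I use a normal approximation for the one-sided $p$-value (a $t$ reference with $T_{\text{pre}} - 1$ degrees of freedom would be slightly more exact):

```python
import math
import numpy as np

def classify_complier(y_pre, y_treat, alpha=0.05):
    """Per-unit compliance test: did the delivery-window mean shift up
    relative to the pre-treatment baseline? Uses the pre-treatment sd as
    the reference and a normal approximation for the one-sided p-value."""
    t_pre, t_treat = len(y_pre), len(y_treat)
    sigma = np.std(y_pre, ddof=1)                      # pre-treatment sd
    t_i = (np.mean(y_treat) - np.mean(y_pre)) / (
        sigma * math.sqrt(1 / t_pre + 1 / t_treat))
    p_i = 0.5 * math.erfc(t_i / math.sqrt(2))          # one-sided p-value
    return int(p_i < alpha)                            # 1 = inferred complier
```

A unit getting a 17-download-per-day mechanical dose on top of single-digit organic noise produces a $t_i$ in the double digits, so the classification is essentially deterministic for true compliers.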
The trouble with $\hat{\pi}$ is that it overestimates true compliance, because some never-takers will pass the test by chance. But the overestimation is well characterized: the false positive rate among never-takers is $\alpha$, the test size, by construction. So the expected inferred compliance rate decomposes as $E[\hat{\pi}] = \pi \cdot \text{TPR} + (1-\pi) \cdot \alpha$, where TPR is the probability that a true complier passes the test. When the mechanical dose is large relative to organic noise, TPR goes to 1 and the expression collapses to $E[\hat{\pi}] = \pi + (1-\pi)\alpha$. Inverting gives a corrected compliance rate:
$$\pi_c = \frac{\hat{\pi} - \alpha}{1 - \alpha}$$
and a corrected LATE:
$$\widehat{\text{LATE}}_c = \frac{(1-\alpha) \cdot \text{ITT}}{\hat{\pi} - \alpha}$$
The correction amounts to subtracting the expected number of false positives from the inferred complier count and rescaling. In simulations calibrated to a real experiment (details below), $\hat{\pi}$ comes in at 0.761, the correction brings it to 0.748, and the true $\pi$ is 0.750.
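Both corrections are one-liners. A sketch using the simulated numbers above ($\hat{\pi} = 0.761$, $\alpha = 0.05$); the function names are illustrative:

```python
def correct_compliance(pi_hat, alpha=0.05):
    """Invert E[pi_hat] = pi + (1 - pi) * alpha for pi."""
    return (pi_hat - alpha) / (1 - alpha)

def corrected_late(itt, pi_hat, alpha=0.05):
    """Wald estimator with the false-positive-corrected first stage."""
    return itt / correct_compliance(pi_hat, alpha)

print(round(correct_compliance(0.761), 3))   # 0.748, against a true pi of 0.750
```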
A corrected point estimate is only half the job. The corrected LATE is a ratio of two estimated quantities, the ITT and $\hat{\pi}$, and we need a standard error. The delta method applies directly. Writing $f(\text{ITT}, \hat{\pi}) = (1-\alpha) \cdot \text{ITT} / (\hat{\pi} - \alpha)$, the partial derivatives are:
$$\frac{\partial f}{\partial \text{ITT}} = \frac{1-\alpha}{\hat{\pi} - \alpha}, \qquad \frac{\partial f}{\partial \hat{\pi}} = -\frac{(1-\alpha) \cdot \text{ITT}}{(\hat{\pi} - \alpha)^2}$$
What makes this clean is that $\text{Var}(\text{ITT})$ and $\text{Var}(\hat{\pi})$ are independent: the ITT comes from post-treatment outcome data and $\hat{\pi}$ comes from delivery-window data, two non-overlapping temporal windows. The variance of the corrected LATE is just the sum of the two squared-gradient terms, each multiplied by the corresponding variance. The 95% CI is $\widehat{\text{LATE}}_c \pm 1.96 \cdot \hat{\text{se}}$.
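Because the two variance terms just add, the standard error is a few lines of code. A sketch; the signature is mine, and the independence it relies on is exactly the non-overlapping-windows argument above:

```python
import math

def delta_method_ci(itt, var_itt, pi_hat, var_pi, alpha=0.05, z=1.96):
    """Delta-method 95% CI for (1 - alpha) * ITT / (pi_hat - alpha),
    treating ITT and pi_hat as independent (non-overlapping windows)."""
    denom = pi_hat - alpha
    late = (1 - alpha) * itt / denom
    d_itt = (1 - alpha) / denom                       # df / dITT
    d_pi = -(1 - alpha) * itt / denom ** 2            # df / dpi_hat
    se = math.sqrt(d_itt ** 2 * var_itt + d_pi ** 2 * var_pi)
    return late, (late - z * se, late + z * se)
```

With ITT $= 67.5$ and $\hat{\pi} = 0.7625$ (the expected values at $\pi = 0.75$, $\tau = 3$), the point estimate comes out at exactly 90.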
This independence is worth pausing on, because it is the same temporal structure doing double duty. The three-window design, pre-treatment, delivery, post-treatment, is what makes the compliance classification possible in the first place (you need a baseline to test against and a delivery window to test on). And it is also what makes the inference tractable (the classification and the outcome use different data). The connection to the broader "test-and-select" literature is direct: Hazard and Löwe (2023) propose selecting subgroups with nonzero first stages and using cross-fitting across sample units for valid inference; Chernozhukov et al. (2018) provide the general DML framework for two-step procedures where nuisance parameters are estimated before the target. In our setting, the temporal structure of the experiment provides a stronger form of the same separation, not a random partition of units but a structural partition of the data-generating process.
So much for the theory. Does the CI actually cover?
I simulate 1,000 experiments at each of 12 parameter configurations, using a setup calibrated to a real experiment: 500 software packages (250 treated, 250 control), 30 days of pre-treatment data, a 6-day delivery window with 17 mechanical downloads per day, and a 30-day post-treatment window with a social proof effect of $\tau$ additional organic downloads per day for compliers. The true LATE is $\tau \times 30$.
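A condensed version of one simulated experiment, to make the setup concrete. The organic baseline mean and noise level below are placeholder values, not the calibrated ones; the function name is mine:

```python
import numpy as np

def simulate_once(n=500, pi=0.75, tau=3.0, t_pre=30, t_del=6, t_post=30,
                  mech=17.0, mu=20.0, sigma=3.0, alpha=0.05, seed=0):
    """One experiment: three windows of daily downloads, per-unit
    compliance classification, and the corrected LATE."""
    rng = np.random.default_rng(seed)
    treated = np.arange(n) < n // 2
    complier = rng.random(n) < pi                    # latent compliance type
    # Daily downloads: organic noise everywhere, plus the mechanical dose in
    # the delivery window and tau/day post-treatment for treated compliers.
    pre = rng.normal(mu, sigma, (n, t_pre))
    deliv = rng.normal(mu, sigma, (n, t_del)) + mech * (treated & complier)[:, None]
    post = rng.normal(mu, sigma, (n, t_post)) + tau * (treated & complier)[:, None]

    # Per-unit one-sided test against the pre-treatment baseline
    # (1.645 is the one-sided normal critical value for alpha = 0.05).
    sd = pre.std(axis=1, ddof=1)
    t = (deliv.mean(1) - pre.mean(1)) / (sd * np.sqrt(1 / t_pre + 1 / t_del))
    d_hat = treated & (t > 1.645)

    itt = post.sum(1)[treated].mean() - post.sum(1)[~treated].mean()
    pi_hat = d_hat[treated].mean()
    return (1 - alpha) * itt / (pi_hat - alpha)      # corrected LATE
```

One draw at the defaults should land near the true LATE of $\tau \times 30 = 90$, up to sampling noise.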
At the baseline ($\pi = 0.75$, $\tau = 3$, true LATE $= 90$), the corrected LATE averages 89.4, the delta method 95% CI has coverage of 0.974, and the mean CI width is 42.6. The results across scenarios:
| Scenario | LATE | Bias | Coverage | Width |
|---|---|---|---|---|
| Baseline ($\pi$=.75, $\tau$=3) | 89.4 | -0.6 | 0.974 | 42.6 |
| Low compliance ($\pi$=.50) | 89.5 | -0.5 | 0.972 | 65.9 |
| High compliance ($\pi$=.90) | 89.9 | -0.1 | 0.959 | 33.7 |
| Small effect ($\tau$=1) | 29.6 | -0.4 | 0.956 | 39.0 |
| Large effect ($\tau$=10) | 300.1 | 0.1 | 1.000 | 73.1 |
| Null effect ($\tau$=0) | -0.3 | -0.3 | 0.947 | 38.4 |
| Small sample (N=100) | 89.7 | -0.3 | 0.974 | 96.9 |
| Large sample (N=2000) | 89.7 | -0.3 | 0.969 | 21.3 |
| High noise ($\sigma$=5) | 89.6 | -0.4 | 0.969 | 39.5 |
| Weak signal (mech=5/day) | 90.0 | 0.0 | 0.967 | 42.8 |
| Short pre-period (T=7) | 89.7 | -0.3 | 0.963 | 42.6 |
| Tighter test ($\alpha$=.01) | 90.0 | 0.0 | 0.974 | 42.7 |
Coverage ranges from 0.947 to 0.974, right where a slightly conservative 95% CI should land. It holds at the null ($\tau = 0$, coverage 0.947), which matters: the CI correctly includes zero when there is no effect. It holds at $\pi = 0.50$ (coverage 0.972), though the CI is wider because compliance uncertainty enters the denominator and gets amplified. Width scales with $1/\sqrt{N}$ as expected. Bootstrap percentile CIs, computed at baseline with 300 draws per simulation, give coverage of 0.934 with a tighter width of 38.0. The slight undercoverage is the usual finite-sample behavior of the percentile bootstrap.
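The percentile bootstrap is also simple to sketch: resample treated and control units independently and recompute both pieces of the ratio each draw. Argument names here are illustrative; `d_hat_treated` holds the per-unit classifications from the compliance test:

```python
import numpy as np

def bootstrap_ci(d_hat_treated, y_post_treated, y_post_control,
                 alpha=0.05, n_boot=300, seed=0):
    """Percentile bootstrap for the corrected LATE: resample units within
    each arm, recompute ITT and pi_hat, and take the 2.5/97.5 percentiles."""
    rng = np.random.default_rng(seed)
    n_t, n_c = len(y_post_treated), len(y_post_control)
    ests = []
    for _ in range(n_boot):
        it = rng.integers(0, n_t, n_t)               # resample treated units
        ic = rng.integers(0, n_c, n_c)               # resample control units
        itt = y_post_treated[it].mean() - y_post_control[ic].mean()
        pi_hat = d_hat_treated[it].mean()
        ests.append((1 - alpha) * itt / (pi_hat - alpha))
    return np.percentile(ests, [2.5, 97.5])
```

Resampling whole units keeps each unit's outcome and classification together, so the within-draw correlation between ITT and $\hat{\pi}$ is preserved automatically.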
The method breaks down in one specific way: when the mechanical dose is small relative to organic noise, the $t$-test has low power, compliers get misclassified as never-takers (TPR $< 1$), and the correction formula, which assumes TPR $= 1$, overshoots. In a high-noise simulation ($\sigma = 5$) with only 5 mechanical downloads per day, sensitivity drops to 0.823 and the LATE estimate is biased upward by about 20%. At 10 mechanical downloads per day with the same noise, the method recovers fully. The practical diagnostic is whether $\hat{\pi}$ is lower than you would expect from what you know about your delivery process. If it is, the signal is too weak and you should report the ITT.