Always-On Probability Calibration with Multiplicative Weights
Modern ad systems, recommendation pipelines, and risk models often rely on predicted probabilities—click-through rates (CTRs), conversion rates, or user intents. But raw model outputs are rarely well-calibrated. Left unchecked, miscalibration can distort bids, break fairness goals, or erode user trust.
The Problem
Suppose your model produces raw probabilities, $p_i^{\text{raw}}$, for binary outcomes, $y_i \in \{0,1\}$. Over time, model calibration drifts due to changing user behavior, traffic patterns, or creative mix. For each calibration bucket, $B_b$, we want
$$
\frac{1}{|B_b|} \sum_{i \in B_b} y_i \approx \frac{1}{|B_b|} \sum_{i \in B_b} p_i^{\text{cal}}
$$
That is: the average calibrated prediction should match the empirical outcome rate within each bucket—without interrupting traffic or retraining models.
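To make the bucket-level check concrete, here is a minimal sketch that compares the mean prediction to the empirical outcome rate in each bucket. The equal-width score buckets are an illustrative choice; the post does not prescribe a particular bucketing scheme.

```python
import numpy as np

def bucket_calibration_report(p_raw, y, n_buckets=10):
    """Compare mean predicted probability to empirical outcome rate per bucket.

    Equal-width score buckets are an illustrative assumption; any bucketing
    (e.g., by campaign or segment) works the same way.
    """
    p_raw = np.asarray(p_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    bucket_ids = np.clip(np.digitize(p_raw, edges) - 1, 0, n_buckets - 1)
    for b in range(n_buckets):
        mask = bucket_ids == b
        if not mask.any():
            continue
        print(f"bucket {b}: mean prediction {p_raw[mask].mean():.4f}, "
              f"observed rate {y[mask].mean():.4f}, n={mask.sum()}")
```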
Standard Approaches
Two common post-hoc calibration methods are:
- Platt scaling (logistic regression on raw scores)
- Isotonic regression (non-parametric, monotone fit via PAV)
These are typically run as batch jobs—retraining hourly or nightly. This creates a trade-off:
- Infrequent retraining → calibration drifts
- Frequent retraining → compute costs rise with cumulative traffic
Both approaches require re-accessing large volumes of past data.
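For reference, a minimal per-batch retraining pass with scikit-learn's off-the-shelf tools might look like the sketch below. The function names are illustrative; the key point is that both fits consume the full (score, outcome) history accumulated so far, which is the cost the streaming approach avoids.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def platt_calibrate(scores_hist, y_hist, scores_new):
    """Platt scaling: logistic regression on raw scores, refit on all history."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_hist).reshape(-1, 1), y_hist)
    return lr.predict_proba(np.asarray(scores_new).reshape(-1, 1))[:, 1]

def isotonic_calibrate(scores_hist, y_hist, scores_new):
    """Isotonic regression: monotone non-parametric fit (PAV), refit on all history."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(scores_hist, y_hist)
    return iso.predict(scores_new)
```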
A Lightweight Alternative: MWU
We revisit this problem through the lens of Multiplicative Weights Updates (MWU). Each calibration bucket, $b$, has a bias factor, $c_b$, which gets nudged after each mini-batch of traffic using a single exponential update:
$$
c_b \leftarrow c_b \cdot \exp\bigl(-\eta (\tilde r_b - \hat r_b)\bigr)
$$
Where:
- $\hat r_b$ is the observed outcome rate (e.g., click-through)
- $\tilde r_b$ is the predicted average for that bucket
- $\eta$ is a learning rate
This update:
- Requires no solver, matrix inverse, or stored history
- Takes constant time per batch (independent of traffic volume)
- Provides online adaptation with provable regret bounds
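A minimal sketch of the streaming update is below. It assumes calibrated predictions are formed by scaling the raw prediction with the bucket's bias factor and clipping to [0, 1]; that scaling rule and the clipping range on $c_b$ are illustrative assumptions rather than details taken from the repo.

```python
import numpy as np

class MWUCalibrator:
    """Per-bucket bias factors, each nudged by one multiplicative-weights step per batch."""

    def __init__(self, n_buckets, eta=0.1, c_min=0.1, c_max=10.0):
        self.c = np.ones(n_buckets)              # one bias factor per calibration bucket
        self.eta = eta                           # learning rate
        self.c_min, self.c_max = c_min, c_max    # assumed clipping range for stability

    def calibrate(self, p_raw, bucket_ids):
        # Assumed rule: scale the raw probability by the bucket's factor, then clip.
        p_raw = np.asarray(p_raw, dtype=float)
        bucket_ids = np.asarray(bucket_ids)
        return np.clip(self.c[bucket_ids] * p_raw, 0.0, 1.0)

    def update(self, p_raw, y, bucket_ids):
        """One exponential update per bucket after observing a mini-batch of outcomes."""
        y = np.asarray(y, dtype=float)
        bucket_ids = np.asarray(bucket_ids)
        p_cal = self.calibrate(p_raw, bucket_ids)
        for b in np.unique(bucket_ids):
            mask = bucket_ids == b
            r_pred = p_cal[mask].mean()   # predicted average for the bucket (assumed calibrated)
            r_obs = y[mask].mean()        # observed outcome rate for the bucket
            self.c[b] *= np.exp(-self.eta * (r_pred - r_obs))
        self.c = np.clip(self.c, self.c_min, self.c_max)
```

Per batch, `calibrate` scores incoming traffic and `update` runs once outcomes arrive; both touch only the current batch, so the cost stays constant regardless of cumulative traffic.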
Experimental Results
We ran a synthetic ad stream with 200k impressions and induced drift. Each batch (5k events) was calibrated using:
- Platt scaling (retrained per batch)
- Isotonic regression (retrained per batch)
- MWU (streaming update)
| Method   | Mean Brier | Mean CPU Time (s) |
|----------|------------|-------------------|
| Platt    | 0.2051     | 0.0243            |
| Isotonic | 0.2045     | 0.0181            |
| MWU      | 0.2052     | 0.00039           |
All three methods achieved comparable calibration error. But MWU required 60–100× less compute, with no access to historical data and no periodic retrain.
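For context on the table, the Brier score is the mean squared error between calibrated probabilities and binary outcomes. A per-batch evaluation loop might look like this hypothetical sketch; the `batches` iterable and `calibrate` callable are assumed shapes, not code from the experiment.

```python
from sklearn.metrics import brier_score_loss

def mean_brier(batches, calibrate):
    """Average per-batch Brier score for a given calibration function."""
    scores = []
    for p_raw, y, bucket_ids in batches:   # assumed (raw probs, outcomes, bucket ids) tuples
        p_cal = calibrate(p_raw, bucket_ids)
        scores.append(brier_score_loss(y, p_cal))
    return sum(scores) / len(scores)
```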
When to Use This
MWU is well-suited for:
- High-frequency environments (ads, ranking, serving)
- Cases with frequent drift and latency constraints
- Systems where retraining schedules are brittle or costly
It is not a drop-in replacement for full model retrains, nor does it fix deep model mis-specification. But for lightweight, incremental correction, it’s a strong, principled baseline.
More at: https://github.com/finite-sample/mw-calibration
See also: https://www.gojiberries.io/calibration/