Always-On Probability Calibration with Multiplicative Weights


Modern ad systems, recommendation pipelines, and risk models often rely on predicted probabilities—click-through rates (CTRs), conversion rates, or user intents. But raw model outputs are rarely well-calibrated. Left unchecked, miscalibration can distort bids, break fairness goals, or erode user trust.

The Problem

Suppose your model produces raw probabilities, $p_i^{\text{raw}}$, for binary outcomes, $y_i \in \{0,1\}$. Over time, model calibration drifts due to changing user behavior, traffic patterns, or creative mix. For each calibration bucket, $B_b$, we want

$$
\frac{1}{|B_b|} \sum_{i \in B_b} y_i \approx \frac{1}{|B_b|} \sum_{i \in B_b} p_i^{\text{cal}}
$$

That is: the average calibrated prediction should match the empirical outcome rate within each bucket—without interrupting traffic or retraining models.
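To make the target concrete, here is a minimal check of that per-bucket condition. The bucketing scheme (raw-score quantiles) and the helper name are illustrative choices, not part of the method itself:

```python
import numpy as np

def calibration_gap(p_cal, y, p_raw, n_buckets=10):
    """Per-bucket (mean calibrated prediction - observed outcome rate).

    Buckets are raw-score quantiles; any fixed bucketing works.
    """
    edges = np.quantile(p_raw, np.linspace(0, 1, n_buckets + 1))
    buckets = np.clip(np.searchsorted(edges, p_raw, side="right") - 1,
                      0, n_buckets - 1)
    gaps = np.zeros(n_buckets)
    for b in range(n_buckets):
        mask = buckets == b
        if mask.any():
            gaps[b] = p_cal[mask].mean() - y[mask].mean()
    return gaps  # values near zero in every bucket = well calibrated
```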

Standard Approaches

Two common post-hoc calibration methods are:

  • Platt scaling (logistic regression on raw scores)
  • Isotonic regression (non-parametric, monotone fit via PAV)

These are typically run as batch jobs—retraining hourly or nightly. This creates a trade-off:

  • Infrequent retraining → calibration drifts
  • Frequent retraining → compute costs rise with cumulative traffic

Both approaches require re-accessing large volumes of past data.
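For reference, both fits are a few lines with scikit-learn. The data below is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
p_raw = rng.uniform(0.01, 0.5, size=50_000)
# Simulate a miscalibrated model: true rates run ~1.8x the raw scores.
y = (rng.random(p_raw.size) < np.clip(1.8 * p_raw, 0, 1)).astype(int)

# Platt scaling: logistic regression on the raw score (commonly its logit).
logit = np.log(p_raw / (1 - p_raw)).reshape(-1, 1)
platt = LogisticRegression().fit(logit, y)
p_platt = platt.predict_proba(logit)[:, 1]

# Isotonic regression: non-parametric, monotone fit via PAV.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_raw, y)
p_iso = iso.predict(p_raw)
```

Either way, each refit touches the accumulated history, which is exactly the cost the streaming approach below avoids.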

A Lightweight Alternative: MWU

We revisit this problem through the lens of Multiplicative Weights Updates (MWU). Each calibration bucket, $b$, has a bias factor, $c_b$, which gets nudged after each mini-batch of traffic using a single exponential update:

$$
c_b \leftarrow c_b \cdot \exp\bigl(-\eta (\tilde r_b - \hat r_b)\bigr)
$$

Where:

  • $\tilde r_b$ is the average calibrated prediction for the bucket
  • $\hat r_b$ is the observed outcome rate (e.g., empirical CTR)
  • $\eta$ is a learning rate

This update:

  • Requires no solver, matrix inverse, or stored history
  • Takes constant time per batch (independent of traffic volume)
  • Provides online adaptation with provable regret bounds
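A minimal streaming sketch of this update, assuming the correction is applied multiplicatively as $p^{\text{cal}} = c_b \cdot p^{\text{raw}}$. The equal-width bucketing, clipping range, and $\eta$ below are illustrative choices, not values from the repo:

```python
import numpy as np

class MWUCalibrator:
    def __init__(self, n_buckets=10, eta=0.5, clip=(1e-3, 1e3)):
        self.c = np.ones(n_buckets)  # multiplicative bias factor per bucket
        self.eta = eta
        self.clip = clip
        self.n_buckets = n_buckets

    def _bucket(self, p_raw):
        # Equal-width buckets on the raw score; quantile buckets also work.
        return np.minimum((p_raw * self.n_buckets).astype(int),
                          self.n_buckets - 1)

    def calibrate(self, p_raw):
        return np.clip(self.c[self._bucket(p_raw)] * p_raw, 0.0, 1.0)

    def update(self, p_raw, y):
        b = self._bucket(p_raw)
        for k in np.unique(b):
            mask = b == k
            r_pred = self.calibrate(p_raw[mask]).mean()  # tilde r_b
            r_obs = y[mask].mean()                       # hat r_b
            # The single exponential update from above.
            self.c[k] *= np.exp(-self.eta * (r_pred - r_obs))
        np.clip(self.c, *self.clip, out=self.c)
```

Per mini-batch, serving calls `calibrate(p_raw_batch)` and, once outcomes arrive, `update(p_raw_batch, y_batch)`; no history is retained beyond the bucket factors themselves.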

Experimental Results

We ran a synthetic ad stream of 200k impressions with induced calibration drift. Each batch (5k events) was calibrated using:

  • Platt scaling (retrained per batch)
  • Isotonic regression (retrained per batch)
  • MWU (streaming update)

| Method   | Mean Brier | Mean CPU Time (s) |
|----------|------------|-------------------|
| Platt    | 0.2051     | 0.0243            |
| Isotonic | 0.2045     | 0.0181            |
| MWU      | 0.2052     | 0.00039           |

All three methods achieved comparable calibration error, but MWU required roughly 45–60× less compute, with no access to historical data and no periodic retraining.
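For intuition, a rough sketch of the streaming side of such a benchmark, reusing the `MWUCalibrator` sketch above. The drift process here is illustrative, not the exact generator behind the table:

```python
import numpy as np

rng = np.random.default_rng(1)
n, batch = 200_000, 5_000
p_raw = rng.uniform(0.01, 0.5, size=n)
# Slowly varying multiplicative drift in the true outcome rate.
drift = 1.0 + 0.8 * np.sin(np.linspace(0, 3 * np.pi, n))
y = (rng.random(n) < np.clip(drift * p_raw, 0, 1)).astype(int)

cal = MWUCalibrator(eta=0.5)
briers = []
for i in range(0, n, batch):
    pr, yy = p_raw[i:i + batch], y[i:i + batch]
    p_cal = cal.calibrate(pr)                 # predict first...
    briers.append(np.mean((p_cal - yy) ** 2))
    cal.update(pr, yy)                        # ...then learn from outcomes
print(f"mean Brier: {np.mean(briers):.4f}")
```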

When to Use This

MWU is well-suited for:

  • High-frequency environments (ads, ranking, serving)
  • Cases with frequent drift and latency constraints
  • Systems where retraining schedules are brittle or costly

It is not a drop-in replacement for full model retrains, nor does it fix deep model mis-specification. But for lightweight, incremental correction, it’s a strong, principled baseline.

More at: https://github.com/finite-sample/mw-calibration

See also: https://www.gojiberries.io/calibration/
