Always-On Probability Calibration with Multiplicative Weights
Modern ad systems, recommendation pipelines, and risk models often rely on predicted probabilities—click-through rates (CTRs), conversion rates, or user intents. But raw model outputs are rarely well-calibrated. Left unchecked, miscalibration can distort bids, break fairness goals, or erode user trust.
The Problem
Suppose your model produces raw probabilities, $p_i^{\text{raw}}$, for binary outcomes, $y_i \in \{0,1\}$. Over time, model calibration drifts due to changing user behavior, traffic patterns, or creative mix. For each calibration bucket, $B_b$, we want
$$
\frac{1}{|B_b|} \sum_{i \in B_b} y_i \approx \frac{1}{|B_b|} \sum_{i \in B_b} p_i^{\text{cal}}
$$
That is: the average calibrated prediction should match the empirical outcome rate within each bucket—without interrupting traffic or retraining models.
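To make the bucket-level check concrete, here is a minimal sketch that compares the mean prediction to the empirical outcome rate in each bucket. The equal-width score buckets are an illustrative choice; the post does not prescribe a particular bucketing scheme.

```python
import numpy as np

def bucket_calibration_report(p_raw, y, n_buckets=10):
    """Compare mean predicted probability to empirical outcome rate per bucket.

    Equal-width score buckets are an illustrative assumption; any bucketing
    (e.g., by campaign or segment) works the same way.
    """
    p_raw = np.asarray(p_raw, dtype=float)
    y = np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    bucket_ids = np.clip(np.digitize(p_raw, edges) - 1, 0, n_buckets - 1)
    for b in range(n_buckets):
        mask = bucket_ids == b
        if not mask.any():
            continue
        print(f"bucket {b}: mean prediction {p_raw[mask].mean():.4f}, "
              f"observed rate {y[mask].mean():.4f}, n={mask.sum()}")
```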
Standard Approaches
Two common post-hoc calibration methods are:
- Platt scaling (logistic regression on raw scores)
- Isotonic regression (non-parametric, monotone fit via PAV)
These are typically run as batch jobs—retraining hourly or nightly. This creates a trade-off:
- Infrequent retraining → calibration drifts
- Frequent retraining → compute costs rise with cumulative traffic
Both approaches require re-accessing large volumes of past data.
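For reference, a minimal per-batch retraining pass with scikit-learn's off-the-shelf tools might look like the sketch below. The function names are illustrative; the key point is that both fits consume the full (score, outcome) history accumulated so far, which is the cost the streaming approach avoids.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

def platt_calibrate(scores_hist, y_hist, scores_new):
    """Platt scaling: logistic regression on raw scores, refit on all history."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_hist).reshape(-1, 1), y_hist)
    return lr.predict_proba(np.asarray(scores_new).reshape(-1, 1))[:, 1]

def isotonic_calibrate(scores_hist, y_hist, scores_new):
    """Isotonic regression: monotone non-parametric fit (PAV), refit on all history."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(scores_hist, y_hist)
    return iso.predict(scores_new)
```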
A Lightweight Alternative: MWU
We revisit this problem through the lens of Multiplicative Weights Updates (MWU). Each calibration bucket, $b$, has a bias factor, $c_b$, which gets nudged after each mini-batch of traffic using a single exponential update:
$$
c_b \leftarrow c_b \cdot \exp\bigl(-\eta (\tilde r_b - \hat r_b)\bigr)
$$
Where:
- $\hat r_b$ is the observed outcome rate (e.g., click-through)
- $\tilde r_b$ is the predicted average for that bucket
- $\eta$ is a learning rate
This update:
- Requires no solver, matrix inverse, or stored history
- Takes constant time per batch (independent of traffic volume)
- Provides online adaptation with provable regret bounds
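A minimal sketch of the streaming update is below. It assumes calibrated predictions are formed by scaling the raw prediction with the bucket's bias factor and clipping to [0, 1]; that scaling rule and the clipping range on $c_b$ are illustrative assumptions rather than details taken from the repo.

```python
import numpy as np

class MWUCalibrator:
    """Per-bucket bias factors, each nudged by one multiplicative-weights step per batch."""

    def __init__(self, n_buckets, eta=0.1, c_min=0.1, c_max=10.0):
        self.c = np.ones(n_buckets)              # one bias factor per calibration bucket
        self.eta = eta                           # learning rate
        self.c_min, self.c_max = c_min, c_max    # assumed clipping range for stability

    def calibrate(self, p_raw, bucket_ids):
        # Assumed rule: scale the raw probability by the bucket's factor, then clip.
        p_raw = np.asarray(p_raw, dtype=float)
        bucket_ids = np.asarray(bucket_ids)
        return np.clip(self.c[bucket_ids] * p_raw, 0.0, 1.0)

    def update(self, p_raw, y, bucket_ids):
        """One exponential update per bucket after observing a mini-batch of outcomes."""
        y = np.asarray(y, dtype=float)
        bucket_ids = np.asarray(bucket_ids)
        p_cal = self.calibrate(p_raw, bucket_ids)
        for b in np.unique(bucket_ids):
            mask = bucket_ids == b
            r_pred = p_cal[mask].mean()   # predicted average for the bucket (assumed calibrated)
            r_obs = y[mask].mean()        # observed outcome rate for the bucket
            self.c[b] *= np.exp(-self.eta * (r_pred - r_obs))
        self.c = np.clip(self.c, self.c_min, self.c_max)
```

Per batch, `calibrate` scores incoming traffic and `update` runs once outcomes arrive; both touch only the current batch, so the cost stays constant regardless of cumulative traffic.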
Experimental Results
We ran a synthetic ad stream with 200k impressions and induced drift. Each batch (5k events) was calibrated using:
- Platt scaling (retrained per batch)
- Isotonic regression (retrained per batch)
- MWU (streaming update)
| Method   | Mean Brier | Mean CPU Time (s) |
|----------|------------|-------------------|
| Platt    | 0.2051     | 0.0243            |
| Isotonic | 0.2045     | 0.0181            |
| MWU      | 0.2052     | 0.00039           |
All three methods achieved comparable calibration error. But MWU required 60–100× less compute, with no access to historical data and no periodic retrain.
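For context on the table, the Brier score is the mean squared error between calibrated probabilities and binary outcomes. A per-batch evaluation loop might look like this hypothetical sketch; the `batches` iterable and `calibrate` callable are assumed shapes, not code from the experiment.

```python
from sklearn.metrics import brier_score_loss

def mean_brier(batches, calibrate):
    """Average per-batch Brier score for a given calibration function."""
    scores = []
    for p_raw, y, bucket_ids in batches:   # assumed (raw probs, outcomes, bucket ids) tuples
        p_cal = calibrate(p_raw, bucket_ids)
        scores.append(brier_score_loss(y, p_cal))
    return sum(scores) / len(scores)
```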
When to Use This
MWU is well-suited for:
- High-frequency environments (ads, ranking, serving)
- Cases with frequent drift and latency constraints
- Systems where retraining schedules are brittle or costly
It is not a drop-in replacement for full model retrains, nor does it fix deep model mis-specification. But for lightweight, incremental correction, it’s a strong, principled baseline.
More at: https://github.com/finite-sample/mw-calibration
See also: https://www.gojiberries.io/calibration/