What's Wrong? Adversarial LLM Judges With Their Own Evaluation Criteria
How should we evaluate what large language models say? Surface metrics like BLEU or exact match don't work well when answers are open-ended or involve reasoning. Human judgment is better, but it is costly and inconsistent. So researchers have increasingly turned to LLMs themselves to act as evaluators, the setup commonly called LLM-as-a-judge.
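To make the setup concrete, here is a minimal sketch of an LLM-as-judge call, assuming the OpenAI Python client; the model name, rubric, and 1-to-5 scale are illustrative choices, not a prescription from this article.

```python
# Minimal LLM-as-judge sketch. Assumes the OpenAI Python client and an
# OPENAI_API_KEY in the environment; the model name and rubric are
# illustrative, not the article's method.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Candidate answer: {answer}

Rate the answer from 1 (poor) to 5 (excellent) for correctness and clarity.
Reply with the score only."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask an LLM to grade a candidate answer on a 1-5 scale."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # reduce scoring variance across calls
    )
    return int(response.choices[0].message.content.strip())

# Example:
# judge("What causes tides?", "Mostly the Moon's gravity, plus the Sun's.")
```

Even this toy version hints at the design decisions a judge entails: what rubric to use, what scale, and how much to trust a single sampled score.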