Using Reward Models to Select Pre-Training Data
Large code models currently treat all code as equal during pre-training, then use RLHF to fix the resulting problems. This is backwards. We should allocate pre-training compute toward good code from the start.
The approach is straightforward: score code quality using verifiable signals, then weight training examples proportionally to their scores while preserving diversity across languages and domains. No new model architectures, no complex RL machinery—just weighted maximum likelihood with carefully designed weights.
The Core Idea
Instead of uniform sampling from filtered data, we minimize a weighted next-token loss:
$$\mathcal{L}(\theta) = \mathbb{E}_{x\sim p}\left[w(x)\cdot\ell_\theta(x)\right]$$
where weights are:
$$w(x) = f\left(\tau + \alpha\cdot\frac{\tilde{r}(x) - \mu}{\sqrt{\sigma^2(x)+\varepsilon}}\right)$$
Here $\tilde{r}(x)$ is a calibrated quality score, $\sigma(x)$ captures uncertainty in that score, $\mu$ is a per-stratum baseline (normalized by language/domain/repo characteristics), $\alpha$ controls the strength of the quality bias, and $\tau$ is an offset that sets the weight of a score-neutral example. The transform $f$ (exponential or logistic) keeps weights bounded.
This is equivalent to training on a reweighted distribution $q(x) \propto w(x)p(x)$—we're changing what the model sees, not how it learns.
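A minimal sketch of the weight computation, assuming an exponential transform for $f$ and NumPy arrays of per-example quantities; the function name and the bound values are illustrative, not part of the formula above:

```python
import numpy as np

def example_weights(r_tilde, sigma, mu, alpha=1.0, tau=0.0,
                    eps=1e-6, w_min=0.25, w_max=4.0):
    """Bounded example weights from calibrated quality scores.

    r_tilde: calibrated quality scores, shape (N,)
    sigma:   per-example uncertainty estimates, shape (N,)
    mu:      per-stratum baseline for each example, shape (N,)
    """
    z = (r_tilde - mu) / np.sqrt(sigma**2 + eps)   # uncertainty-shrunk deviation from the stratum baseline
    w = np.exp(tau + alpha * z)                    # exponential choice of f
    return np.clip(w, w_min, w_max)                # keep weights bounded

# The weighted next-token loss is then just a weighted mean of per-example losses:
# loss = (example_weights(...) * per_example_loss).mean()
```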
What Makes Code "Good"
We don't reuse chat reward models trained on prompt-response preferences; code quality has its own, largely verifiable signals:
Static signals (broad coverage, cheap):
- Parse/compile success, type checking, linting
- Complexity metrics (cyclomatic, nesting depth) with outlier capping
- Test presence and structure (without execution)
- Documentation density and identifier clarity
Execution signals (sparse, high-value):
- CI/CD pass rates where available
- Unit test results under strict sandboxing
- Selected strategically via active learning where uncertainty is highest
Edit signals (learning from fixes):
- PR diffs that fix bugs or reduce complexity are a positive signal
- Reverted commits are a negative signal
- Map changes to token spans for fine-grained weighting
Synthetic pollution control:
- Penalize likely LLM-generated code (stylometry, watermarks, entropy patterns)
- Aggressive deduplication so templates don't dominate through redundancy
Crucially, uncertain scores shrink toward neutral through the $\sigma(x)$ term—we don't overcommit to noisy signals.
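As a concrete (and deliberately simplified) example of the cheap static tier, here is a sketch of feature extraction for Python snippets using only the standard-library `ast` module; the feature names are illustrative and a production scorer would cover many more languages and signals:

```python
import ast

def static_signals(source: str) -> dict:
    """Cheap static features for a Python snippet (illustrative subset)."""
    feats = {"parses": 0.0, "doc_density": 0.0, "max_depth": 0.0, "has_tests": 0.0}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return feats                      # unparseable code keeps zero-valued features
    feats["parses"] = 1.0

    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if funcs:
        documented = sum(ast.get_docstring(f) is not None for f in funcs)
        feats["doc_density"] = documented / len(funcs)
        feats["has_tests"] = float(any(f.name.startswith("test_") for f in funcs))

    def depth(node, d=0):                 # nesting depth as a rough complexity proxy
        children = list(ast.iter_child_nodes(node))
        return d if not children else max(depth(c, d + 1) for c in children)
    feats["max_depth"] = float(depth(tree))
    return feats
```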
Preserving Diversity
Pure quality optimization would collapse to a narrow distribution of "perfect" code. We prevent this through stratified normalization:
- Define strata by language, domain, repo size, license type
- Normalize weights within strata: enforce $\sum_{x \in s} w(x) = C_s$
- Clip extremes: maintain $w_{min} \leq w(x) \leq w_{max}$
This moves probability mass toward quality within each slice while preserving coverage across the full distribution. We keep Python web frameworks, embedded C, academic MATLAB, and creative coding—just bias toward better examples of each.
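A sketch of the stratified normalization step, assuming a per-stratum budget $C_s$ that defaults to the stratum size (so the average weight within a stratum stays near 1); names and bounds are illustrative:

```python
import numpy as np
from collections import defaultdict

def normalize_within_strata(weights, strata, budget_per_stratum=None,
                            w_min=0.25, w_max=4.0):
    """Rescale weights so each stratum keeps a fixed total mass, then clip extremes.

    weights: raw quality weights, shape (N,)
    strata:  stratum id per example (e.g. (language, domain, repo_size_bucket))
    """
    weights = np.asarray(weights, dtype=float)
    out = np.empty_like(weights)
    groups = defaultdict(list)
    for i, s in enumerate(strata):
        groups[s].append(i)
    for s, idx in groups.items():
        idx = np.array(idx)
        # Enforce sum_{x in s} w(x) = C_s; default budget keeps the stratum's original mass.
        target = budget_per_stratum.get(s, len(idx)) if budget_per_stratum else len(idx)
        out[idx] = weights[idx] * target / weights[idx].sum()
    return np.clip(out, w_min, w_max)     # bound extremes after normalization
```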
Why Now?
The Goodhart problem is already here. Web corpora increasingly contain LLM-generated code—often fluent but wrong. Training on this creates a feedback loop toward bland, incorrect patterns. Scoring based on verifiable signals (compilation, tests, type checking) counters this drift.
Existing work proves the value. DataComp-LM showed 40% compute savings through model-based filtering. CodeShell matched StarCoder with half the tokens through quality selection. But these use crude heuristics or binary filters. Weighted training with calibrated scores is the natural next step.
The infrastructure exists. We already have:
- Static analysis at scale (GitHub's CodeQL, Google's Kythe)
- Sandboxed execution (used for online judges, CI systems)
- Reward model training pipelines (from RLHF development)
We're not inventing new technology—we're combining existing pieces more thoughtfully.
Implementation Pipeline
- Initial filtering: License compliance, basic deduplication
- Static scoring: Run cheap analyses, extract repo metadata
- Train scorer: Learn $s_\phi(x)$ from static features → quality labels
- Calibrate: Isotonic regression per stratum for $\tilde{r}(x)$
- Selective execution: Run tests on high-uncertainty samples via active learning
- Compute weights: Apply formula with curriculum schedule for $\alpha$
- Train model: Standard pre-training with example weights
The expensive part—execution—is kept sparse through active selection. Most scoring uses cheap static signals.
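To make the calibration step concrete, here is a sketch of per-stratum isotonic regression, assuming scikit-learn is available and that verifiable labels (e.g. test pass rates) live in $[0, 1]$; the function and variable names are illustrative:

```python
from sklearn.isotonic import IsotonicRegression

def calibrate_per_stratum(scores, labels, strata):
    """Fit one isotonic map per stratum: raw scorer output s_phi(x) -> calibrated r~(x).

    scores: raw scorer outputs, length N
    labels: verifiable quality labels in [0, 1], length N
    strata: stratum id per example, length N
    """
    calibrators = {}
    for s in set(strata):
        idx = [i for i, t in enumerate(strata) if t == s]
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit([scores[i] for i in idx], [labels[i] for i in idx])
        calibrators[s] = iso
    return calibrators

# Calibrated score for a new example x in stratum s:
# r_tilde = calibrators[s].predict([raw_score])[0]
```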
Relationship to RLHF
This doesn't replace RLHF. Weighted pre-training improves the base distribution; RLHF handles interaction-time constraints (tool use, refusal, safety). The goal is less remedial post-training—fixing fundamentals the model should have learned initially.
Think of it this way:
- Pre-training with weights: Learn from better examples
- RLHF/DPO: Adapt behavior for specific use cases
Both are valuable, but fixing data quality at the source is more efficient than correction after the fact.
Evaluation and Validation
Claims about "good code" require careful evaluation:
- Clean held-out sets: Remove entire repos/packages from training
- Multiple metrics: Not just pass@k, but compilation rates, type checking success, test generation quality
- Diversity monitoring: Track entropy across languages/domains
- Ablations: Compare uniform vs. hard-filtered vs. weighted training
- RLHF reduction: Measure how much less post-training is needed
Document all decontamination. Report diversity metrics alongside quality improvements. Be honest about trade-offs.
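One simple way to monitor diversity, sketched below under the assumption that strata labels and final weights are available per example: report the weighted entropy over strata as an effective number of strata, and watch for it dropping well below the unweighted value.

```python
import numpy as np
from collections import Counter

def effective_coverage(strata, weights):
    """Weighted entropy over strata, reported as an effective number of strata."""
    mass = Counter()
    for s, w in zip(strata, weights):
        mass[s] += w
    p = np.array(list(mass.values()), dtype=float)
    p /= p.sum()
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))   # exp(entropy) = effective number of strata covered
```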
Risks and Mitigations
Distribution collapse: Stratified normalization and weight clipping prevent this, but monitor continuously.
Gaming the metrics: Rotate scoring functions, use ensembles, audit suspicious patterns. The uncertainty term naturally downweights unreliable signals.
Computational cost: Execution is expensive. Keep it sparse and targeted through active learning. Rely primarily on static signals.
Benchmark hacking: Strict decontamination at the repository level. Hold out entire problem sets, not just specific solutions.
Optional Specialization: Comprehensibility-First
One compelling instantiation optimizes for human comprehension over performance. Additional scoring signals:
- Comment quality and docstring completeness
- Variable name descriptiveness
- Function decomposition and single-responsibility
- Explicit error handling with informative messages
This trades execution speed for maintainability—appropriate when code serves as human-AI collaboration rather than production deployment. Evaluate on debugging time and modification success, not just execution metrics.
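A rough sketch of what the additional comprehensibility signals could look like for Python, complementing the static features earlier; the signal names and thresholds are illustrative choices, not a fixed recipe:

```python
import ast

def comprehensibility_signals(source: str) -> dict:
    """Illustrative readability signals beyond the basic static features."""
    out = {"short_name_ratio": 1.0, "explicit_raise_ratio": 0.0, "comment_density": 0.0}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return out
    # Fraction of single-character identifiers (lower suggests more descriptive names).
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    if names:
        out["short_name_ratio"] = sum(len(n) <= 1 for n in names) / len(names)
    # Raises that name an exception (vs. bare re-raises), a rough proxy for explicit error handling.
    raises = [n for n in ast.walk(tree) if isinstance(n, ast.Raise)]
    if raises:
        out["explicit_raise_ratio"] = sum(r.exc is not None for r in raises) / len(raises)
    # Comment density, counted on raw lines since the AST drops comments.
    lines = source.splitlines()
    if lines:
        out["comment_density"] = sum(l.lstrip().startswith("#") for l in lines) / len(lines)
    return out
```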