Using Reward Models to Select Pre-Training Data

Large code models currently treat all code as equal during pre-training, then use RLHF to fix the resulting problems. This is backwards. We should allocate pre-training compute toward good code from the start.

The approach is straightforward: score code quality using verifiable signals, then weight training examples proportionally to their scores while preserving diversity across languages and domains. No new model architectures, no complex RL machinery—just weighted maximum likelihood with carefully designed weights.

The Core Idea

Instead of uniform sampling from filtered data, we minimize a weighted next-token loss:

$$\mathcal{L}(\theta) = \mathbb{E}_{x\sim p}\left[w(x)\cdot\ell_\theta(x)\right]$$

where weights are:

$$w(x) = f\left(\tau + \alpha\cdot\frac{\tilde{r}(x) - \mu}{\sqrt{\sigma^2(x)+\varepsilon}}\right)$$

Here $\tilde{r}(x)$ is a calibrated quality score, $\sigma(x)$ captures uncertainty, $\mu$ is a per-stratum baseline (normalized by language/domain/repo characteristics), and $\alpha$ controls the strength of quality bias. The transform $f$ (exponential or logistic) keeps weights bounded.

This is equivalent to training on a reweighted distribution $q(x) \propto w(x)p(x)$—we're changing what the model sees, not how it learns.
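
A minimal sketch of what this weighting looks like in code, assuming calibrated scores, uncertainty estimates, and stratum baselines are already computed; the function name, default bounds, and the choice of $f=\exp$ are illustrative rather than prescriptive:

```python
import numpy as np

def example_weights(r_tilde, sigma2, mu_stratum, alpha=1.0, tau=0.0,
                    eps=1e-6, w_min=0.25, w_max=4.0):
    """w(x) = clip(exp(tau + alpha * (r - mu) / sqrt(sigma^2 + eps)), w_min, w_max).

    r_tilde    -- calibrated quality score per example, shape (n,)
    sigma2     -- variance of that score (uncertainty), shape (n,)
    mu_stratum -- baseline of each example's stratum, shape (n,)
    """
    r_tilde, sigma2, mu_stratum = map(np.asarray, (r_tilde, sigma2, mu_stratum))
    z = (r_tilde - mu_stratum) / np.sqrt(sigma2 + eps)  # high uncertainty -> z near 0
    w = np.exp(tau + alpha * z)                          # exponential choice of f
    return np.clip(w, w_min, w_max)                      # keep weights bounded

# Weighted pre-training then just scales each example's next-token loss:
#   loss = (weights * per_example_nll).mean()
```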

What Makes Code "Good"

We don't reuse chat reward models trained on prompt-response preferences; code quality is better grounded in verifiable signals. Some of those signals:

Static signals (broad coverage, cheap):

  • Parse/compile success, type checking, linting
  • Complexity metrics (cyclomatic, nesting depth) with outlier capping
  • Test presence and structure (without execution)
  • Documentation density and identifier clarity
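
As a concrete illustration, here is a toy Python-only version of a few of these signals using the standard `ast` module; a production pipeline would lean on per-language linters, type checkers, and proper complexity tools instead:

```python
import ast

def static_signals(source: str) -> dict:
    """Cheap, execution-free signals for one Python file (illustrative only)."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {"parses": 0.0, "docstring_density": 0.0, "ast_depth": 0.0}

    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    documented = sum(1 for f in funcs if ast.get_docstring(f))

    def depth(node, d=0):  # crude proxy for nesting depth
        children = list(ast.iter_child_nodes(node))
        return d if not children else max(depth(c, d + 1) for c in children)

    return {
        "parses": 1.0,
        "docstring_density": documented / max(len(funcs), 1),
        "ast_depth": float(depth(tree)),  # capped / normalized downstream
    }
```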

Execution signals (sparse, high-value):

  • CI/CD pass rates where available
  • Unit test results under strict sandboxing
  • Selected strategically via active learning where uncertainty is highest

Edit signals (learning from fixes):

  • PR diffs that fix bugs or reduce complexity are a positive signal
  • Reverted commits are a negative signal
  • Map changes to token spans for fine-grained weighting
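
A minimal sketch of the span-mapping idea using `difflib`: line ranges that a bug-fix diff introduced or rewrote become candidates for extra positive weight on the corresponding tokens (the weighting itself is left out):

```python
import difflib

def fixed_line_spans(before: str, after: str):
    """Line ranges in the post-fix file that the diff introduced or rewrote.

    Tokens falling inside these spans can be upweighted; reverted commits
    would contribute negative-signal spans in the same way.
    """
    matcher = difflib.SequenceMatcher(a=before.splitlines(), b=after.splitlines())
    return [(j1, j2) for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
            if tag in ("replace", "insert")]
```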

Synthetic pollution control:

  • Penalize likely LLM-generated code (stylometry, watermarks, entropy patterns)
  • Aggressive deduplication so templates don't dominate through redundancy

Crucially, uncertain scores shrink toward neutral through the $\sigma(x)$ term—we don't overcommit to noisy signals.

Preserving Diversity

Pure quality optimization would collapse to a narrow distribution of "perfect" code. We prevent this through stratified normalization:

  1. Define strata by language, domain, repo size, license type
  2. Normalize weights within strata: enforce $\sum_{x \in s} w(x) = C_s$
  3. Clip extremes: maintain $w_{\min} \leq w(x) \leq w_{\max}$

This moves probability mass toward quality within each slice while preserving coverage across the full distribution. We keep Python web frameworks, embedded C, academic MATLAB, and creative coding—just bias toward better examples of each.
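
A sketch of steps 2 and 3, assuming raw weights and stratum labels are already in hand; each stratum keeps the total mass it would have under uniform weighting, so quality only reshuffles probability within a slice:

```python
import numpy as np

def stratified_normalize(w, strata, w_min=0.25, w_max=4.0):
    """Rescale weights so every stratum keeps its uniform-weighting mass, then clip.

    w      -- raw weights, shape (n,)
    strata -- stratum label per example (e.g. "python/web", "c/embedded")
    """
    w, strata = np.asarray(w, dtype=float).copy(), np.asarray(strata)
    for s in np.unique(strata):
        idx = strata == s
        w[idx] *= idx.sum() / w[idx].sum()  # enforce sum over stratum = C_s (stratum size)
    # Clipping afterwards perturbs stratum mass slightly; alternate the two
    # steps a few times if exact mass matters.
    return np.clip(w, w_min, w_max)
```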

Why Now?

The Goodhart problem is already here. Web corpora increasingly contain LLM-generated code—often fluent but wrong. Training on this creates a feedback loop toward bland, incorrect patterns. Scoring based on verifiable signals (compilation, tests, type checking) counters this drift.

Existing work proves the value. DataComp-LM showed 40% compute savings through model-based filtering. CodeShell matched StarCoder with half the tokens through quality selection. But these use crude heuristics or binary filters. Weighted training with calibrated scores is the natural next step.

The infrastructure exists. We already have:

  • Static analysis at scale (GitHub's CodeQL, Google's Kythe)
  • Sandboxed execution (used for online judges, CI systems)
  • Reward model training pipelines (from RLHF development)

We're not inventing new technology—we're combining existing pieces more thoughtfully.

Implementation Pipeline

  1. Initial filtering: License compliance, basic deduplication
  2. Static scoring: Run cheap analyses, extract repo metadata
  3. Train scorer: Learn $s_\phi(x)$ mapping static features to quality labels
  4. Calibrate: Isotonic regression per stratum for $\tilde{r}(x)$
  5. Selective execution: Run tests on high-uncertainty samples via active learning
  6. Compute weights: Apply the weight formula with a curriculum schedule for $\alpha$
  7. Train model: Standard pre-training with example weights

The expensive part—execution—is kept sparse through active selection. Most scoring uses cheap static signals.
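
For step 4, a hedged sketch of per-stratum calibration using scikit-learn's `IsotonicRegression`; the labels could be execution outcomes, CI results, or human ratings on whatever verified subset exists, and all variable names are illustrative:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrators(scores, labels, strata):
    """Fit one monotone score -> quality map per stratum (labels in [0, 1])."""
    scores, labels, strata = map(np.asarray, (scores, labels, strata))
    calibrators = {}
    for s in np.unique(strata):
        idx = strata == s
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(scores[idx], labels[idx])
        calibrators[s] = iso
    return calibrators

def calibrate(scores, strata, calibrators):
    """Apply the per-stratum maps to get r_tilde for every example."""
    scores, strata = np.asarray(scores, dtype=float), np.asarray(strata)
    r_tilde = np.empty_like(scores)
    for s, iso in calibrators.items():
        idx = strata == s
        r_tilde[idx] = iso.predict(scores[idx])
    return r_tilde
```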

Relationship to RLHF

This doesn't replace RLHF. Weighted pre-training improves the base distribution; RLHF handles interaction-time constraints (tool use, refusal, safety). The goal is less remedial post-training—fixing fundamentals the model should have learned initially.

Think of it this way:

  • Pre-training with weights: Learn from better examples
  • RLHF/DPO: Adapt behavior for specific use cases

Both are valuable, but fixing data quality at the source is more efficient than correction after the fact.

Evaluation and Validation

Claims about "good code" require careful evaluation:

  • Clean held-out sets: Remove entire repos/packages from training
  • Multiple metrics: Not just pass@k, but compilation rates, type checking success, test generation quality
  • Diversity monitoring: Track entropy across languages/domains
  • Ablations: Compare uniform vs. hard-filtered vs. weighted training
  • RLHF reduction: Measure how much less post-training is needed

Document all decontamination. Report diversity metrics alongside quality improvements. Be honest about trade-offs.
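
Diversity monitoring can be as simple as tracking the entropy of the weight-adjusted sampling mass across languages or domains and comparing it against the uniform-weight baseline; a minimal sketch:

```python
import numpy as np

def effective_entropy(w, groups):
    """Shannon entropy (bits) of the weight-adjusted sampling mass per group."""
    w, groups = np.asarray(w, dtype=float), np.asarray(groups)
    mass = np.array([w[groups == g].sum() for g in np.unique(groups)])
    p = mass / mass.sum()
    return float(-(p * np.log2(p + 1e-12)).sum())

# A large drop relative to the uniform-weight baseline signals collapse:
#   effective_entropy(weights, langs)  vs.  effective_entropy(np.ones_like(weights), langs)
```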

Risks and Mitigations

Distribution collapse: Stratified normalization and weight clipping prevent this, but monitor continuously.

Gaming the metrics: Rotate scoring functions, use ensembles, audit suspicious patterns. The uncertainty term naturally downweights unreliable signals.

Computational cost: Execution is expensive. Keep it sparse and targeted through active learning. Rely primarily on static signals.
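
"Targeted" can be as simple as spending the execution budget on the examples whose scores are most uncertain; a trivial sketch:

```python
import numpy as np

def select_for_execution(sigma2, budget: int):
    """Indices of the examples with the most uncertain scores, up to the budget."""
    return np.argsort(-np.asarray(sigma2, dtype=float))[:budget]
```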

Benchmark hacking: Strict decontamination at the repository level. Hold out entire problem sets, not just specific solutions.

Optional Specialization: Comprehensibility-First

One compelling instantiation optimizes for human comprehension over performance. Additional scoring signals:

  • Comment quality and docstring completeness
  • Variable name descriptiveness
  • Function decomposition and single-responsibility
  • Explicit error handling with informative messages

This trades execution speed for maintainability, which is appropriate when code serves as a medium for human-AI collaboration rather than production deployment. Evaluate on debugging time and modification success, not just execution metrics.
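
A toy sketch of two such signals for Python, comment density and identifier descriptiveness; the two-character threshold is a placeholder, not a tuned value:

```python
import ast
import io
import tokenize

def comprehensibility_signals(source: str) -> dict:
    """Crude comprehension-oriented signals for one Python file (placeholders only)."""
    tree = ast.parse(source)
    names = [n.id for n in ast.walk(tree) if isinstance(n, ast.Name)]
    comment_lines = {
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    }
    return {
        # share of identifiers longer than two characters (proxy for descriptive naming)
        "descriptive_names": sum(len(n) > 2 for n in names) / max(len(names), 1),
        # fraction of source lines that carry a comment
        "comment_density": len(comment_lines) / max(source.count("\n") + 1, 1),
    }
```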
