Using Reward Models to Select Pre-Training Data
Large code models currently treat all code as equal during pre-training, then use RLHF to fix the resulting problems. This is backwards. We should allocate pre-training compute toward good code from the start.
The approach is straightforward: score code quality using verifiable signals, then weight training examples proportionally to their scores while preserving diversity across languages and domains. No new model architectures, no complex RL machinery—just weighted maximum likelihood with carefully designed weights.
The Core Idea
Instead of uniform sampling from filtered data, we minimize a weighted next-token loss:
$$\mathcal{L}(\theta) = \mathbb{E}_{x\sim p}\left[w(x)\cdot\ell_\theta(x)\right]$$
where weights are:
$$w(x) = f\left(\tau + \alpha\cdot\frac{\tilde{r}(x) - \mu}{\sqrt{\sigma^2(x)+\varepsilon}}\right)$$
Here $\tilde{r}(x)$ is a calibrated quality score, $\sigma(x)$ captures uncertainty in that score, $\mu$ is a per-stratum baseline (normalized by language/domain/repo characteristics), $\alpha$ controls the strength of the quality bias, and $\tau$ is an offset that sets the weight of a score-neutral example. The transform $f$ (exponential or logistic) keeps weights bounded.
This is equivalent to training on a reweighted distribution $q(x) \propto w(x)p(x)$—we're changing what the model sees, not how it learns.
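A minimal sketch of the weight computation, assuming an exponential transform for $f$ and NumPy arrays of per-example quantities; the function name and the bound values are illustrative, not part of the formula above:

```python
import numpy as np

def example_weights(r_tilde, sigma, mu, alpha=1.0, tau=0.0,
                    eps=1e-6, w_min=0.25, w_max=4.0):
    """Bounded example weights from calibrated quality scores.

    r_tilde: calibrated quality scores, shape (N,)
    sigma:   per-example uncertainty estimates, shape (N,)
    mu:      per-stratum baseline for each example, shape (N,)
    """
    z = (r_tilde - mu) / np.sqrt(sigma**2 + eps)   # uncertainty-shrunk deviation from the stratum baseline
    w = np.exp(tau + alpha * z)                    # exponential choice of f
    return np.clip(w, w_min, w_max)                # keep weights bounded

# The weighted next-token loss is then just a weighted mean of per-example losses:
# loss = (example_weights(...) * per_example_loss).mean()
```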
What Makes Code "Good"
We don't reuse chat reward models trained on prompt-response preferences; code quality has its own, largely verifiable signals:
Static signals (broad coverage, cheap):
- Parse/compile success, type checking, linting
- Complexity metrics (cyclomatic, nesting depth) with outlier capping
- Test presence and structure (without execution)
- Documentation density and identifier clarity
Execution signals (sparse, high-value):
- CI/CD pass rates where available
- Unit test results under strict sandboxing
- Selected strategically via active learning where uncertainty is highest
Edit signals (learning from fixes):
- PR diffs that fix bugs or reduce complexity are a positive signal
- Reverted commits are a negative signal
- Map changes to token spans for fine-grained weighting
Synthetic pollution control:
- Penalize likely LLM-generated code (stylometry, watermarks, entropy patterns)
- Aggressive deduplication so templates don't dominate through redundancy
Crucially, uncertain scores shrink toward neutral through the $\sigma(x)$ term—we don't overcommit to noisy signals.
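As a concrete (and deliberately simplified) example of the cheap static tier, here is a sketch of feature extraction for Python snippets using only the standard-library `ast` module; the feature names are illustrative and a production scorer would cover many more languages and signals:

```python
import ast

def static_signals(source: str) -> dict:
    """Cheap static features for a Python snippet (illustrative subset)."""
    feats = {"parses": 0.0, "doc_density": 0.0, "max_depth": 0.0, "has_tests": 0.0}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return feats                      # unparseable code keeps zero-valued features
    feats["parses"] = 1.0

    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if funcs:
        documented = sum(ast.get_docstring(f) is not None for f in funcs)
        feats["doc_density"] = documented / len(funcs)
        feats["has_tests"] = float(any(f.name.startswith("test_") for f in funcs))

    def depth(node, d=0):                 # nesting depth as a rough complexity proxy
        children = list(ast.iter_child_nodes(node))
        return d if not children else max(depth(c, d + 1) for c in children)
    feats["max_depth"] = float(depth(tree))
    return feats
```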
Preserving Diversity
Pure quality optimization would collapse to a narrow distribution of "perfect" code. We prevent this through stratified normalization:
- Define strata by language, domain, repo size, license type
- Normalize weights within strata: enforce $\sum_{x \in s} w(x) = C_s$
- Clip extremes: maintain $w_{min} \leq w(x) \leq w_{max}$
This moves probability mass toward quality within each slice while preserving coverage across the full distribution. We keep Python web frameworks, embedded C, academic MATLAB, and creative coding—just bias toward better examples of each.
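A sketch of the stratified normalization step, assuming a per-stratum budget $C_s$ that defaults to the stratum size (so the average weight within a stratum stays near 1); names and bounds are illustrative:

```python
import numpy as np
from collections import defaultdict

def normalize_within_strata(weights, strata, budget_per_stratum=None,
                            w_min=0.25, w_max=4.0):
    """Rescale weights so each stratum keeps a fixed total mass, then clip extremes.

    weights: raw quality weights, shape (N,)
    strata:  stratum id per example (e.g. (language, domain, repo_size_bucket))
    """
    weights = np.asarray(weights, dtype=float)
    out = np.empty_like(weights)
    groups = defaultdict(list)
    for i, s in enumerate(strata):
        groups[s].append(i)
    for s, idx in groups.items():
        idx = np.array(idx)
        # Enforce sum_{x in s} w(x) = C_s; default budget keeps the stratum's original mass.
        target = budget_per_stratum.get(s, len(idx)) if budget_per_stratum else len(idx)
        out[idx] = weights[idx] * target / weights[idx].sum()
    return np.clip(out, w_min, w_max)     # bound extremes after normalization
```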
Why Now?
The Goodhart problem is already here. Web corpora increasingly contain LLM-generated code—often fluent but wrong. Training on this creates a feedback loop toward bland, incorrect patterns. Scoring based on verifiable signals (compilation, tests, type checking) counters this drift.
Existing work proves the value. DataComp-LM showed 40% compute savings through model-based filtering. CodeShell matched StarCoder with half the tokens through quality selection. But these use crude heuristics or binary filters. Weighted training with calibrated scores is the natural next step.
The infrastructure exists. We already have:
- Static analysis at scale (GitHub's CodeQL, Google's Kythe)
- Sandboxed execution (used for online judges, CI systems)
- Reward model training pipelines (from RLHF development)
We're not inventing new technology—we're combining existing pieces more thoughtfully.
Implementation Pipeline
- Initial filtering: License compliance, basic deduplication
- Static scoring: Run cheap analyses, extract repo metadata
- Train scorer: Learn $s_\phi(x)$ from static features → quality labels
- Calibrate: Isotonic regression per stratum for $\tilde{r}(x)$
- Selective execution: Run tests on high-uncertainty samples via active learning
- Compute weights: Apply formula with curriculum schedule for $\alpha$
- Train model: Standard pre-training with example weights
The expensive part—execution—is kept sparse through active selection. Most scoring uses cheap static signals.
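To make the calibration step concrete, here is a sketch of per-stratum isotonic regression, assuming scikit-learn is available and that verifiable labels (e.g. test pass rates) live in $[0, 1]$; the function and variable names are illustrative:

```python
from sklearn.isotonic import IsotonicRegression

def calibrate_per_stratum(scores, labels, strata):
    """Fit one isotonic map per stratum: raw scorer output s_phi(x) -> calibrated r~(x).

    scores: raw scorer outputs, length N
    labels: verifiable quality labels in [0, 1], length N
    strata: stratum id per example, length N
    """
    calibrators = {}
    for s in set(strata):
        idx = [i for i, t in enumerate(strata) if t == s]
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit([scores[i] for i in idx], [labels[i] for i in idx])
        calibrators[s] = iso
    return calibrators

# Calibrated score for a new example x in stratum s:
# r_tilde = calibrators[s].predict([raw_score])[0]
```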
Relationship to RLHF
This doesn't replace RLHF. Weighted pre-training improves the base distribution; RLHF handles interaction-time constraints (tool use, refusal, safety). The goal is less remedial post-training—fixing fundamentals the model should have learned initially.
Think of it this way:
- Pre-training with weights: Learn from better examples
- RLHF/DPO: Adapt behavior for specific use cases
Both are valuable, but fixing data quality at the source is more efficient than correction after the fact.
Evaluation and Validation
Claims about "good code" require careful evaluation:
- Clean held-out sets: Remove entire repos/packages from training
- Multiple metrics: Not just pass@k, but compilation rates, type checking success, test generation quality
- Diversity monitoring: Track entropy across languages/domains
- Ablations: Compare uniform vs. hard-filtered vs. weighted training
- RLHF reduction: Measure how much less post-training is needed
Document all decontamination. Report diversity metrics alongside quality improvements. Be honest about trade-offs.
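One simple way to monitor diversity, sketched below under the assumption that strata labels and final weights are available per example: report the weighted entropy over strata as an effective number of strata, and watch for it dropping well below the unweighted value.

```python
import numpy as np
from collections import Counter

def effective_coverage(strata, weights):
    """Weighted entropy over strata, reported as an effective number of strata."""
    mass = Counter()
    for s, w in zip(strata, weights):
        mass[s] += w
    p = np.array(list(mass.values()), dtype=float)
    p /= p.sum()
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))   # exp(entropy) = effective number of strata covered
```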
Risks and Mitigations
Distribution collapse: Stratified normalization and weight clipping prevent this, but monitor continuously.
Gaming the metrics: Rotate scoring functions, use ensembles, audit suspicious patterns. The uncertainty term naturally downweights unreliable signals.
Computational cost: Execution is expensive. Keep it sparse and targeted through active learning. Rely primarily on static signals.
Benchmark hacking: Strict decontamination at the repository level. Hold out entire problem sets, not just specific solutions.
Optional Specialization: Comprehensibility-First
One compelling instantiation optimizes for human comprehension over performance. Additional scoring signals:
- Comment quality and docstring completeness
- Variable name descriptiveness
- Function decomposition and single-responsibility
- Explicit error handling with informative messages
This trades execution speed for maintainability—appropriate when code serves as human-AI collaboration rather than production deployment. Evaluate on debugging time and modification success, not just execution metrics.
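A rough sketch of what the additional comprehensibility signals could look like for Python, complementing the static features earlier; the signal names and thresholds are illustrative choices, not a fixed recipe:

```python
import ast

def comprehensibility_signals(source: str) -> dict:
    """Illustrative readability signals beyond the basic static features."""
    out = {"short_name_ratio": 1.0, "explicit_raise_ratio": 0.0, "comment_density": 0.0}
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return out
    # Fraction of single-character identifiers (lower suggests more descriptive names).
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    if names:
        out["short_name_ratio"] = sum(len(n) <= 1 for n in names) / len(names)
    # Raises that name an exception (vs. bare re-raises), a rough proxy for explicit error handling.
    raises = [n for n in ast.walk(tree) if isinstance(n, ast.Raise)]
    if raises:
        out["explicit_raise_ratio"] = sum(r.exc is not None for r in raises) / len(raises)
    # Comment density, counted on raw lines since the AST drops comments.
    lines = source.splitlines()
    if lines:
        out["comment_density"] = sum(l.lstrip().startswith("#") for l in lines) / len(lines)
    return out
```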