Using Reward Models to Select Pre-Training Data
Large code models currently treat all code as equal during pre-training, then use RLHF to fix the resulting problems. This is backwards. We should allocate pre-training compute toward good code from the start.
The approach is straightforward: score code quality with verifiable signals, then weight training examples in proportion to their scores, as in the sketch below.
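
As a rough illustration, here is a minimal Python sketch of the score-then-weight idea. It assumes each training example is a standalone Python snippet, and the signals used (parseability, docstrings, type annotations) are cheap placeholders for whatever verifiable checks or reward model you would actually run; the function names are hypothetical.

```python
# Minimal sketch: score snippets with verifiable signals, then sample
# training examples with probability proportional to their scores.
# The signals here are illustrative stand-ins, not a real reward model.
import ast
import random

def quality_score(code: str) -> float:
    """Return a score in [0, 1] from cheap, verifiable signals."""
    try:
        tree = ast.parse(code)  # signal 1: does the snippet even parse?
    except SyntaxError:
        return 0.0
    score = 0.5  # base credit for valid syntax

    # signal 2: at least one documented function or class
    if any(
        isinstance(node, (ast.FunctionDef, ast.ClassDef)) and ast.get_docstring(node)
        for node in ast.walk(tree)
    ):
        score += 0.25

    # signal 3: at least one type-annotated function
    if any(
        isinstance(node, ast.FunctionDef)
        and (node.returns or any(arg.annotation for arg in node.args.args))
        for node in ast.walk(tree)
    ):
        score += 0.25
    return score

def weighted_sample(examples: list[str], k: int) -> list[str]:
    """Sample k examples with probability proportional to their scores."""
    # Small epsilon keeps the sampler valid even if every score is zero.
    weights = [quality_score(ex) + 1e-3 for ex in examples]
    return random.choices(examples, weights=weights, k=k)
```

In practice the same scores could instead rescale per-example loss weights rather than the sampling distribution; the point is only that the quality signal enters before pre-training, not after it.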