(Don't) Forget About It: Toward Pareto-Improving GD

Machine learning models don't improve like traditional software. When we "update" a model, it sometimes begins to mishandle cases it previously solved, an outcome known as regression or "forgetting." This issue is well studied in continual learning, where models learn multiple tasks sequentially (French, 1999). Standard solutions involve rehearsal: periodically revisiting old examples to prevent their loss. Ordinary supervised learning on a single task, however, typically lacks such protections: models can "forget" patterns learned earlier in training, with no explicit mechanism to prevent the regression.

Our approach is to modify the training objective itself, adding explicit penalties that discourage forgetting:

  1. Forgetting-Penalized Training. This method adds a term to the loss whenever an example transitions from correct to incorrect classification. Instead of optimizing the standard loss alone, the training objective becomes total_loss = standard_loss + λ * forgetting_penalty, where the forgetting penalty raises the loss on examples that were previously classified correctly but are now wrong.
  2. Soft Pareto-Penalized Training. This approach penalizes any increase in an individual example's loss relative to a previous checkpoint, regardless of whether the prediction flips from correct to incorrect. The modified objective becomes total_loss = standard_loss + λ * Σ_i max(0, current_loss_i - previous_loss_i), penalizing any per-example deterioration. Both objectives are sketched in code below.
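
Here is a minimal PyTorch sketch of both objectives. The helper names and the checkpointed tensors (prev_correct, prev_loss) are illustrative assumptions, not code from the repository:

```python
import torch
import torch.nn.functional as F

def forgetting_penalized_loss(logits, labels, prev_correct, lam=1.0):
    """total_loss = standard_loss + lam * forgetting_penalty.

    prev_correct: bool tensor marking examples the model classified
    correctly at the previous checkpoint (assumed bookkeeping). The
    penalty up-weights examples that were correct before but are
    wrong now.
    """
    per_example = F.cross_entropy(logits, labels, reduction="none")
    now_wrong = logits.argmax(dim=1) != labels
    forgotten = prev_correct & now_wrong  # correct -> incorrect transitions
    penalty = (per_example * forgotten.float()).mean()
    return per_example.mean() + lam * penalty

def soft_pareto_loss(logits, labels, prev_loss, lam=1.0):
    """total_loss = standard_loss + lam * mean_i max(0, loss_i - prev_loss_i).

    prev_loss: per-example losses recorded at the previous checkpoint.
    Any per-example deterioration is penalized, whether or not the
    prediction flips.
    """
    per_example = F.cross_entropy(logits, labels, reduction="none")
    deterioration = torch.clamp(per_example - prev_loss, min=0.0)
    return per_example.mean() + lam * deterioration.mean()
```

In both cases λ (lam) controls how strongly regressions are punished relative to the standard loss, which is the knob behind the trade-off discussed below.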

We evaluated both approaches on the Adult income prediction task using identical model architectures and training procedures. We tracked two metrics: overall accuracy and "forgetting events"—instances where an example transitions from correct to incorrect classification during training.
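
Counting forgetting events only requires a per-epoch record of which examples were classified correctly. A hypothetical sketch (the post does not show its tracking code, and per-epoch checkpoints are an assumption):

```python
import numpy as np

def count_forgetting_events(correct_history: np.ndarray) -> int:
    """correct_history: (epochs, n_examples) boolean matrix, True where
    an example was classified correctly at that epoch. A forgetting
    event is any correct -> incorrect transition for an example."""
    transitions = correct_history[:-1] & ~correct_history[1:]
    return int(transitions.sum())
```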

Method               Total Forgetting   Final Train Acc   Final Val Acc
Baseline                         5668             0.794           0.788
Forgetting Penalty                122             0.759           0.760
Soft Pareto Penalty               290             0.786           0.783

The Forgetting-Penalized method reduced forgetting events to 122 (a roughly 98% reduction), but at the cost of lower final accuracy. The Soft Pareto method struck a middle ground: 290 forgetting events (a roughly 95% reduction) with minimal accuracy cost. These results illustrate the underlying trade-off: the more aggressively forgetting is suppressed, the more overall performance is sacrificed.

Conclusion

By explicitly tracking and penalizing regressions—whether through binary forgetting penalties or continuous loss increases—we can control the forgetting-performance trade-off rather than accepting whatever standard training produces. In domains where regression on known examples is costly, such targeted constraints offer a path toward more reliable learning.

More at: https://github.com/finite-sample/pareto-gd

See also: https://www.gojiberries.io/pareto-deployments/
