- The paper demonstrates that RL-FT effectively restores OOD performance lost during SFT while maintaining strong in-distribution accuracy.
- It employs spectral diagnostics to reveal that directional shifts in singular vectors, not singular value changes, drive generalization dynamics.
- The study offers actionable guidelines for using low-rank and layerwise interventions to counteract catastrophic forgetting in LLMs.
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Introduction
This paper presents a rigorous comparative analysis of supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RL-FT) for LLMs, focusing on their effects on in-distribution (ID) and out-of-distribution (OOD) generalization. The authors employ controlled experiments on arithmetic reasoning tasks using Llama-3.2-11B and Qwen-2.5-7B, leveraging spectral diagnostics to dissect the parameter-level dynamics underlying catastrophic forgetting and generalization recovery. The paper challenges prevailing assumptions about RL-FT's role in LLM post-training, providing actionable insights for practitioners seeking to balance specialization and generalization.
Experimental Design and Key Findings
Controlled Evaluation Protocol
The authors utilize the GeneralPoints card-game benchmark and its OOD variant to probe arithmetic reasoning and generalization. SFT is performed on labeled data, followed by RL-FT using Proximal Policy Optimization (PPO) with scalar reward signals. Both ID and OOD performance are tracked across checkpoints, enabling fine-grained analysis of generalization dynamics.
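As a rough illustration of this protocol, the evaluation reduces to scoring every SFT and RL-FT checkpoint on both splits. The loader, evaluator, and split names below are hypothetical placeholders, not the authors' code:

```python
# Hypothetical sketch of the checkpoint-tracking protocol; `load_checkpoint`,
# `evaluate`, and the split names are placeholders, not the paper's codebase.
from typing import Callable, Dict, List

def track_generalization(
    checkpoint_paths: List[str],
    load_checkpoint: Callable[[str], object],
    evaluate: Callable[[object, str], float],
) -> List[Dict[str, object]]:
    """Record ID and OOD accuracy for every SFT / RL-FT checkpoint."""
    history = []
    for path in checkpoint_paths:
        model = load_checkpoint(path)
        history.append({
            "checkpoint": path,
            "id_acc": evaluate(model, "general_points_id"),    # in-distribution split
            "ood_acc": evaluate(model, "general_points_ood"),  # rule-shifted OOD variant
        })
    return history
```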
OOD Generalization Dynamics
- SFT-Induced Forgetting: OOD generalization peaks early during SFT and degrades with continued training, even as ID accuracy monotonically improves. For Llama-3.2-11B, OOD performance drops from 17.52% (early SFT) to 8.97% (full SFT); for Qwen-2.5-7B, from 19.67% to 17.09%.
- RL-FT Recovery: RL-FT restores up to 99% of the OOD performance lost during SFT for Qwen-2.5-7B and 85% for Llama-3.2-11B, while maintaining strong ID competence. However, if SFT is prolonged and induces severe overfitting, RL-FT cannot fully recover OOD generalization.
- Trade-off: RL-FT rebalances the model, recovering generalization at the cost of a slight reduction in peak ID accuracy.
Spectral Analysis of Weight Matrices
- Singular Value Stability: Singular values of key weight matrices (Q, K, V projections) remain nearly unchanged after both SFT and RL-FT. The Frobenius norm and overall "energy" of the transformations are preserved, indicating that fine-tuning primarily re-orients the transformation in weight space rather than altering its amplification characteristics.
- Directional Shifts Dominate: OOD performance degradation and recovery correlate almost entirely with rotations of singular vectors at the spectrum's extremes (largest and smallest singular values), not with changes in singular values. Principal angle analysis reveals substantial rotation of singular vectors during SFT and RL-FT, with angles approaching 90° for high-index vectors; a diagnostic sketch follows this list.
- Layer and Rank Sensitivity: Restoring the singular-vector directions associated with the top 20% of singular values, or those of the first 25% of layers, recovers 70–80% of a model’s OOD performance. Intermediate layers encode task-specific knowledge, while shallow and deep layers maintain general functional alignment.
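These diagnostics come down to two standard linear-algebra measurements on each weight matrix: the drift of its singular values and the principal angles between its singular subspaces across checkpoints. The NumPy sketch below (illustrative names, not the authors' implementation) computes both for one matrix:

```python
# Compare one weight matrix (e.g. a Q/K/V projection) before and after
# fine-tuning: singular-value drift plus principal angles between the
# top-k left singular subspaces. Illustrative sketch, not the paper's code.
import numpy as np

def spectral_drift(w_pre: np.ndarray, w_post: np.ndarray, k: int = 64):
    u_pre, s_pre, _ = np.linalg.svd(w_pre, full_matrices=False)
    u_post, s_post, _ = np.linalg.svd(w_post, full_matrices=False)

    # Near-zero relative change here reflects the reported singular value stability.
    sv_rel_change = np.abs(s_post - s_pre) / (np.abs(s_pre) + 1e-12)

    # Principal angles (Bjorck-Golub): cosines are the singular values of the
    # overlap between the two top-k singular bases; large angles signal the
    # directional drift that tracks OOD degradation and recovery.
    cosines = np.linalg.svd(u_pre[:, :k].T @ u_post[:, :k], compute_uv=False)
    angles_deg = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return sv_rel_change, angles_deg
```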
Causal Validation
- Directional Restoration: Replacing post-SFT singular vector directions with those from the pre-trained model (while keeping SFT-learned singular values) recovers OOD generalization, confirming that directional drift is the primary mechanism of forgetting (see the sketch after this list).
- RL-FT Feature Directions: Forcing RL-tuned models to adopt the geometric orientation of poorly-generalizing SFT models reverses the benefits of RL-FT, causing OOD accuracy to plummet.
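One way to read these restoration experiments is as an SVD recombination: keep the fine-tuned singular values but graft the reference model's singular-vector directions back in for the top-k components. The sketch below is a plausible rendering of that probe under this assumption, not the authors' exact implementation:

```python
# Directional-restoration probe (assumed form): reference directions for the
# top-k components, fine-tuned singular values throughout, fine-tuned
# directions for the remaining components. Not the authors' exact code.
import numpy as np

def restore_directions(w_ft: np.ndarray, w_ref: np.ndarray, k: int) -> np.ndarray:
    u_ft, s_ft, vt_ft = np.linalg.svd(w_ft, full_matrices=False)
    u_ref, _, vt_ref = np.linalg.svd(w_ref, full_matrices=False)

    # Top-k: reference (e.g. pre-trained) directions with fine-tuned singular values.
    top = (u_ref[:, :k] * s_ft[:k]) @ vt_ref[:k, :]
    # Tail: left as learned by fine-tuning.
    tail = (u_ft[:, k:] * s_ft[k:]) @ vt_ft[k:, :]
    return top + tail
```

Swapping the roles of the two checkpoints gives the reverse experiment: forcing an RL-tuned matrix onto the SFT model's orientation.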
Theoretical Implications
The paper provides a formal account of why fine-tuning prefers rotating singular vectors over modifying singular values. Rotational updates incur minimal parameter cost, since they leave the singular values (and hence the weight norm) unchanged, and are therefore favored by optimizers under typical objectives with weight decay or KL regularization. This orthogonal-gauge drift explains the empirical signature: stable singular values accompanied by large singular vector rotations.
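A small numeric illustration (not from the paper) of why such rotations are cheap under norm-based penalties: applying an orthogonal rotation to the singular basis leaves the spectrum, and hence the Frobenius norm that weight decay penalizes, unchanged, while still altering the layer's outputs:

```python
# Orthogonal-gauge drift in miniature: rotate the left singular basis of a
# random matrix and check what is (and is not) preserved. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
u, s, vt = np.linalg.svd(w)

theta = 0.3                      # small rotation mixing the top two directions
r = np.eye(8)
r[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
w_rot = (u @ r) @ np.diag(s) @ vt

x = rng.standard_normal(8)
print(np.allclose(np.linalg.svd(w_rot, compute_uv=False), s))  # True: spectrum unchanged
print(np.isclose(np.linalg.norm(w_rot), np.linalg.norm(w)))    # True: same weight-decay cost
print(np.allclose(w_rot @ x, w @ x))                           # False: the function changed
```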
Practical Implications
- RL-FT as Restoration, Not Creation: RL-FT does not endow fundamentally new capabilities but counteracts SFT-induced directional drift, reducing catastrophic forgetting.
- Low-Rank and Shallow Recovery: Practitioners can employ inexpensive recovery knobs, such as low-rank UV merging and shallow-layer resets, to restore OOD generalization before resorting to costly RL-FT; a minimal sketch of a layer reset follows this list.
- Checkpoint Selection: The strength of the SFT checkpoint determines the efficacy of RL-FT in rescuing OOD ability; highly overfitted SFT checkpoints are harder to recover.
- Layerwise and Rankwise Interventions: Targeted restoration of singular vector directions in specific layers or ranks can selectively recover generalization without sacrificing ID performance.
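As one concrete recovery knob, a shallow-layer reset can be implemented as a state-dict merge that copies the first fraction of transformer blocks back from the pre-trained checkpoint. The sketch below assumes Llama-style parameter names ("model.layers.N."); the function and key pattern are illustrative, not the authors' tooling:

```python
# Shallow-layer reset sketch: overwrite the first `fraction` of transformer
# blocks in an SFT state dict with the pre-trained weights. The key pattern
# ("model.layers.N.") is an assumption about the architecture's naming.
import torch

def reset_shallow_layers(sft_state: "dict[str, torch.Tensor]",
                         pretrained_state: "dict[str, torch.Tensor]",
                         num_layers: int,
                         fraction: float = 0.25) -> "dict[str, torch.Tensor]":
    cutoff = int(num_layers * fraction)
    merged = dict(sft_state)
    for name, tensor in pretrained_state.items():
        if ".layers." in name:
            layer_idx = int(name.split(".layers.")[1].split(".")[0])
            if layer_idx < cutoff:                 # only the shallow blocks
                merged[name] = tensor.clone()
    return merged
```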
Limitations and Open Questions
- Severe Overfitting: RL-FT cannot fully restore generalization when SFT induces severe overfitting and a marked distribution shift.
- Rotation Pattern Mechanisms: The exact nature of the shared rotation pattern between SFT and RL-FT remains unresolved; future work should investigate why both optimization regimes converge on similar rotation profiles.
- Task Generality: The findings are validated on arithmetic reasoning tasks; extension to other domains (e.g., code generation, advanced math) is necessary to establish universality.
Future Directions
The authors propose further experiments to isolate the role of head and tail singular values in generalization, perform RL-FT at multiple checkpoints, and apply spectral diagnostics to diverse tasks. Detailed weight trajectory analysis will elucidate which layers drive OOD improvements, guiding the development of more effective fine-tuning strategies.
Conclusion
This paper demonstrates that RL-FT is effective in counteracting moderate SFT-induced overfitting by restoring the geometric alignment of singular vectors, thereby recovering OOD generalization. Singular value stability and directional drift are the key mechanisms underlying catastrophic forgetting and its recovery. The results provide actionable guidance for practitioners and challenge the notion that RL-FT fundamentally enhances LLM capabilities; instead, it serves as a corrective mechanism. Future research should generalize these findings across tasks and further dissect the spectral dynamics of fine-tuning.