- The paper demonstrates that RL-FT effectively restores OOD performance lost during SFT while maintaining strong in-distribution accuracy.
- It employs spectral diagnostics to reveal that directional shifts in singular vectors, not singular value changes, drive generalization dynamics.
- The study offers actionable guidelines for using low-rank and layerwise interventions to counteract catastrophic forgetting in LLMs.
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Introduction
This paper presents a rigorous comparative analysis of supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RL-FT) for LLMs, focusing on their effects on in-distribution (ID) and out-of-distribution (OOD) generalization. The authors employ controlled experiments on arithmetic reasoning tasks using Llama-3.2-11B and Qwen-2.5-7B, leveraging spectral diagnostics to dissect the parameter-level dynamics underlying catastrophic forgetting and generalization recovery. The paper challenges prevailing assumptions about RL-FT's role in LLM post-training, providing actionable insights for practitioners seeking to balance specialization and generalization.
Experimental Design and Key Findings
Controlled Evaluation Protocol
The authors utilize the GeneralPoints card-game benchmark and its OOD variant to probe arithmetic reasoning and generalization. SFT is performed on labeled data, followed by RL-FT using Proximal Policy Optimization (PPO) with scalar reward signals. Both ID and OOD performance are tracked across checkpoints, enabling fine-grained analysis of generalization dynamics.
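As a rough illustration of this protocol, the evaluation reduces to scoring every SFT and RL-FT checkpoint on both splits. The loader, evaluator, and split names below are hypothetical placeholders, not the authors' code:

```python
# Hypothetical sketch of the checkpoint-tracking protocol; `load_checkpoint`,
# `evaluate`, and the split names are placeholders, not the paper's codebase.
from typing import Callable, Dict, List

def track_generalization(
    checkpoint_paths: List[str],
    load_checkpoint: Callable[[str], object],
    evaluate: Callable[[object, str], float],
) -> List[Dict[str, object]]:
    """Record ID and OOD accuracy for every SFT / RL-FT checkpoint."""
    history = []
    for path in checkpoint_paths:
        model = load_checkpoint(path)
        history.append({
            "checkpoint": path,
            "id_acc": evaluate(model, "general_points_id"),    # in-distribution split
            "ood_acc": evaluate(model, "general_points_ood"),  # rule-shifted OOD variant
        })
    return history
```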
OOD Generalization Dynamics
- SFT-Induced Forgetting: OOD generalization peaks early during SFT and degrades with continued training, even as ID accuracy monotonically improves. For Llama-3.2-11B, OOD performance drops from 17.52% (early SFT) to 8.97% (full SFT); for Qwen-2.5-7B, from 19.67% to 17.09%.
- RL-FT Recovery: RL-FT restores up to 99% of the OOD performance lost during SFT for Qwen-2.5-7B and 85% for Llama-3.2-11B, while maintaining strong ID competence. However, if SFT is prolonged and induces severe overfitting, RL-FT cannot fully recover OOD generalization.
- Trade-off: RL-FT rebalances the model, recovering generalization at the cost of a slight reduction in peak ID accuracy.
Spectral Analysis of Weight Matrices
- Singular Value Stability: Singular values of key weight matrices (Q, K, V projections) remain nearly unchanged after both SFT and RL-FT. The Frobenius norm and overall "energy" of the transformations are preserved, indicating that fine-tuning primarily re-orients the transformation in weight space rather than altering its amplification characteristics.
- Directional Shifts Dominate: OOD performance degradation and recovery correlate almost entirely with rotations of singular vectors at the spectrum's extremes (largest and smallest singular values), not with changes in singular values. Principal angle analysis reveals substantial rotation of singular vectors during SFT and RL-FT, with angles approaching 90° for high-index vectors; a diagnostic sketch follows this list.
- Layer and Rank Sensitivity: Restoring the singular-vector directions associated with the top 20% of singular values, or those of the first 25% of layers, recovers 70–80% of a model’s OOD performance. Intermediate layers encode task-specific knowledge, while shallow and deep layers maintain general functional alignment.
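These diagnostics come down to two standard linear-algebra measurements on each weight matrix: the drift of its singular values and the principal angles between its singular subspaces across checkpoints. The NumPy sketch below (illustrative names, not the authors' implementation) computes both for one matrix:

```python
# Compare one weight matrix (e.g. a Q/K/V projection) before and after
# fine-tuning: singular-value drift plus principal angles between the
# top-k left singular subspaces. Illustrative sketch, not the paper's code.
import numpy as np

def spectral_drift(w_pre: np.ndarray, w_post: np.ndarray, k: int = 64):
    u_pre, s_pre, _ = np.linalg.svd(w_pre, full_matrices=False)
    u_post, s_post, _ = np.linalg.svd(w_post, full_matrices=False)

    # Near-zero relative change here reflects the reported singular value stability.
    sv_rel_change = np.abs(s_post - s_pre) / (np.abs(s_pre) + 1e-12)

    # Principal angles (Bjorck-Golub): cosines are the singular values of the
    # overlap between the two top-k singular bases; large angles signal the
    # directional drift that tracks OOD degradation and recovery.
    cosines = np.linalg.svd(u_pre[:, :k].T @ u_post[:, :k], compute_uv=False)
    angles_deg = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return sv_rel_change, angles_deg
```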
Causal Validation
- Directional Restoration: Replacing post-SFT singular vector directions with those from the pre-trained model (while keeping SFT-learned singular values) recovers OOD generalization, confirming that directional drift is the primary mechanism of forgetting (see the sketch after this list).
- RL-FT Feature Directions: Forcing RL-tuned models to adopt the geometric orientation of poorly-generalizing SFT models reverses the benefits of RL-FT, causing OOD accuracy to plummet.
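One way to read these restoration experiments is as an SVD recombination: keep the fine-tuned singular values but graft the reference model's singular-vector directions back in for the top-k components. The sketch below is a plausible rendering of that probe under this assumption, not the authors' exact implementation:

```python
# Directional-restoration probe (assumed form): reference directions for the
# top-k components, fine-tuned singular values throughout, fine-tuned
# directions for the remaining components. Not the authors' exact code.
import numpy as np

def restore_directions(w_ft: np.ndarray, w_ref: np.ndarray, k: int) -> np.ndarray:
    u_ft, s_ft, vt_ft = np.linalg.svd(w_ft, full_matrices=False)
    u_ref, _, vt_ref = np.linalg.svd(w_ref, full_matrices=False)

    # Top-k: reference (e.g. pre-trained) directions with fine-tuned singular values.
    top = (u_ref[:, :k] * s_ft[:k]) @ vt_ref[:k, :]
    # Tail: left as learned by fine-tuning.
    tail = (u_ft[:, k:] * s_ft[k:]) @ vt_ft[k:, :]
    return top + tail
```

Swapping the roles of the two checkpoints gives the reverse experiment: forcing an RL-tuned matrix onto the SFT model's orientation.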
Theoretical Implications
The paper provides a formal account of why fine-tuning prefers rotating singular vectors over modifying singular values. Rotational updates incur minimal parameter cost, since they leave the singular values (and hence the weight norm) unchanged, and are therefore favored by optimizers under typical objectives with weight decay or KL regularization. This orthogonal-gauge drift explains the empirical signature: stable singular values accompanied by large singular vector rotations.
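A small numeric illustration (not from the paper) of why such rotations are cheap under norm-based penalties: applying an orthogonal rotation to the singular basis leaves the spectrum, and hence the Frobenius norm that weight decay penalizes, unchanged, while still altering the layer's outputs:

```python
# Orthogonal-gauge drift in miniature: rotate the left singular basis of a
# random matrix and check what is (and is not) preserved. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
u, s, vt = np.linalg.svd(w)

theta = 0.3                      # small rotation mixing the top two directions
r = np.eye(8)
r[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
w_rot = (u @ r) @ np.diag(s) @ vt

x = rng.standard_normal(8)
print(np.allclose(np.linalg.svd(w_rot, compute_uv=False), s))  # True: spectrum unchanged
print(np.isclose(np.linalg.norm(w_rot), np.linalg.norm(w)))    # True: same weight-decay cost
print(np.allclose(w_rot @ x, w @ x))                           # False: the function changed
```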
Practical Implications
- RL-FT as Restoration, Not Creation: RL-FT does not endow fundamentally new capabilities but counteracts SFT-induced directional drift, reducing catastrophic forgetting.
- Low-Rank and Shallow Recovery: Practitioners can employ inexpensive recovery knobs, such as low-rank UV merging and shallow-layer resets, to restore OOD generalization before resorting to costly RL-FT; a minimal sketch of a layer reset follows this list.
- Checkpoint Selection: The strength of the SFT checkpoint determines the efficacy of RL-FT in rescuing OOD ability; highly overfitted SFT checkpoints are harder to recover.
- Layerwise and Rankwise Interventions: Targeted restoration of singular vector directions in specific layers or ranks can selectively recover generalization without sacrificing ID performance.
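As one concrete recovery knob, a shallow-layer reset can be implemented as a state-dict merge that copies the first fraction of transformer blocks back from the pre-trained checkpoint. The sketch below assumes Llama-style parameter names ("model.layers.N."); the function and key pattern are illustrative, not the authors' tooling:

```python
# Shallow-layer reset sketch: overwrite the first `fraction` of transformer
# blocks in an SFT state dict with the pre-trained weights. The key pattern
# ("model.layers.N.") is an assumption about the architecture's naming.
import torch

def reset_shallow_layers(sft_state: "dict[str, torch.Tensor]",
                         pretrained_state: "dict[str, torch.Tensor]",
                         num_layers: int,
                         fraction: float = 0.25) -> "dict[str, torch.Tensor]":
    cutoff = int(num_layers * fraction)
    merged = dict(sft_state)
    for name, tensor in pretrained_state.items():
        if ".layers." in name:
            layer_idx = int(name.split(".layers.")[1].split(".")[0])
            if layer_idx < cutoff:                 # only the shallow blocks
                merged[name] = tensor.clone()
    return merged
```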
Limitations and Open Questions
- Severe Overfitting: RL-FT cannot fully restore generalization when SFT induces severe overfitting and a marked distribution shift.
- Rotation Pattern Mechanisms: The exact nature of the shared rotation pattern between SFT and RL-FT remains unresolved; future work should investigate why both optimization regimes converge on similar rotation profiles.
- Task Generality: The findings are validated on arithmetic reasoning tasks; extension to other domains (e.g., code generation, advanced math) is necessary to establish universality.
Future Directions
The authors propose further experiments to isolate the role of head and tail singular values in generalization, perform RL-FT at multiple checkpoints, and apply spectral diagnostics to diverse tasks. Detailed weight trajectory analysis will elucidate which layers drive OOD improvements, guiding the development of more effective fine-tuning strategies.
Conclusion
This paper demonstrates that RL-FT is effective in counteracting moderate SFT-induced overfitting by restoring the geometric alignment of singular vectors, thereby recovering OOD generalization. Singular value stability and directional drift are the key mechanisms underlying catastrophic forgetting and its recovery. The results provide actionable guidance for practitioners and challenge the notion that RL-FT fundamentally enhances LLM capabilities; instead, it serves as a corrective mechanism. Future research should generalize these findings across tasks and further dissect the spectral dynamics of fine-tuning.