RL versus SFT for alignment

Determine whether reinforcement learning is more suitable than supervised fine-tuning for aligning large language models, even when supervised fine-tuning is supplied with high-quality demonstrations, and characterize how the two paradigms fundamentally differ in shaping model behavior.

Background

The paper contrasts supervised fine-tuning (SFT) and reinforcement learning (RL) in alignment pipelines, noting evidence that SFT tends to memorize its training demonstrations whereas RL can improve generalization to unseen variants. It also points out that SFT often stabilizes the output formats required for subsequent RL, suggesting the two play complementary roles.
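For concreteness, the structural difference between the two paradigms can be sketched with their standard training objectives (a minimal illustration in common RLHF-style notation, not taken from the paper): SFT performs maximum-likelihood imitation of a fixed demonstration set D, whereas RL-based alignment maximizes a reward r over samples drawn from the model itself, typically with a KL penalty toward a reference policy pi_ref.

\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\log \pi_\theta(y \mid x)\right]

\mathcal{J}_{\mathrm{RL}}(\theta) = \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\left[r(x,y)\right] - \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right)

The key difference is the sampling distribution: the SFT loss is evaluated on demonstrator-provided outputs y, while the RL objective is evaluated on outputs sampled from the model itself, which is one candidate explanation for the memorization-versus-generalization contrast noted above.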

Nevertheless, the authors emphasize that the theoretical distinction and comparative suitability remain unsettled, calling for rigorous analysis to determine when RL outperforms SFT for alignment and how each fundamentally shapes behavior.

References

A core open question is whether RL is more suitable for alignment than SFT, even when the latter is supplied with high-quality demonstrations, and how the two paradigms fundamentally differ in shaping model behavior.

Beyond the Black Box: Theory and Mechanism of Large Language Models (2601.02907, Gan et al., 6 Jan 2026), Section 5: Alignment Stage, subsubsection "Relationship between Training and Alignment" (Advanced Topics and Open Questions).