R-SFT: Reversal Supervised Fine-Tuning
- R-SFT is a supervised fine-tuning strategy that employs reverse reasoning data alongside forward reasoning to enhance bidirectional reasoning in large language models.
- It utilizes a dual-stage alignment pipeline with paired preference labeling and Direct Preference Optimization to maintain directional fidelity and reduce supervision conflicts.
- Empirical results show a 1.6%–6.8% improvement in downstream accuracy, highlighting its potential to mitigate forgetting and calibrate reasoning performance.
Reversal Supervised Fine-Tuning (R-SFT) refers to a supervised learning strategy designed to leverage reverse reasoning data in the post-training of LLMs. Unlike conventional unidirectional fine-tuning—where models are trained only on forward chain-of-thought (CoT) reasoning—R-SFT introduces and emphasizes the role of reverse reasoning instances, employing a dual-stage alignment pipeline. The approach aims to mitigate the pitfalls associated with standard supervised fine-tuning (SFT), maintain or enhance model performance on bidirectional reasoning tasks, and address issues of directional interference inherent in mixed-data regimes (Deng et al., 16 Sep 2025).
1. Construction of Reverse Reasoning Datasets
The foundation of R-SFT lies in the generation of reverse reasoning datasets. Starting from a curated forward reasoning set $\mathcal{D}_{\mathrm{fwd}} = \{(q_i, r_i)\}_{i=1}^{N}$, where $q_i$ denotes the question and $r_i$ the corresponding chain-of-thought and answer, an automated prompting technique (using models such as DeepSeek-R1) produces inverted examples $\mathcal{D}_{\mathrm{rev}} = \{(\tilde{q}_i, \tilde{r}_i)\}_{i=1}^{N}$. Here, each $\tilde{q}_i$ is engineered to elicit the reasoning path in reverse. These datasets can be utilized independently (dedicated forward or reverse SFT) or merged ($\mathcal{D}_{\mathrm{fwd}} \cup \mathcal{D}_{\mathrm{rev}}$) for mixed-data strategies. In all cases, fine-tuning targets the cross-entropy loss between the model output and the reference CoT+answer, with parameter-efficient methods (such as LoRA) applied for model adaptation (Deng et al., 16 Sep 2025).
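To make the pipeline concrete, the following is a minimal sketch assuming a Hugging Face causal LM: it builds a reverse prompt from a hypothetical inversion template and computes the standard SFT cross-entropy loss on a LoRA-adapted model. The base model name, LoRA hyperparameters, and prompt template are illustrative assumptions rather than the configuration reported by Deng et al.

```python
# Minimal sketch (not the authors' released code): reverse-example construction
# and LoRA-based SFT on the token-level cross-entropy objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative; the paper's base model may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

def make_reverse_prompt(question: str, answer: str) -> str:
    # Hypothetical inversion template; in practice a strong model (e.g., DeepSeek-R1)
    # is prompted to rewrite the problem so the reasoning runs from answer to premises.
    return (f"The final answer is '{answer}'. Reconstruct, step by step, the reasoning "
            f"that leads back to the original question: {question}")

def sft_loss(prompt: str, target: str) -> torch.Tensor:
    # Cross-entropy over the reference CoT+answer only; prompt tokens are masked with -100.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(target, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # supervise only the response tokens
    return model(input_ids=input_ids, labels=labels).loss
```

Under this sketch, dedicated forward SFT, dedicated reverse SFT, and the mixed-data regime differ only in which (prompt, target) pairs are passed to `sft_loss`.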
2. Bidirectional Reasoning Objectives and Directional Distinction
R-SFT is motivated by the observation that human reasoning and many desirable LLM capabilities are fundamentally bidirectional. By supplementing standard SFT with reverse reasoning data, the model can be exposed to both forward and backward inference patterns. However, empirical evidence demonstrates that naively mixing forward and reverse data during SFT introduces supervision conflicts. This conflict manifests as diminished performance on downstream metrics and erosion of the model's ability to maintain clear distinctions between reasoning directions, as quantified by reduced Average Log-Probability (ALP) margins. Paired preference labeling—where each forward instance is annotated with a preferred, direction-consistent output $y^{+}$ and a less-preferred, direction-mismatched output $y^{-}$, and vice versa for reverse instances—enables more precise assessment and calibration of directional alignment (Deng et al., 16 Sep 2025).
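A minimal sketch of paired preference labeling under an assumed record schema (the field names follow common DPO tooling conventions and are not taken from the paper): each prompt's direction-consistent response is labeled preferred and the direction-mismatched response less-preferred.

```python
# Minimal sketch: construct (prompt, chosen, rejected) preference pairs from
# index-aligned forward and reverse instances. Field names are illustrative.
def build_preference_pairs(forward_set, reverse_set):
    """forward_set / reverse_set: lists of {'question': str, 'response': str},
    aligned so that item i in both sets covers the same underlying problem."""
    pairs = []
    for fwd, rev in zip(forward_set, reverse_set):
        # Forward prompt: forward CoT is preferred (y+), reverse-style CoT rejected (y-).
        pairs.append({"prompt": fwd["question"],
                      "chosen": fwd["response"],
                      "rejected": rev["response"]})
        # Reverse prompt: the preference is flipped.
        pairs.append({"prompt": rev["question"],
                      "chosen": rev["response"],
                      "rejected": fwd["response"]})
    return pairs
```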
3. Quantitative Evaluation: ALP Metrics and NTK Inspiration
The behavioral impact of R-SFT and associated alignment techniques is measured using ALP metrics:

$$\mathrm{ALP}(y \mid x) = \frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right), \qquad \Delta_{\mathrm{ALP}} = \mathrm{ALP}\!\left(y^{+} \mid x\right) - \mathrm{ALP}\!\left(y^{-} \mid x\right),$$

where $\Delta_{\mathrm{ALP}}$ quantifies the margin between the preferred output $y^{+}$ and the less-preferred output $y^{-}$. This metric, alongside empirical Neural Tangent Kernel (NTK) proxies, enables rigorous tracking of how R-SFT preserves or degrades directional certainty and preference separation throughout fine-tuning and subsequent preference optimization stages (Deng et al., 16 Sep 2025).
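A minimal sketch of how the ALP margin can be computed for a single (prompt, $y^{+}$, $y^{-}$) triple; the Hugging Face-style scoring interface and per-token averaging are assumptions of this summary, not the paper's released implementation.

```python
# Minimal sketch (assumed interface): ALP and the ALP margin with a causal LM.
import torch
import torch.nn.functional as F

@torch.no_grad()
def avg_log_prob(model, tokenizer, prompt: str, response: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    resp_ids = tokenizer(response, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, resp_ids], dim=1)
    logits = model(input_ids).logits[:, :-1]                  # position t predicts token t+1
    token_logp = F.log_softmax(logits, dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    resp_logp = token_logp[:, prompt_ids.shape[1] - 1:]       # keep response tokens only
    return resp_logp.mean().item()                            # ALP(y | x), per-token average

def alp_margin(model, tokenizer, prompt, preferred, rejected) -> float:
    # Delta_ALP = ALP(y+ | x) - ALP(y- | x); larger means clearer directional separation.
    return (avg_log_prob(model, tokenizer, prompt, preferred)
            - avg_log_prob(model, tokenizer, prompt, rejected))
```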
4. Integration with Direct Preference Optimization (DPO)
Following SFT, Direct Preference Optimization (DPO) is employed to reinforce alignment with directional objectives. The DPO loss is formulated as

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-})}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)} - \beta \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)}\right)\right],$$

where $\beta$ is a temperature parameter and $\pi_{\mathrm{ref}}$ denotes the frozen reference policy. DPO compels the model to widen probability gaps between $y^{+}$ and $y^{-}$, enhancing preference calibration. Empirical evidence indicates that while DPO can partially recover directional distinction when mixed-data SFT has previously narrowed the ALP margin, it also suppresses the generation of alternative, potentially valuable reasoning paths. This introduces a trade-off between strict directional fidelity and output diversity (Deng et al., 16 Sep 2025).
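The sketch below evaluates this objective from precomputed sequence log-probabilities under the policy and the frozen reference model; the batching convention and the default $\beta = 0.1$ are illustrative choices, not values reported in the paper.

```python
# Minimal sketch of the DPO objective from precomputed sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_lp_pos, policy_lp_neg, ref_lp_pos, ref_lp_neg, beta=0.1):
    """Each argument: tensor of sequence log-probs log pi(y | x) for a batch.
    beta is the temperature controlling how sharply preference gaps are enforced."""
    pos_logratio = policy_lp_pos - ref_lp_pos  # log pi_theta(y+|x) - log pi_ref(y+|x)
    neg_logratio = policy_lp_neg - ref_lp_neg  # log pi_theta(y-|x) - log pi_ref(y-|x)
    # -log sigma(beta * margin) pushes the policy to widen the y+ / y- gap.
    return -F.logsigmoid(beta * (pos_logratio - neg_logratio)).mean()

# Dummy usage with illustrative log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-13.5]))
```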
5. Empirical Results and Impact of R-SFT
Dedicated R-SFT (fine-tuning solely on the reverse dataset $\mathcal{D}_{\mathrm{rev}}$) yields marked improvements: a reported 1.6%–6.8% increase in downstream accuracy relative to forward-only SFT on $\mathcal{D}_{\mathrm{fwd}}$ across benchmarks including AIME, MATH-500, and GPQA. Conversely, mixed-data SFT leads to reduced ALP gaps and inferior directional discrimination. Although post-SFT DPO offers some restitution, its effectiveness is limited if earlier supervision signals were contradictory. This suggests that direction-preserving fine-tuning is crucial for maximizing bidirectional reasoning capacity and overall alignment robustness (Deng et al., 16 Sep 2025).
6. Recommended Practices and Theoretical Considerations
Analysis corroborates that structured, direction-specific fine-tuning regimes outperform naive data mixing. Best practices involve separating forward and reverse SFT and, where bidirectional objectives are desired, combining this with DPO for additional alignment calibration. However, careful consideration must be given to initialization and the risk of over-suppressing non-preferred outputs. A plausible implication is that future alignment protocols should explicitly manage and safeguard directional distinctions rather than relying on blunt mixed-objective training (Deng et al., 16 Sep 2025). Theoretical framings of R-SFT benefit from NTK-inspired metrics and gradient-based evaluations, providing rigorous tools for diagnosing and addressing directional interference.
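As one purely illustrative gradient-based diagnostic in this spirit, the sketch below estimates directional interference as the cosine similarity between SFT gradients computed on a forward batch and on a reverse batch; strongly negative values would indicate conflicting supervision. This particular proxy is an assumption of this summary, not a metric defined in the cited work.

```python
# Illustrative NTK-style proxy: gradient agreement between forward and reverse batches.
import torch
import torch.nn.functional as F

def flat_grad(model, loss):
    # Flatten gradients of the trainable parameters into a single vector.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([(g if g is not None else torch.zeros_like(p)).reshape(-1)
                      for g, p in zip(grads, params)])

def directional_interference(model, forward_loss_fn, reverse_loss_fn):
    # Cosine similarity in [-1, 1]; values near -1 suggest conflicting directions.
    g_fwd = flat_grad(model, forward_loss_fn())
    g_rev = flat_grad(model, reverse_loss_fn())
    return F.cosine_similarity(g_fwd, g_rev, dim=0).item()
```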
7. Relation to Forgetting Mitigation and Joint Training Frameworks
R-SFT intersects with broader efforts to address post-training forgetting in LLMs. Standard sequential post-training—separating SFT and RLHF/DPO into distinct phases—induces oscillatory behavior and suboptimal trade-offs between objectives, as shown in (Fernando et al., 20 Oct 2024). R-SFT tackles forgetting by incorporating an explicit reversal phase that counteracts parameter drift away from RLHF optima post-SFT. In contrast, joint frameworks update parameters via a simultaneous, weighted gradient combination of the form

$$\theta_{t+1} = \theta_t - \eta \left(\lambda\, \nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta_t) + (1-\lambda)\, \nabla_\theta \mathcal{L}_{\mathrm{RLHF}}(\theta_t)\right),$$

where all learning signals are integrated at each step, providing theoretical convergence guarantees and stable trade-off maintenance. R-SFT's distinct two-stage design offers corrective post-processing, while joint frameworks embed trade-off balancing throughout optimization. Empirical studies corroborate that the joint approach yields superior overall performance and efficiency by directly preventing forgetting, whereas R-SFT focuses on targeted reversal after it occurs (Fernando et al., 20 Oct 2024).
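A minimal sketch of one joint update of this form, assuming a fixed mixing weight $\lambda$; the cited framework may weight or schedule the objectives differently.

```python
# Minimal sketch: one simultaneous, weighted-gradient update over both objectives.
def joint_step(model, optimizer, sft_loss_fn, rlhf_loss_fn, lam=0.5):
    optimizer.zero_grad()
    # Backpropagating the weighted sum yields lam * grad(L_SFT) + (1 - lam) * grad(L_RLHF).
    loss = lam * sft_loss_fn() + (1.0 - lam) * rlhf_loss_fn()
    loss.backward()
    optimizer.step()
    return loss.item()
```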
Summary Table: R-SFT vs. Joint Frameworks
| Method | Forgetting Mitigation | Directional Fidelity |
|---|---|---|
| R-SFT | Reversal post-SFT | High with direction-specific data |
| Joint Framework | Simultaneous gradients | Maintained throughout training |
This distinction highlights the complementary roles of R-SFT and joint post-training frameworks in modern LLM alignment strategies.
R-SFT constitutes a robust methodology for enhancing LLM reasoning performance in bidirectional tasks, contingent upon vigilant data management and alignment calibration. When integrated with direction-aware preference optimization and theoretically grounded metrics, R-SFT provides a rigorous foundation for future improvements in multilingual, multimodal, and complex reasoning-enabled LLMs.