SFT-based Training: Enhancing LLM Reasoning

Updated 16 October 2025
  • SFT-based Training is the process of fine-tuning language models on expert-generated responses, emphasizing detailed, multi-step reasoning.
  • The highlighted approach (ThinkPO) employs a two-stage pipeline: initial SFT on long chain-of-thought responses, followed by preference optimization via pairwise ranking of long versus short responses.
  • Empirical results demonstrate an approximately 8.6% boost in math reasoning accuracy and a 25.9% increase in average output length, confirming its practical benefits.

Supervised Fine-Tuning (SFT) is a principal method for adapting LLMs to target tasks and aligning them with desired behaviors. SFT typically involves optimizing model parameters on datasets of demonstration responses, often generated by expert models or human annotators. While foundational to contemporary model alignment and capability enhancement, the increasing complexity of LLM tasks—particularly long chain-of-thought (CoT) reasoning—has exposed several limitations of classical SFT, including plateaued improvements and sub-optimal exploitation of fixed datasets.

1. Overview and Motivation

Supervised Fine-Tuning (SFT) denotes the process of optimizing an LLM on paired (input, output) data in a fully supervised manner, where the outputs are usually generated by larger or more capable models, or by human annotators. In reasoning-intensive domains, SFT typically uses long CoT responses as targets to instill systematic, multi-step reasoning ability in relatively smaller models. However, due to data collection cost constraints and diminishing returns after repeated fine-tuning on static datasets, SFT often exhibits a performance plateau or even a degradation in reasoning depth and quality (Yang et al., 17 Feb 2025).

The primary motivation addressed in the referenced work is to enhance LLMs' reasoning abilities using only existing long CoT SFT data—without incurring the cost and delay inherent in collecting new high-quality, detailed demonstrations.

2. Two-Stage Training Workflow: SFT Followed by Preference Optimization

The methodology, referred to as ThinkPO, is structured as a two-stage post-training pipeline:

Stage 1: Reasoning SFT

Here, the model is fine-tuned using a dataset $D_{\text{sft}} = \{(q, o_{\text{long}})\}$, where $q$ denotes the question and $o_{\text{long}}$ is a long, detailed CoT answer. This establishes a baseline model with enhanced long-form reasoning skills.
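
As a concrete illustration, a minimal Stage-1 update is sketched below. It assumes a Hugging Face-style causal LM (model, tokenizer) and an optimizer; these names are assumptions, not the paper's code, and the actual setup may differ (for instance, prompt tokens are typically masked out of the loss).

def sft_step(model, tokenizer, optimizer, q, o_long, device="cuda"):
    # Standard next-token cross-entropy on the concatenated
    # (question, long CoT answer) sequence.
    batch = tokenizer(q + "\n" + o_long, return_tensors="pt").to(device)
    # Hugging Face causal LMs shift labels internally, so the input ids
    # double as labels; in practice the question tokens are often masked.
    out = model(**batch, labels=batch["input_ids"])
    optimizer.zero_grad()
    out.loss.backward()
    optimizer.step()
    return out.loss.item()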

Stage 2: Direct Preference Optimization (DPO) with Short vs. Long CoT Pairing

A secondary dataset $D_{\text{dpo}} = \{(q, o_{\text{long}}, o_{\text{short}})\}$ is constructed by pairing each question with both the high-quality long CoT response and a short CoT response. The short responses are generated using a less capable model and filtered to ensure correctness, so that the distinguishing factor is the richness of the reasoning rather than basic accuracy.
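
A minimal sketch of this pairing step is shown below; weak_model.generate and the answer-checking routine is_correct are assumed helpers rather than names from the paper, and the exact filtering procedure may differ.

# Build D_dpo by pairing each existing long CoT answer with a short CoT
# answer sampled from a weaker model, keeping only short answers whose
# final result passes the correctness check.
D_dpo = []
for q, o_long in D_sft:
    o_short = weak_model.generate(q)       # cheaper, less capable generator
    if is_correct(q, o_short):             # correctness filter
        D_dpo.append((q, o_long, o_short))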

For each triple $(q, o_{\text{long}}, o_{\text{short}})$, the model is trained via a pairwise ranking loss (analogous to the DPO framework) to prefer longer, more sophisticated reasoning paths.
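
For reference, this corresponds to the standard DPO objective, where $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ is the frozen SFT model, $\sigma$ is the logistic sigmoid, and $\beta$ is the scaling parameter; the long response plays the role of the chosen sample and the short response the rejected one:

$$
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(q,\, o_{\text{long}},\, o_{\text{short}}) \sim D_{\text{dpo}}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(o_{\text{long}} \mid q)}{\pi_{\text{ref}}(o_{\text{long}} \mid q)} - \beta \log \frac{\pi_\theta(o_{\text{short}} \mid q)}{\pi_{\text{ref}}(o_{\text{short}} \mid q)} \right) \right]
$$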

A schematic, PyTorch-style sketch of the DPO-based update for each sample, where model(q, o) is assumed to return the $\beta$-scaled log-probability ratio between the current policy and the frozen reference model for response $o$:

import torch.nn.functional as F

for q, o_long, o_short in D_dpo:
    # Scores are assumed to be beta-scaled policy/reference log-prob ratios.
    score_long = model(q, o_long)    # chosen: long, detailed CoT
    score_short = model(q, o_short)  # rejected: short CoT
    # Pairwise ranking (DPO-style) loss: prefer the long CoT response.
    loss = -F.logsigmoid(score_long - score_short)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The hyperparameters (learning rate, batch size, and the DPO scaling parameter $\beta$) have a pronounced impact on stability and must be carefully tuned.
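
For orientation only, a configuration in the range commonly used for DPO-style training might look like the sketch below; these values are illustrative assumptions and are not the settings reported in the paper.

# Illustrative DPO hyperparameters (assumed, not the paper's settings).
dpo_config = {
    "learning_rate": 5e-7,  # small step size; DPO is sensitive to large updates
    "batch_size": 32,       # effective batch size (with gradient accumulation)
    "beta": 0.1,            # larger beta keeps the policy closer to the reference
    "num_epochs": 1,        # preference pairs are typically seen only once or twice
}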

3. Performance Impact of Post-SFT Preference Optimization

The empirical results strongly support the efficacy of post-SFT preference optimization. When ThinkPO is applied to SFT-tuned models:

  • Math reasoning accuracy increases by approximately 8.6%.
  • Average output length is increased by 25.9%.
  • On concrete benchmarks, the DeepSeek-R1-Distill-Qwen-7B model's performance on MATH500 improves from 87.4% → 91.2% after the ThinkPO stage.

Training curves reported in the paper show a distinct jump in both output length and reasoning-supportive lexical indicators (e.g., phrases like "wait," "let's see") following DPO preference optimization, surpassing SFT-only models that have reached saturation.

Table: Key Results Attributable to ThinkPO

Metric                               SFT Only     SFT + ThinkPO
Math Reasoning Accuracy (MATH500)    87.4%        91.2% (+3.8 pts)
Output Length                        baseline     +25.9%
Dataset Coverage                     saturated    increased

4. Core Principles and Practical Implementation Considerations

  • Data Reuse: ThinkPO avoids the need for new, expensive long CoT data by leveraging pairs of existing high-quality (long) and easily generated (short) CoT responses on the same prompt, thus maximizing the utility of available resources.
  • Pairwise Preference Training: The ranking-based loss encourages models to not just produce correct answers, but also to value elaborate, multi-step thought processes—addressing a limitation of classic SFT, which optimizes only for token-level likelihood.
  • Hyperparameter Sensitivity: Tuning is critical; an excessively large gap in response lengths or a poorly chosen DPO scaling parameter $\beta$ can destabilize optimization or lead to degenerate outputs biased towards verbosity without instructional value.
  • General Applicability: The approach is validated across models from 3B to 14B parameters and is demonstrated on openly available distilled models, suggesting broad scalability.

5. Applications, Trade-offs, and Future Directions

Applications:

  • Math and science domains where verbose, multi-step justification is essential.
  • Publicly available models where expanding to more detailed reasoning is desirable but new gold data is expensive or unavailable.
  • Situations where preference for structured, comprehensive answers is more important than succinctness.

Trade-offs:

  • Output length and informativeness must be balanced; indiscriminately amplifying output length can sacrifice conciseness without tangible reasoning benefits.
  • The generalization of the approach to domains outside math and reasoning-intensive tasks (e.g., general conversational agents) is yet to be established.

Future Research:

  • Development of improved algorithms for automatic selection of short/long response pairs to further refine the signal provided to the DPO stage.
  • Extension to multimodal or more open-ended generation settings.
  • Integration with other post-training alignment techniques (e.g., RLHF variants) to optimize for both reasoning quality and user satisfaction.

6. Limitations and Open Questions

  • Hyperparameter Sensitivity: Model improvements are susceptible to the choice of learning rate and DPO scaling factor; without careful tuning, training can be unstable.
  • Pairwise Sample Construction: If the difference between chosen and rejected samples is too extreme (particularly in length or detail), the optimization signal may become unreliable.
  • Beyond Reasoning Tasks: The technique is currently validated on mathematical reasoning tasks; its efficacy on tasks where longer outputs may not be strictly better is to be determined.

7. Significance in the SFT-Based Training Landscape

Post-SFT preference optimization as formalized in ThinkPO provides a targeted and lightweight strategy for overcoming the pronounced performance plateau afflicting LLM reasoning after standard SFT. By systematically favoring longer, detail-rich CoT responses using existing data, ThinkPO delivers quantifiable gains in reasoning correctness and explanation richness. This approach exemplifies how augmentations to traditional SFT—especially when framed within preference-based learning paradigms—can drive continued scaling of reasoning abilities in cost-sensitive and data-limited real-world settings (Yang et al., 17 Feb 2025).

References

  • Yang et al., 17 Feb 2025.
