UFT: Unifying Supervised and Reinforcement Fine-Tuning (2505.16984v1)

Published 22 May 2025 in cs.LG and cs.CL

Abstract: Post-training has demonstrated its importance in enhancing the reasoning capabilities of LLMs. The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small LLMs, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT's inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

Summary

  • The paper introduces UFT, a novel paradigm that unifies supervised fine-tuning and reinforcement fine-tuning using dynamically scheduled hints.
  • It leverages a hybrid objective that integrates log-likelihood maximization for hints with traditional RL rewards, improving sample efficiency for long-horizon tasks.
  • Experimental results across diverse model scales and tasks demonstrate UFT’s superiority over individual SFT and RFT methods in enhancing reasoning performance.

The paper "UFT: Unifying Supervised and Reinforcement Fine-Tuning" (Gourgout et al., 27 May 2024) introduces a novel post-training paradigm called Unified Fine-Tuning (UFT). It aims to combine the strengths of Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) for enhancing the reasoning capabilities of LLMs. SFT is efficient for smaller models but can lead to overfitting and limit reasoning in larger models. RFT offers better generalization but its success heavily depends on the base model's strength and can suffer from sparse reward issues. UFT integrates SFT and RFT into a single process, allowing models to explore solutions (like RFT) while incorporating informative supervision signals (like SFT).

Core Concepts of UFT

UFT has two key features:

  1. Exploration with Hints: To address the sparse reward problem in RFT, especially for complex reasoning tasks or weaker base models, UFT guides exploration using "hints." A hint is a partial solution to the problem. The input to the LLM during training is the problem description concatenated with this hint. This increases the likelihood of the model exploring correct reasoning paths.
    • Hint Length Sampling: The length of the hint is crucial. UFT employs a smoothed reduction of hint length over training.

      • A variable $p \in [0, 1]$ represents the proportion of the solution revealed as a hint.
      • $p$ decreases during training using a cosine annealing schedule:

        $$p^{(t)} = p^{\text{low}} + \frac{1}{2}\left(p^{\text{high}} - p^{\text{low}}\right)\left(1 + \cos\!\left(\frac{t+1}{T_{\text{hint}}}\pi\right)\right)$$

        where $t$ is the current training step, $T_{\text{hint}}$ is the total number of steps for which hints are used, and $p^{\text{low}}, p^{\text{high}}$ are the minimum and maximum proportions.

      • The actual hint length $l$ for a solution of total length $L$ is sampled from a Binomial distribution, $l \sim \text{Binomial}(L, p)$, so that $\mathbb{E}[l] = p \cdot L$. This provides a smoother transition from long to short (or zero) hints than staged reduction, and better aligns training with evaluation (where no hints are given) than uniform sampling of hint lengths.

  2. Objective Function Modification: UFT modifies the standard RFT objective to incorporate learning from hints. The standard RFT objective (e.g., in GRPO) focuses on maximizing expected reward while regularizing with KL divergence against a reference policy.

    $$\mathcal{J}^{\text{RFT}} = \mathbb{E}_{\pi}\left[\mathcal{J}^{\text{value}} - \beta \sum_{h} \text{KL}\big(\pi(\cdot \mid s_h)\,\|\,\pi^{\text{ref}}(\cdot \mid s_h)\big)\right]$$

    UFT extends this by adding a log-likelihood term for the hint portion of the trajectory. The UFT objective is:

    $$\mathcal{J}^{\text{UFT}} = \mathbb{E}_{\substack{l,\; s_0 = s_{\text{root}} \\ (s_h^*, a_h^*)_{h=0}^{l-1} \sim \pi^* \\ (s_h, a_h)_{h=l}^{H-1} \sim \pi}} \left[ \mathcal{J}^{\text{value}}\big((s_h, a_h)_{h=l}^{H-1}\big) - \beta \sum_{h=l}^{H-1} \text{KL}\big(\pi(\cdot \mid s_h)\,\|\,\pi^{\text{ref}}(\cdot \mid s_h)\big) + \beta \sum_{h=0}^{l-1} \log \pi(a_h^* \mid s_h^*) \right]$$

    where:

    • $(s_h^*, a_h^*)_{h=0}^{l-1}$ is the initial part of the trajectory, taken from the ground-truth solution (the hint).
    • The model generates the rest of the trajectory $(s_h, a_h)_{h=l}^{H-1}$ using its current policy $\pi$.
    • $\mathcal{J}^{\text{value}}$ is the RFT value term (e.g., from PPO/GRPO), applied to the model-generated part.
    • The first KL term regularizes the model's policy $\pi$ towards a reference policy $\pi^{\text{ref}}$ on the generated part.
    • The crucial addition is $\beta \sum_{h=0}^{l-1} \log \pi(a_h^* \mid s_h^*)$, which maximizes the log-likelihood of the hint tokens, effectively performing SFT on the hint.

    This hybrid objective allows the model to:

    • Maximize expected reward for the explored part of the solution.
    • Stay close to a reference policy (preventing policy collapse).
    • Memorize/learn from the provided hint (the SFT component).

    UFT thus smoothly transitions from being more SFT-like (when the hint length $l$ is large) to more RFT-like (as $l$ decreases to 0). A minimal sketch of this hybrid loss is given below.
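
The following is a minimal PyTorch sketch of how such a hybrid objective could be assembled from per-token log-probabilities, an advantage estimate, and masks separating hint tokens from model-generated tokens. It is an illustration under assumed tensor shapes, not the authors' implementation; the clipped surrogate and the particular KL estimator are common GRPO-style choices assumed here, and names such as uft_loss, hint_mask, and gen_mask are made up for the example.

import torch

def uft_loss(logp, logp_old, logp_ref, advantages, hint_mask, gen_mask,
             beta=0.001, clip_eps=0.2):
    """Illustrative UFT-style hybrid loss (not the paper's code).

    logp, logp_old, logp_ref: [B, T] per-token log-probs under the current,
        rollout-time, and reference policies.
    advantages: [B, T] advantage estimates (e.g., GRPO group-normalized rewards).
    hint_mask:  [B, T] float mask, 1 on hint tokens copied from the solution.
    gen_mask:   [B, T] float mask, 1 on tokens generated by the model.
    """
    # PPO/GRPO-style clipped surrogate on the model-generated portion.
    ratio = torch.exp(logp - logp_old)
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    value_term = (surrogate * gen_mask).sum() / gen_mask.sum().clamp(min=1)

    # Per-token KL penalty toward the reference policy on the generated portion.
    kl = torch.exp(logp_ref - logp) - (logp_ref - logp) - 1.0
    kl_term = (kl * gen_mask).sum() / gen_mask.sum().clamp(min=1)

    # SFT-like term: log-likelihood of the hint tokens under the current policy.
    hint_ll = (logp * hint_mask).sum() / hint_mask.sum().clamp(min=1)

    # Maximize (value - beta*KL + beta*hint log-likelihood); return a loss to minimize.
    return -(value_term - beta * kl_term + beta * hint_ll)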

Implementation (Algorithm 1)

The practical implementation of UFT can be outlined as follows:

Algorithm UFT:
  Initialize policy parameters θ_0 from reference policy θ_ref
  Hyperparameters: KL-coeff β, total steps T, steps with hint T_hint,
                   hint prob p_low, p_high, max hint length L_hint_max

  For t = 0 to T-1:
    Sample a batch of problems B = {(Question, Solution, Answer), ...}
    Initialize an empty data list D_processed

    For each (Q, S, A) in B:
      If t < T_hint:
        // Calculate current hint proportion using cosine annealing
        p_t = p_low + 0.5 * (p_high - p_low) * (1 + cos((t+1) / T_hint * PI))
        // Sample hint length l_t from Binomial(min(L_hint_max, len(S)), p_t)
        l_t = sample_binomial(min(L_hint_max, length(S)), p_t)
      Else:
        l_t = 0 // No hint after T_hint steps

      // Create prompt: Question + Hint (S[:l_t])
      prompt_with_hint = Q + S[:l_t]
      Add prompt_with_hint to D_processed

    // Run RL algorithm (e.g., GRPO) on D_processed
    // using the UFT objective function (Equation 6)
    Update policy θ using gradients from J_UFT
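
As a companion to the pseudocode above, here is a small NumPy sketch of the cosine-annealed hint proportion and Binomial hint-length sampling; the function names and the toy loop are illustrative, not taken from the paper's code.

import math
import numpy as np

rng = np.random.default_rng(0)

def hint_proportion(t, T_hint, p_low=0.05, p_high=0.95):
    """Cosine-annealed hint proportion p^(t); zero once hints are switched off."""
    if t >= T_hint:
        return 0.0
    return p_low + 0.5 * (p_high - p_low) * (1 + math.cos((t + 1) / T_hint * math.pi))

def sample_hint_length(t, solution_len, T_hint=300, L_hint_max=5):
    """Sample how many leading solution units (e.g., sentences) to reveal as a hint."""
    p_t = hint_proportion(t, T_hint)
    n = min(L_hint_max, solution_len)
    return int(rng.binomial(n, p_t))

# Hint lengths shrink (in expectation) as training progresses and vanish after T_hint.
for t in [0, 100, 250, 299, 400]:
    print(t, sample_hint_length(t, solution_len=8))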

Key Hyperparameters (from Table 2):

  • Training Batch Size: 256
  • Mini-batch Size: 64
  • Max Hint Length (L_hint_max): 5 (this appears to be a cap on the number of sentences/steps in a hint)
  • β (KL-penalty coefficient): 0.001
  • T (total number of training steps): 500
  • T_hint (number of steps with hints): 300
  • Number of Rollouts: 4 (per problem in RL)
  • p_low (lower bound on the hint proportion): 0.05
  • p_high (upper bound on the hint proportion): 0.95
  • SFT Epochs (for the SFT and SFT-RFT baselines): 5
  • Accuracy Reward: 1.0
  • Format Correctness Reward: 0.1
  • Incorrect Reward: 0.0 (see the reward-function sketch below)
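
Based on the three reward values above, a plausible rule-based outcome reward looks like the sketch below. The exact answer-extraction and format rules are assumptions (the <answer> tag convention and the helper extract_final_answer are hypothetical), so treat this as an illustration rather than the paper's reward function.

import re

def extract_final_answer(text):
    """Hypothetical helper: return the content of the last <answer>...</answer> tag."""
    matches = re.findall(r"<answer>(.*?)</answer>", text, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

def outcome_reward(response, ground_truth):
    """Rule-based reward using the values listed above (interpretation assumed):
    1.0 if the extracted answer matches the ground truth,
    0.1 if the response is well-formatted but the answer is wrong,
    0.0 if the response is malformed."""
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0   # incorrect / malformed output
    if answer == ground_truth.strip():
        return 1.0   # accuracy reward
    return 0.1       # format-correctness reward only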

Theoretical Justification

The paper provides theoretical analysis showing:

  1. RFT Lower Bound: Standard RFT has a sample complexity that is exponential in the reasoning length $H$ (i.e., the tree height): it needs $\Omega(B^H/K)$ samples to achieve a 50% pass@1 success rate, where $B$ is the branching factor and $K$ is the number of correct solutions. This highlights why RFT struggles with long-horizon reasoning tasks.

  2. UFT Upper Bound: UFT, due to its unified training with hints, improves this to a polynomial dependence on $H$. The paper shows UFT can achieve a 50% pass@1 success rate by exploring $O(B H^5 (\log B)^2 / \Delta^2)$ nodes, where $\Delta$ is the sub-optimality gap in rewards. This is an exponential improvement over RFT in terms of $H$; a rough numerical comparison follows below.
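
For a rough sense of scale (an illustrative calculation, not taken from the paper): with branching factor $B = 4$ and horizon $H = 20$, the exponential term is $B^H = 4^{20} \approx 1.1 \times 10^{12}$, whereas the polynomial term is $B H^5 (\log B)^2 = 4 \cdot 20^5 \cdot (\ln 4)^2 \approx 2.5 \times 10^{7}$, i.e., more than four orders of magnitude smaller even before accounting for $K$ and $\Delta$.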

Experimental Validation

UFT was evaluated on Qwen2.5 (0.5B, 1.5B, 3B) and Llama-3.2 (1B, 3B) models across tasks such as Countdown (arithmetic reasoning), MATH (level 3-5 math problems), and Logic (Knights-and-Knaves logic puzzles).

Key Findings:

  • Performance across Model Scales:

    • Small Models (e.g., Qwen2.5-0.5B): UFT's performance is comparable to or better than SFT and SFT-RFT (SFT followed by RFT). RFT alone performs poorly because small models struggle to explore correct solutions. UFT's hint mechanism and SFT-like objective help in memorizing solutions. For instance, on Logic, Qwen2.5-0.5B with RFT rarely explores the correct answer, while UFT finds it consistently (Figure 1).
    • Example (Table 3, Qwen2.5-0.5B Avg. Accuracy): Base: 1.55%, SFT: 4.46%, RFT: 3.25%, SFT-RFT: 7.28%, UFT: 9.45%.
    • Large Models (e.g., Qwen2.5-3B): UFT's performance is comparable to RFT, outperforming SFT and SFT-RFT, which tend to overfit. This shows UFT can also achieve good generalization.
    • Example (Table 3, Qwen2.5-3B Avg. Accuracy): Base: 17.13%, SFT: 15.25%, RFT: 32.15%, SFT-RFT: 17.34%, UFT: 30.93%.
  • Learning New Knowledge:

    UFT helps models acquire new knowledge, especially for models like Llama-3.2 that might have gained less reasoning-related knowledge during pretraining. Experiments show UFT significantly improves Llama-3.2's performance compared to RFT alone. For example, Llama-3.2-1B trained with UFT on Countdown outperformed Llama-3.2-3B trained with RFT (Figure 2). This suggests the SFT component of UFT is effective at knowledge injection.

  • Ablation on Hint Scheduler:

    The cosine annealing hint-length scheduler used by UFT, when combined with RFT (termed "RFT (cosine)"), outperforms RFT with uniform hint sampling ($\text{R}^3$). However, RFT (cosine) alone is still often worse than SFT-RFT or SFT, highlighting the importance of UFT's modified objective function in addition to the hint scheduling (Figure 3).

Practical Implications and Applications

  • Versatile Fine-Tuning: UFT provides a more adaptive fine-tuning strategy that can work well across different model sizes and task complexities without needing to choose strictly between SFT or RFT.
  • Improved Sample Efficiency for Complex Tasks: For tasks with long reasoning chains or sparse rewards, the hint mechanism can significantly accelerate learning and improve final performance.
  • Knowledge Injection: The SFT component of UFT allows for more direct injection of knowledge from correct solution traces, potentially raising the performance ceiling limited by the model's pre-trained knowledge.
  • Problem Solving: UFT is well-suited for tasks requiring multi-step reasoning, such as mathematical problem solving, logical deduction, and code generation, where partial solutions can guide the model.

Implementation Considerations

  • Dataset Requirements: UFT requires datasets containing (problem, solution) pairs to generate hints and for the log-likelihood part of the objective.
  • Hyperparameter Tuning: UFT introduces hyperparameters related to hint scheduling ($T_{\text{hint}}$, $p^{\text{low}}$, $p^{\text{high}}$, $L_{\text{hint\_max}}$) and the balance coefficient $\beta$ in the objective, which may require careful tuning.
  • Computational Cost: The training involves RL rollouts (potentially multiple per instance) and gradient computations for both RL and SFT-like objectives. The paper mentions the project cost "roughly $10,000 GPU hours," indicating it can be resource-intensive.
  • RL Algorithm Choice: The paper uses GRPO as the underlying RL algorithm. The UFT framework could potentially be adapted with other policy gradient algorithms.
  • Hint Granularity: The definition of a "hint" (e.g., number of tokens, sentences, or reasoning steps) may need to be adapted to the task; the paper uses sentence-level hints for math/logic problems (see the sketch below).
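
For instance, with sentence-level hints a training prompt could be assembled as in the following toy sketch; the example question, helper name, and formatting are made up for illustration.

def build_prompt_with_hint(question, solution_sentences, l):
    """Concatenate the question with the first l sentences of the solution as a hint."""
    if l == 0:
        return question
    hint = " ".join(solution_sentences[:l])
    return f"{question}\n{hint}"

# Toy Countdown-style example (illustrative only).
question = "Using 3, 5, and 8 exactly once, reach 16."
solution = ["First compute 8 + 5 = 13.", "Then add 3 to get 16.", "So 8 + 5 + 3 = 16."]
print(build_prompt_with_hint(question, solution, l=2))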

Limitations and Future Work

  • The current work primarily uses human-annotated solutions for hints. Future work could explore using solutions generated by larger models.
  • The underlying RL algorithm is GRPO. Exploring UFT with other advanced RFT algorithms (e.g., REINFORCE++, DAPO) is a potential direction.
  • The paper focuses on outcome-based rewards. Integrating process-based supervision within the UFT framework could be another avenue.

In summary, UFT presents a principled way to unify supervised learning and reinforcement learning for fine-tuning LLMs. By guiding exploration with dynamically scheduled hints and incorporating a log-likelihood objective for these hints, UFT aims to achieve both efficient knowledge acquisition and good generalization, outperforming SFT and RFT alone across various scenarios. Its theoretical backing for improved sample complexity further strengthens its potential for complex reasoning tasks.
