Online Supervised Finetuning (OSFT)
- OSFT is a reward-free, iterative, self-supervised method in which an LLM is finetuned on training data that it generates and supervises itself.
- It employs decoupled sampling and training temperatures to amplify latent reasoning, achieving competitive results on math benchmarks.
- The approach demonstrates high training efficiency and scalability, offering a simpler alternative to reinforcement learning strategies.
Online Supervised Finetuning (OSFT) is an iterative, reward-free paradigm for tuning LLMs, particularly with the aim of improving mathematical and logical reasoning. In OSFT, a pretrained model generates outputs on prompts and is immediately finetuned using its own responses by applying cross-entropy loss, operating in a fully self-supervised and self-amplifying fashion. Unlike classical supervised finetuning (SFT) which uses fixed, externally provided data, or reinforcement learning approaches that optimize explicit reward signals, OSFT constructs its training data online and relies solely on its own generative behavior to drive improvements. This mechanism has demonstrated highly competitive downstream performance on mathematical reasoning benchmarks, often rivaling strong reinforcement learning with verifiable reward (RLVR) baselines such as GRPO, with substantial gains in training efficiency and simplicity (Li et al., 21 Oct 2025).
1. Core Definition and Conceptual Distinctions
OSFT operates as a “self-help” learning strategy for pretrained LLMs. The defining features are:
- Online data construction: Instead of training on a static dataset, the model generates new (prompt, response) pairs for each finetuning step, predominantly sampling its current preferred outputs at a low temperature to reflect high-confidence responses.
- Supervised update: Finetuning is performed using the standard cross-entropy loss on the self-generated data, with no explicit reward, advantage estimation, or desirability score.
- Contrast to standard SFT: Traditional SFT fixes the training targets (e.g., human-annotated or distilled data); OSFT continually evolves its own training signal as its capabilities shift.
- Contrast to RLVR/RLHF: RL-based methods, such as RLVR (including GRPO, PPO), sample multiple trajectories and use external correctness or human preference signals as rewards. OSFT is devoid of external supervision beyond the initial pretraining (Li et al., 21 Oct 2025).
The objectives of OSFT are to amplify a model’s latent, pretrained reasoning ability, provide efficient and scalable training regimes, and improve generalization on challenging structured tasks such as mathematical problem solving.
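To make these distinctions concrete, the following schematic objectives (notation chosen here for illustration, not quoted verbatim from the paper) contrast classical SFT on a fixed dataset, OSFT on self-generated samples drawn at temperature $\tau_s$ and trained at temperature $\tau_t$, and an RLVR-style update that weights self-generated samples by a verifiable-reward advantage $\hat{A}$:

$$\begin{aligned}
\mathcal{L}_{\mathrm{SFT}}(\theta)  &= -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{fixed}}}\big[\log \pi_\theta(y \mid x)\big],\\
\mathcal{L}_{\mathrm{OSFT}}(\theta) &= -\,\mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x;\,\tau_s)}\big[\log \pi_\theta(y \mid x;\,\tau_t)\big],\\
\mathcal{L}_{\mathrm{RLVR}}(\theta) &= -\,\mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\big[\hat{A}(x,y)\,\log \pi_\theta(y \mid x)\big].
\end{aligned}$$

Only the OSFT objective is both online (its data are regenerated from the current model at every step) and reward-free (no $\hat{A}$ or external label enters the loss).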
2. Algorithmic Protocol and Training Objective
The OSFT algorithm involves the following steps at each iteration:
- Prompt sampling: Draw a batch of prompts from a pool (e.g., DeepScaleR, OpenThoughts-Math).
- Response generation: For each prompt, sample $G$ responses from the current model at a sampling temperature $\tau_s$, focusing generation on the model's most likely candidates.
- Supervised update: Finetune the model for a fixed number of inner iterations on the newly constructed set of (prompt, response) pairs using cross-entropy loss at a training temperature $\tau_t$.
This process is summarized by the training objective

$$\mathcal{L}_{\mathrm{OSFT}}(\theta) = -\,\mathbb{E}_{x\sim\mathcal{D}}\;\mathbb{E}_{y\sim\pi_{\theta}(\cdot\mid x;\,\tau_s)}\Big[\sum_{t}\log \pi_\theta\big(y_t \mid x, y_{<t};\,\tau_t\big)\Big],$$

where the outer expectation samples prompts, the inner expectation samples responses from the current model at temperature $\tau_s$ (treated as fixed targets, with no gradient through the sampling), and the log-likelihood is evaluated at the training temperature $\tau_t$ (Li et al., 21 Oct 2025).
Crucial hyperparameters include the sampling temperature ($\tau_s$), the training temperature ($\tau_t$), the number of rollouts per prompt ($G$), the batch size, the learning rate, and the number of inner supervised iterations. Typical practice sets $G=1$ (a single rollout per prompt) with a single inner supervised iteration, yielding high data efficiency and reduced computational overhead; a minimal loop sketch follows the table below.
| Parameter | Typical value (Qwen2.5-Math) | Usage |
|---|---|---|
| $\tau_s$ | 0.6 (math), 0.9 (general) | sampling temperature |
| $\tau_t$ | 1.0 | training temperature |
| $G$ | 1 (default), 4 (ablation) | rollouts per prompt |
| batch size | 128 prompts | per optimization step |
| learning rate | 1e-7 (7B) | optimizer step size |
| epochs | 1 (∼300 steps) | full passes over dataset |
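As a concrete reference point, the following minimal sketch implements one OSFT step under the hyperparameters above using HuggingFace Transformers; the model identifier, generation length, and masking details are illustrative assumptions, not the authors' implementation.

```python
# Minimal OSFT step: sample at tau_s, train with cross-entropy at tau_t.
# Sketch only; model id, lengths, and masking choices are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Math-7B"          # assumed HF hub identifier
TAU_S, TAU_T, G, LR = 0.6, 1.0, 1, 1e-7    # values from the table above

tok = AutoTokenizer.from_pretrained(MODEL_ID)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
opt = torch.optim.AdamW(model.parameters(), lr=LR)

def osft_step(prompt: str) -> float:
    # 1) Online data construction: sample G responses for the prompt at tau_s.
    enc = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        gen = model.generate(**enc, do_sample=True, temperature=TAU_S,
                             num_return_sequences=G, max_new_tokens=512)
    # 2) Supervised update: cross-entropy on the self-generated tokens only,
    #    with logits scaled by the training temperature tau_t.
    labels = gen.clone()
    labels[:, :enc["input_ids"].shape[1]] = -100   # do not train on the prompt
    logits = model(gen).logits / TAU_T
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1), ignore_index=-100)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```

Iterating this step over fresh prompt batches, with responses regenerated from the current weights at every step, reproduces the online character of the method.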
3. Theoretical Properties and Training Dynamics
The theoretical foundation of OSFT relies on the interaction between sampling and training distributions:
- Score-function degeneracy: When sampling and training temperatures are equal ($\tau_s = \tau_t$), the expected gradient vanishes, since the expectation of the score function under the model's own distribution is zero. Decoupling $\tau_s$ and $\tau_t$ is therefore critical: $\tau_s \neq \tau_t$ is necessary for a nonvanishing update, and in practice $\tau_s < \tau_t$ (sharper sampling than training) drives the amplification.
- Margin amplification: The update direction is characterized by the difference between the model's sharper sampling distribution and its flatter training distribution: for a given context, the expected update to the next-token logits is proportional to $p_{\tau_s}(\cdot \mid x, y_{<t}) - p_{\tau_t}(\cdot \mid x, y_{<t})$, where $p_{\tau_s}$ is sharper and $p_{\tau_t}$ is flatter. The gradient increases the logit difference between high-confidence modes and other outputs, leading to larger probability gaps favoring the model's existing best responses (see the numeric sketch after this list).
- Nature of learning: OSFT does not import new semantic knowledge, but instead amplifies or sharpens reasoning trajectories already accessible to the model’s existing distribution. This can be viewed as a contrastive mechanism operating entirely on self-sampled chains (Li et al., 21 Oct 2025).
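A small numeric check of the margin-amplification view (illustrative values, not from the paper): for a single next-token distribution, the expected logit update under the objective above reduces to the difference between the sharp sampling distribution and the flat training distribution.

```python
# Expected OSFT update direction on next-token logits: p_{tau_s} - p_{tau_t}.
# Hypothetical logits; with tau_s == tau_t the difference is identically zero,
# which is exactly the score-function degeneracy noted above.
import torch

logits = torch.tensor([4.0, 2.0, 1.0, 0.5])      # assumed next-token logits
tau_s, tau_t = 0.6, 1.0                          # decoupled temperatures

p_sharp = torch.softmax(logits / tau_s, dim=-1)  # sampling distribution (sharper)
p_flat = torch.softmax(logits / tau_t, dim=-1)   # training distribution (flatter)

print(p_sharp - p_flat)  # positive for the top token only: its logit margin grows
```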
4. Empirical Findings and Comparative Evaluation
Performance of OSFT has been extensively evaluated on mathematical reasoning corpora. Key datasets include DeepScaleR for training and a battery of test sets: Math500, AMC, Minerva-Math, OlympiadBench, AIME24, and AIME25.
Main quantitative results (Qwen2.5-Math-7B, aggregated across benchmarks):
| Method | pass@1 (%) | pass@8 (%) | pass@avg (6 sets) |
|---|---|---|---|
| Base | 12.43 | 41.47 | 51.22 |
| GRPO | 33.45 | 57.65 | 63.22 |
| OSFT (G=1) | 35.97 | 55.61 | 60.14 |
- OSFT closely matches RLVR (GRPO) at small $k$ and significantly outperforms the base model; at larger $k$, RLVR benefits from its higher sample diversity.
- Multiple ablations show that increasing the number of rollouts ($G$) improves top-1 accuracy but yields diminishing returns for higher $G$, while OSFT retains a significant efficiency advantage, training markedly faster with $G=1$ than RLVR, which requires multiple rollouts per prompt.
- Lower perplexity after OSFT correlates with increased reasoning acuity, supporting the margin-widening argument.
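The pass@1 and pass@8 columns estimate, per problem, the probability that at least one of $k$ sampled attempts is correct. A minimal sketch of the standard unbiased pass@k estimator is given below; whether the paper's eval_passk.py uses exactly this estimator is an assumption.

```python
# Unbiased pass@k estimate from n sampled solutions of which c are correct.
# Assumption: the reported pass@1 / pass@8 follow this common definition.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=6, k=8))  # e.g. 16 samples, 6 correct -> pass@8 estimate
```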
Cross-architecture and cross-dataset comparisons indicate that OSFT consistently improves both math-specialized LLMs (Qwen2.5-Math) and general-purpose instruction-tuned models (Llama3.1-8B-Instruct), though models with weak latent reasoning benefit less.
5. Ablation Studies and Sensitivity Analyses
Ablation experiments systematically evaluated hyperparameter sensitivities:
- Sampling vs training temperature ($\tau_s$ vs $\tau_t$): Setting $\tau_s = \tau_t$ yields no observable gains; only decoupling, with $\tau_s < \tau_t$, achieves robust improvement.
- Number of rollouts ($G$): Higher $G$ benefits pass@1 marginally but brings little gain for pass@8, indicating the single-sample protocol is already highly efficient.
- Training-data source: Switching between DeepScaleR and OpenThoughts-Math induces small (−1% to −3%) variations in performance, indicating reasonable generalization across mathematically oriented corpora.
- Evaluation temperature: OSFT and GRPO share similar optimal temperature regimes for output diversity.
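The temperature ablation can be reproduced with a simple sweep over ($\tau_s$, $\tau_t$) pairs; the sketch below shells out to the train_osft.py entry point shown in the next section, and any flag values beyond those listed there are assumptions.

```python
# Hypothetical ablation sweep over coupled (tau_s == tau_t) and decoupled
# (tau_s < tau_t) settings, reusing the train_osft.py flags from Section 6.
import itertools
import subprocess

for tau_s, tau_t in itertools.product([0.6, 1.0], [0.6, 1.0]):
    subprocess.run(
        ["python", "train_osft.py",
         "--model=qwen2.5-math-7b", "--data=DeepScaleR.json",
         f"--tau_s={tau_s}", f"--tau_t={tau_t}",
         "--G=1", "--batch_size=128", "--lr=1e-7"],
        check=True,
    )
```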
6. Implementation Considerations and Practical Protocols
OSFT is typically implemented atop established deep learning libraries (e.g., VERL and HuggingFace Transformers). Practical training configurations leverage moderate hardware resources (e.g., 8× NVIDIA A800 GPUs), enabling a complete epoch (∼300 steps) on mathematical corpora in 10–12 hours for a 7B parameter model. The codebase provides command-line support for controlling temperatures, rollout count, batch size, and learning rate.
The workflow can be initiated with:
```bash
git clone https://github.com/ElementQi/OnlineSFT.git
cd OnlineSFT
pip install -r requirements.txt
python train_osft.py --model=qwen2.5-math-7b --data=DeepScaleR.json --tau_s=0.6 --tau_t=1.0 --G=1 --batch_size=128 --lr=1e-7
python eval_passk.py --model=checkpoint/ --benchmarks
```
7. Limitations, Interpretations, and Extensions
Key findings:
- OSFT establishes that explicit reward signals are not essential for achieving strong downstream gains in LLM reasoning, provided the model’s own sampling is leveraged with appropriate temperature decoupling.
- The benefit is proportional to the quality of the pretrained model’s latent reasoning: OSFT is most effective when the initial model already exhibits some proficiency in the target domain.
Limitations:
- Statistical significance of empirical gains has not been formally reported.
- With weakly pretrained models, OSFT produces negligible improvements.
- Learning was only conducted for a single epoch; longer training regimes and potential curriculum effects remain unexplored.
Potential extensions include hybrid reward/self-supervised protocols, application to non-mathematical reasoning domains (code, logic), adaptive scheduling of temperature or multiple inner SFT iterations, and further exploration of theoretical connections to contrastive methods such as Direct Preference Optimization.
OSFT offers an efficient, scalable, and reward-free methodology for self-amplification in LLMs, constituting a competitive alternative to reinforcement learning-based strategies for mathematical and logical reasoning enhancement (Li et al., 21 Oct 2025).