Latent-SFT: Reward-Based Fine-Tuning
- Latent-SFT is a fine-tuning framework that integrates latent reward modeling to optimize internal states, yielding enhanced generalization, sample efficiency, and convergence guarantees.
- It utilizes bilevel optimization and analyzes latent activations to shift from mere behavior cloning towards a reward-based learning paradigm with improved robustness.
- Applications span LLMs and MLLMs, employing techniques like latent token compression and parallel token forking to achieve efficient, scalable, and aligned reasoning.
Latent-SFT refers to a contemporary class of supervised fine-tuning methodologies for LLMs and multimodal LLMs that augment traditional imitation learning with latent (implicit or explicit) reward signal extraction, optimization over latent internal states, or specialized compression and parallelization schemes. The unifying principle across Latent-SFT works is that the standard fine-tuning paradigm is insufficient for robust alignment or scalable reasoning; instead, it is necessary to extract and manipulate additional latent variables (reward models, preference scores, attention activation patterns, global forking tokens, or compressed latent reasoning states) during or before policy optimization. Recent advances treat SFT not as a primitive behavior cloning procedure but as a latent-variable or reward-based optimization problem, yielding improved generalization, resilience to noisy demonstrations, increased sample efficiency, enhanced reasoning parallelism, and theoretical convergence guarantees.
1. Latent-SFT: Reward-based Reformulation of Supervised Fine-Tuning
The foundation of Latent-SFT is laid by reformulating supervised fine-tuning (SFT) in terms of latent reward learning. Conventionally, SFT maximizes likelihood on human demonstrations, typically cast as $\max_{\theta}\,\mathbb{E}_{(x,y)\sim \pi_{\mathrm{E}}}[\log \pi_{\theta}(y\mid x)]$, where $\pi_{\mathrm{E}}$ is the expert policy. Latent-SFT can instead be cast as a bilevel maximum likelihood inverse reinforcement learning (ML-IRL) optimization (Li et al., 28 May 2024):

$$\max_{\theta}\ \mathbb{E}_{(x,y)\sim \pi_{\mathrm{E}}}\big[\log \pi_{r_\theta}(y\mid x)\big] \quad \text{s.t.} \quad \pi_{r_\theta} = \arg\max_{\pi}\ \mathbb{E}_{x,\,y\sim \pi(\cdot\mid x)}\big[r_\theta(x,y)\big] + \mathcal{H}(\pi).$$

Here, the parameter vector $\theta$ jointly represents both the latent reward model $r_\theta$ and the induced policy $\pi_{r_\theta}$; the reward is extracted directly from the demonstration data, even in the absence of explicit preference labels. The learning signal is a contrast between the log-likelihood gradients for demonstration samples and those for synthetic samples generated by the current policy, embodying a latent reward learning process. The bilevel optimization admits a minimax reformulation, which is theoretically analyzed with convergence guarantees to stationary solutions at a rate governed by the sample count $N$. This explicit engagement with latent reward modeling directly improves generalization and model robustness, as demonstrated on multi-task language benchmarks, commonsense datasets, and code/math reasoning.
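The contrastive learning signal described above can be sketched as a single training step that raises the likelihood of demonstrations and lowers the likelihood of the policy's own samples. The snippet below is a minimal, illustrative sketch only; the `log_prob` and `sample` interfaces and the single-step structure are assumptions, not the authors' implementation.

```python
import torch

def latent_reward_sft_step(model, optimizer, prompts, demo_responses, beta=1.0):
    """One illustrative ML-IRL-style update: contrast the likelihood of expert
    demonstrations against the likelihood of the policy's own samples
    (interfaces `log_prob` and `sample` are assumed)."""
    # Log-likelihood of expert demonstrations under the current policy.
    demo_logp = model.log_prob(prompts, demo_responses).mean()

    # Synthetic samples from the current policy (no gradient through sampling).
    with torch.no_grad():
        synth_responses = model.sample(prompts)
    synth_logp = model.log_prob(prompts, synth_responses).mean()

    # The contrast acts as the latent reward signal: raise demonstration
    # likelihood, lower the likelihood of the policy's own samples.
    loss = -(demo_logp - beta * synth_logp)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```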
2. Connections to Implicit Reward Learning and Preference Optimization
Recent theoretical work reframes both SFT and preference-based post-training (e.g., Direct Preference Optimization, DPO) under a unified optimal policy–reward subspace (Wang et al., 15 Jun 2025). Standard SFT is shown to optimize an implicit reward function latent within expert demonstrations, particularly when viewed through the lens of f-divergence minimization:

$$\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big]\;-\;\beta\, D_f\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big).$$

The effective reward signal is:

$$r_\theta(x,y)\;=\;\beta\,\log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}.$$
This clarifies that SFT, often perceived as passive behavior cloning, is actually a latent reward learning process, albeit one in which the KL penalty term vanishes during optimization, which can let the policy drift unconstrained from the reference. Remedies such as a reduced learning rate, or the adoption of alternative SFT objectives that preserve the KL term gradient (e.g., Pearson $\chi^2$, squared Hellinger), yield marked performance improvements (up to 25% relative and 6% absolute win-rate gains post-DPO). The link between LLM logits and Q-functions, with the policy recovered as a softmax over token-level Q-values, $\pi_\theta(y_t \mid x, y_{<t}) = \mathrm{softmax}\big(Q_\theta(x, y_{<t}, \cdot)\big)_{y_t}$, exposes the underlying reinforcement learning structure in SFT and further supports the latent reward learning interpretation.
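As a concrete illustration of the implicit-reward view, the snippet below computes the standard DPO-style log-ratio reward from per-token log-probabilities under the fine-tuned policy and a frozen reference model; the tensor shapes and masking convention are assumptions for illustration.

```python
import torch

def implicit_reward(policy_logps: torch.Tensor,
                    ref_logps: torch.Tensor,
                    response_mask: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO-style implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x)).

    policy_logps, ref_logps: per-token log-probs of the response tokens under
    the fine-tuned policy and a frozen reference model, shape (batch, seq).
    response_mask: 1.0 for response tokens, 0.0 for prompt/padding tokens.
    """
    # Sum per-token log-probs over the response to get sequence log-likelihoods.
    policy_seq_logp = (policy_logps * response_mask).sum(dim=-1)
    ref_seq_logp = (ref_logps * response_mask).sum(dim=-1)
    # The log-ratio is the latent reward implicitly shaped during training.
    return beta * (policy_seq_logp - ref_seq_logp)
```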
3. Latent Activation and Internal Model Dynamics
Another dimension of latent SFT is the latent activation patterns of the internal architecture, notably attention head activation (Zhao et al., 24 Sep 2024). SFT selectively activates task-specific heads, quantified by gradient-based activation measures computed per attention head. Complex task activation patterns are empirically shown to be compositional (approximately linear combinations of basic task patterns), traceable by regression analyses and metrics such as the Gini coefficient and kurtosis. Small parameter changes during SFT yield significant shifts in overall head activation, which is leveraged for both interpretability and efficient task adaptation. Using activation pattern similarity for data selection facilitates targeted fine-tuning without massive data volume, boosting accuracy and convergence speed on new domain tasks.
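A minimal sketch of the gradient-based head-activation idea follows, assuming a HuggingFace-style causal LM whose attention output projections can be sliced per head; the module paths (`model.transformer.h`, `attn.out_proj`) and the mean-absolute-gradient scoring rule are illustrative assumptions rather than the paper's exact metric.

```python
import torch
import torch.nn.functional as F

def head_activation_scores(model, batch, num_heads: int) -> torch.Tensor:
    """Illustrative gradient-based activation score per attention head:
    backpropagate a task loss, then aggregate gradient magnitude over each
    head's column block of every layer's attention output projection."""
    model.zero_grad()
    logits = model(batch["input_ids"]).logits
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           batch["labels"].view(-1), ignore_index=-100)
    loss.backward()

    scores = []
    for layer in model.transformer.h:                        # assumed module path
        w_grad = layer.attn.out_proj.weight.grad.abs()       # (hidden, hidden), assumed name
        head_dim = w_grad.size(1) // num_heads
        # Mean |grad| over each head's block -> one score per head in this layer.
        per_head = w_grad.view(w_grad.size(0), num_heads, head_dim).mean(dim=(0, 2))
        scores.append(per_head)
    return torch.stack(scores)                               # (num_layers, num_heads)

def pattern_similarity(p1: torch.Tensor, p2: torch.Tensor) -> float:
    """Cosine similarity between two flattened activation patterns; usable as
    the criterion for activation-pattern-based data selection."""
    return F.cosine_similarity(p1.flatten(), p2.flatten(), dim=0).item()
```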
4. Latent SFT for Efficient Reasoning: Compression, Superposition, and Parallelism
Recent advances extend latent SFT principles into efficient reasoning frameworks, focusing on compressing explicit chain-of-thought into latent tokens and embedding reasoning as a superposition over vocabulary probabilities (Deng et al., 17 Oct 2025). In these approaches, latent token encoders compress explicit reasoning steps with attention masks (LTIM and LTSuM), producing latent tokens $z = E^{\top} p$, where $E$ is the vocabulary embedding matrix and $p$ is a softmax over vocabulary logits. Stage 1 aligns latent tokens to explicit chains; stage 2 trains the decoder to generate the latent tokens autonomously, using KL and cross-entropy losses. This construction enables compression (measured by effective compression rate) and global parallelism (captured by the Top-2 score). Empirically, Latent-SFT achieves state-of-the-art Pass@1 accuracy on GSM8k, Math500, and AIME24, matching explicit SFT with up to a fourfold reduction in reasoning chain length. Unlike methods operating in raw hidden state space, vocabulary-restricted latent tokens mitigate distributional misalignment and enable robust multi-path reasoning.
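The superposition construction can be sketched as a probability-weighted mixture of vocabulary embeddings; the temperature argument and tensor shapes below are illustrative assumptions.

```python
import torch

def latent_token(logits: torch.Tensor,
                 embedding: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """Build a latent token as a superposition over the vocabulary.

    logits:    (batch, vocab_size) vocabulary logits at the current step.
    embedding: (vocab_size, hidden) vocabulary embedding matrix E.
    Returns z = E^T p with p = softmax(logits / temperature): a probability-
    weighted mixture of token embeddings rather than one discrete token.
    """
    p = torch.softmax(logits / temperature, dim=-1)   # (batch, vocab_size)
    return p @ embedding                              # (batch, hidden)
```

Stage-1 alignment would then supervise such latent tokens against the explicit chain (e.g., via the KL and cross-entropy losses mentioned above).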
Set Supervised Fine-Tuning (SSFT) implements another paradigm of parallel latent reasoning (Jia et al., 1 Oct 2025), introducing reserved global forking tokens, each matched via bipartite Hungarian assignment to unique reasoning traces. This process ensures that parallel token-conditioned generations explore distinct solution paths deterministically, substantially boosting both Pass@1 and consistency metrics across competitive mathematical benchmarks.
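A hedged sketch of the assignment step is given below, assuming the cost of pairing forking token i with reasoning trace j is the trace's negative log-likelihood when generation is conditioned on that token; the exact matching cost used in SSFT may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_forking_tokens(cost: np.ndarray):
    """Bipartite (Hungarian) matching of reserved forking tokens to reasoning
    traces. cost[i, j] is an assumed pairing cost, e.g. the negative
    log-likelihood of trace j when conditioned on forking token i.
    Returns (token_index, trace_index) pairs minimizing the total cost."""
    token_idx, trace_idx = linear_sum_assignment(cost)
    return list(zip(token_idx.tolist(), trace_idx.tolist()))

# Usage sketch: 3 forking tokens, 3 sampled traces, random costs for illustration.
costs = np.random.rand(3, 3)
print(match_forking_tokens(costs))
```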
5. Latent Alignment in Multimodal LLMs
The concept of latent SFT extends to multimodal LLMs (MLLMs), where reward learning and preference alignment fundamentally reshape the vision encoder and lead to emergent latent representations. Reinforcement learning via Direct Preference Optimization (DPO), as opposed to SFT, robustly improves not only language output alignment but also the vision pathway in MLLMs (Song et al., 18 Oct 2025). The Preference-Instructed Vision OpTimization (PIVOT) framework applies RL-based fine-tuning to the vision backbone, yielding significantly sharper and better-localized gradient maps, improved segmentation performance, and higher ImageNet Top-1 accuracy, at a fraction of the computational cost of standard vision pretraining. These results highlight the importance of post-training alignment in evolving latent multimodal representations, suggesting synergistic scaling of vision and language modules for future MLLM development.
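For concreteness, the sketch below shows the key design choice of applying a standard DPO loss while keeping the vision backbone trainable; the attribute names `vision_encoder` and `language_model` and the learning rates are assumptions, not the PIVOT implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss on sequence log-likelihoods of the preferred (chosen)
    and dispreferred (rejected) responses."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(margin).mean()

def build_optimizer(mllm, lr_vision=1e-6, lr_lm=1e-6):
    """Key design choice: keep the vision backbone trainable so the preference
    gradient also reshapes visual representations (attribute names assumed)."""
    return torch.optim.AdamW([
        {"params": mllm.vision_encoder.parameters(), "lr": lr_vision},
        {"params": mllm.language_model.parameters(), "lr": lr_lm},
    ])
```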
6. Efficiency, Robustness, and Self-Tuning in Latent-SFT
Self-tuning latent SFT variants such as Online Supervised Finetuning (OSFT) (Li et al., 21 Oct 2025) further simplify training pipelines by using reward-free reinforcement of the model’s own best guesses. OSFT generates self-sampled responses at a low temperature and immediately fine-tunes on them at a higher temperature, amplifying the latent preference for correct reasoning paths without explicit reward signals. The resulting maximum-likelihood loss on self-generated responses exploits the model’s latent knowledge rather than relying on external supervision. Empirically, OSFT matches or outperforms RL-based GRPO on mathematical reasoning tasks with only one rollout per prompt, showing remarkable data and computational efficiency. Ablation studies confirm the necessity of decoupling the sampling and training temperatures; coupling them results in destructive or vanishing gradients.
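A minimal sketch of an OSFT-style update under stated assumptions (HuggingFace-style `generate`, training temperature applied by scaling the logits, prompt tokens left unmasked for brevity) follows; it is illustrative, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def osft_step(model, optimizer, prompt_ids,
              sample_temp=0.3, train_temp=1.0):
    """Illustrative OSFT-style update: sample the model's own best-guess
    response at a low temperature, then fine-tune on it with a decoupled,
    higher training temperature (applied here by scaling the logits)."""
    # 1) One reward-free rollout per prompt, sampled at low temperature.
    with torch.no_grad():
        rollout = model.generate(prompt_ids, do_sample=True,
                                 temperature=sample_temp, max_new_tokens=512)

    # 2) Next-token cross-entropy on the self-sampled sequence; prompt tokens
    #    are not masked here for brevity, though in practice one would mask them.
    logits = model(rollout).logits[:, :-1, :] / train_temp
    targets = rollout[:, 1:]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```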
7. Theoretical and Empirical Significance of Latent-SFT
The theoretical underpinnings of Latent-SFT articulate its advantages: convergence guarantees via minimax IRL reformulations, rigorous reward–policy mappings (including extensions of the RL Q-function interpretation to standard SFT), and robust empirical validation on diverse tasks. The class of latent SFT methods—reward model learning, implicit reward extraction, latent activation pattern supervision, parallel reasoning via forking tokens, latent token-based reasoning compression, and self-tuning via OSFT—consistently outperforms baseline SFT in terms of accuracy, generalization, and alignment with human preferences.
Latent-SFT represents a principled shift toward extracting and optimizing latent information within the SFT process for both alignment and advanced reasoning in LLMs and MLLMs. This framework, supported by theoretical advances and empirical benchmarking, is central to modern strategies for increasingly capable, robust, and aligned foundation models.