SFT as Inverse Reinforcement Learning
- The paper reinterprets supervised fine-tuning as a special case of inverse reinforcement learning, bridging behavioral cloning and reward-driven policy optimization.
- It integrates forward KL divergence with temporal-difference regularization to recover dense, token-level rewards and improve diversity in model outputs.
- Empirical results show that IRL-based modifications enhance generation diversity and task performance while addressing limitations such as mode collapse.
Supervised fine-tuning (SFT) of autoregressive LLMs, historically viewed as behavioral cloning or straightforward imitation learning, admits a rigorous reinterpretation as a special case of inverse reinforcement learning (IRL). This equivalence brings new clarity to the objectives, algorithmic structure, and improvement pathways for aligning LLMs using demonstration data alone. By recasting SFT in the language of IRL, one can characterize its statistical and optimization properties, clarify its limitations, and expose novel mechanisms for reward recovery and policy improvement.
1. Sequential Decision-Making and MDP Formulation
SFT in LLMs can be embedded within a Markov Decision Process (MDP) defined as follows:
- States ($s_t$): Partial token sequences, $s_t = (y_1, \dots, y_{t-1})$.
- Actions ($a_t$): Next token $y_t$ drawn from the vocabulary $\mathcal{V}$.
- Transitions: Deterministic, $s_{t+1} = s_t \oplus a_t$ (appending the chosen token).
- Reward: Either concentrated at the terminal state (sequence-level) or distributed over tokens (token-level, as in dense reward recovery).
- Expert dataset ($\mathcal{D}$): Set of demonstrations from human or expert trajectories.
The language modeling policy $\pi_\theta(a_t \mid s_t)$ defines an occupancy measure $\rho_{\pi_\theta}$, and SFT aims to learn $\pi_\theta$ from trajectories sampled from the expert distribution.
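To make the embedding concrete, the following minimal Python sketch models states as immutable token prefixes with deterministic append transitions; the class and function names are illustrative, not from the cited papers.

```python
# Minimal sketch of the token-level MDP (names are illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    tokens: tuple  # partial sequence s_t = (y_1, ..., y_{t-1})

def step(state, action):
    """Deterministic transition: s_{t+1} = s_t with the token appended."""
    return State(state.tokens + (action,))

# An expert trajectory from D is just a token sequence; replaying it
# walks the MDP from the empty state to a terminal state.
demo = (12, 7, 99)  # toy expert demonstration
s = State(())
for a in demo:
    s = step(s, a)
assert s.tokens == demo
```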
2. SFT as Forward KL IRL and Objective Structure
Classical SFT minimizes the cross-entropy loss $$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{\tau \sim \rho_E}\Big[\sum_t \log \pi_\theta(a_t \mid s_t)\Big].$$ Up to an additive constant (the expert entropy), this is exactly the forward KL divergence between expert and model-induced trajectory distributions, $$D_{\mathrm{KL}}\big(\rho_E \,\|\, \rho_{\pi_\theta}\big) = \mathbb{E}_{\tau \sim \rho_E}\Big[\log \frac{\rho_E(\tau)}{\rho_{\pi_\theta}(\tau)}\Big],$$ where $\rho_{\pi_\theta}$ is the trajectory distribution under the model policy and $\rho_E$ under the expert. This f-divergence IRL view reveals that SFT is a trajectory-level distribution-matching algorithm. The "mass-covering" property is inherent: SFT penalizes policies that put zero probability on any expert trajectory, tending to disperse probability mass and sometimes diluting sharpness or unimodality in outputs (Sun, 2024).
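A toy numeric check of this identity (an illustration, not from the papers): for a single-step model over a three-token vocabulary, the cross-entropy differs from the forward KL only by the expert entropy, which is constant in the model parameters.

```python
import numpy as np

p_E = np.array([0.7, 0.2, 0.1])      # expert next-token distribution
p_theta = np.array([0.5, 0.3, 0.2])  # model next-token distribution

cross_entropy = -np.sum(p_E * np.log(p_theta))    # SFT objective
forward_kl = np.sum(p_E * np.log(p_E / p_theta))  # D_KL(p_E || p_theta)
entropy_E = -np.sum(p_E * np.log(p_E))            # constant in theta

# Minimizing cross-entropy in theta is minimizing the forward KL.
assert np.isclose(cross_entropy, forward_kl + entropy_E)
```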
3. Inverse Soft Q-Learning and Temporal-Difference Regularization
A more general IRL approach introduces maximum-entropy regularization and trades off the SFT objective against a temporally consistent value function. This is embodied in inverse soft Q-learning (IQ-Learn), yielding the joint loss $$\mathcal{L}_{\text{total}} = \underbrace{-\,\mathbb{E}_{(s,a) \sim \mathcal{D}}\big[\log \pi_\theta(a \mid s)\big]}_{\mathcal{L}_{\text{MLE}}} + \lambda\, \underbrace{\mathbb{E}_{(s,a,s') \sim \mathcal{D}}\Big[\big(V_\phi(s) + \log \pi_\theta(a \mid s) - \gamma V_\phi(s')\big)^2\Big]}_{\mathcal{L}_{\text{TD}}}.$$ Here, $\lambda$ interpolates between pure SFT ($\lambda = 0$) and IRL ($\lambda > 0$) (Wulfmeier et al., 2024). The temporal-difference (TD) regularizer enforces a soft Bellman consistency on value estimates, directly relating SFT to reinforcement-style training.
Pseudocode for offline IRL-SFT (from Wulfmeier et al., 2024):
```
Initialize θ, φ from a pretrained SFT checkpoint
repeat until convergence:
    Sample minibatch B = {(s, a, s′)} from SFT dataset D
    lp   = log π_θ(a | s)        # expert-token log-probability
    v_s  = V_φ(s)                # value at current state
    v_s′ = V_φ(s′)               # value at next state
    δ    = v_s + lp − γ·v_s′     # soft TD residual
    L_value = λ · mean(δ²)       # TD regularizer
    L_mle   = −mean(lp)          # standard SFT loss
    L_total = L_value + L_mle
    φ ← φ − η_φ ∇_φ L_total      # value update
    θ ← θ − η_θ ∇_θ L_total      # policy update
```
Empirically, increasing $\lambda$ enhances diversity (e.g., lowers self-BLEU), occasionally yielding higher task accuracy (e.g., GSM8K: MLE ≈ 27.8% vs. IQ-Learn with λ ≈ 0.1 at ≈ 31.2%) (Wulfmeier et al., 2024).
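For concreteness, here is a minimal PyTorch sketch of the per-minibatch loss from the pseudocode above; the separate value head and tensor shapes are assumptions about one plausible implementation, not the exact setup of Wulfmeier et al. (2024).

```python
import torch
import torch.nn.functional as F

def iq_sft_loss(logits, v_s, v_s_next, actions, lam=0.1, gamma=1.0):
    """logits: (B, V) next-token logits at states s.
       v_s, v_s_next: (B,) value-head outputs V(s), V(s').
       actions: (B,) expert next tokens a."""
    logp = F.log_softmax(logits, dim=-1).gather(
        1, actions.unsqueeze(1)).squeeze(1)   # log pi_theta(a | s)
    l_mle = -logp.mean()                      # standard SFT term
    delta = v_s + logp - gamma * v_s_next     # soft TD residual
    return l_mle + lam * (delta ** 2).mean()  # L_mle + lambda * L_value

# Shape check with random tensors standing in for model outputs.
B, V = 4, 32
loss = iq_sft_loss(torch.randn(B, V), torch.randn(B),
                   torch.randn(B), torch.randint(0, V, (B,)))
```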
4. Dense Reward Recovery and Policy Enhancement
Recent work proves that SFT is a special case of inverse soft Q-learning, implying that the logits of an SFT-trained policy encode an implicit dense, token-level reward via the soft Bellman/shaping identity $$\log \pi_\theta(a \mid s) = Q(s, a) - V(s) = r(s, a) + \gamma V(s') - V(s),$$ so the token log-probability is a potential-shaped reward. A practical, baseline-relative extraction is $$\hat{r}(s, a) = \log \pi_\theta(a \mid s) - \log \pi_{\text{ref}}(a \mid s),$$ where $\pi_{\text{ref}}$ is typically a halfway SFT checkpoint (Li et al., 2 Oct 2025). This enables granular credit assignment for each token and is leveraged by the "Dense-Path REINFORCE" approach, which further fine-tunes with policy gradients using $\hat{r}$ as the reward. Dense-Path REINFORCE outperforms standard SFT and prior self-imitation methods across multiple open-source backbones and instruction-following benchmarks (Li et al., 2 Oct 2025).
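A minimal sketch of this extraction, assuming Hugging Face-style causal LMs whose forward pass returns `.logits`; the function name and interface are illustrative rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dense_rewards(policy, ref_policy, input_ids):
    """Per-token reward r_hat(s_t, a_t) = log pi(a_t|s_t) - log pi_ref(a_t|s_t).
       ref_policy is, e.g., a halfway SFT checkpoint."""
    def token_logps(model):
        # Logits at position t predict token t+1; align accordingly.
        logits = model(input_ids).logits[:, :-1, :]
        return F.log_softmax(logits, dim=-1).gather(
            2, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)  # (B, T-1)
    return token_logps(policy) - token_logps(ref_policy)    # dense, token-level
```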
5. Classification+Regression Decomposition and Theoretical Guarantees
Viewing softmax IRL and SFT through the lens of classification plus regression reveals the following separation (Laan et al., 25 Sep 2025):
- Classification step: Fit the expert policy $\pi_E(a \mid s)$ via cross-entropy minimization.
- Regression step: Enforce soft Bellman consistency via iterative least-squares regressions to recover the value function $V$ and a normalized reward $r$.
The reward can be written as $$r(s, a) = \log \pi_E(a \mid s) + V(s) - \gamma V(s'),$$ with $V$ pinned down by a normalization measure $\mu$ that can be chosen to encode prior preferences or bias. PAC-style generalization bounds ensure that the errors in $V$ and $r$ can be controlled in terms of the estimation errors of the classification and regression steps.
This perspective situates SFT as the "classification half" of IRL. Pure SFT does not enforce global Bellman consistency, which explains deficiencies in long-horizon behavior such as exposure bias; incorporating regression stages after SFT injects long-horizon structure with a convex optimization at each step (Laan et al., 25 Sep 2025). A sketch of this regression step follows.
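This illustrative sketch assumes a tabular setting and one simple normalization (least-squares shrinking of the recovered reward toward zero); it follows Laan et al. (2025) in spirit only, and all names are placeholders.

```python
import numpy as np

def fit_value_and_reward(logp_E, s_idx, s_next_idx, n_states, gamma=1.0):
    """Fit tabular V by least squares so the recovered reward
       r = log pi_E + V[s] - gamma * V[s'] is as small as possible
       (one normalization choice), then read the reward off."""
    logp_E = np.asarray(logp_E, dtype=float)
    n = len(s_idx)
    A = np.zeros((n, n_states))
    A[np.arange(n), s_idx] += 1.0          # +V(s) coefficient
    A[np.arange(n), s_next_idx] -= gamma   # -gamma * V(s') coefficient
    V, *_ = np.linalg.lstsq(A, -logp_E, rcond=None)
    r = logp_E + V[s_idx] - gamma * V[s_next_idx]
    return V, r

# Toy usage: 3 states, transitions 0 -> 1 -> 2, with expert log-probs.
V, r = fit_value_and_reward([-0.1, -0.5], [0, 1], [1, 2], n_states=3)
```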
6. Empirical Properties, Practical Application, and Limitations
Empirical findings indicate:
- IRL-regularized SFT (IQ-Learn variants) yields more diverse generations at parity with, or slightly above, standard SFT in task accuracy (Wulfmeier et al., 2024).
- Dense-Path REINFORCE, leveraging dense rewards extracted from SFT, provides 4–10 percentage point gains in instruction-following win-rates and mitigates length bias pathologies (Li et al., 2 Oct 2025).
- Recovery of dense rewards enables improved downstream alignment, serving as bootstrapping signals for RLHF/RLAIF stages (Wulfmeier et al., 2024, Li et al., 2 Oct 2025).
Notable limitations:
- SFT’s mass-covering bias is statistically efficient for closed, well-covered tasks but suboptimal for multimodal or noisy demonstrations (Sun, 2024).
- IRL-based variants ameliorate mode collapse to a degree but introduce additional complexity and require judicious balancing of the regularization hyperparameters (λ, γ).
- The classification+regression decomposition exposes design choices (normalization measure μ, value function class), each affecting error bounds and computational cost (Laan et al., 25 Sep 2025).
7. Directions for Advanced Extensions and Ongoing Research
Emerging avenues include:
- Hybrid divergences: α-divergences or interpolations between forward and reverse KL to trade off mass-covering against mode-seeking behavior (Sun, 2024).
- Stabilized adversarial IRL/SFT: Developing more robust minimax optimization setups for large-scale LLMs.
- Bellman consistency correction post-SFT: Cheap rollout/regression phases after SFT to address long-horizon errors without full RL (Laan et al., 25 Sep 2025).
- Dense reward utilization for RLHF/RLAIF: Using SFT-recovered dense rewards as surrogate or bootstrapping metrics for preference-based alignment.
- Normalization and reweighting: Selection of μ in IRL objectives mirrors token or context weighting in SFT and can be leveraged to encode prior alignment constraints.
A plausible implication is that, as SFT becomes more tightly tied to IRL theory, advances in reward modeling, sequence-level divergence minimization, and policy improvement will increasingly draw on IRL-style value-consistency tools. Dense rewards recovered from SFT can be generalized to support finer credit assignment and more efficient data utilization even in settings where only demonstration data is available.
Table: Key Objectives and Their Properties
| Objective | Divergence/Dynamics | Bias Type |
|---|---|---|
| SFT (MLE) | Forward KL | Mass-covering |
| Reverse-KL matching | Reverse KL | Mode-seeking (peaked/mode-focused) |
| Jensen-Shannon matching | Jensen-Shannon | Intermediate trade-off |
| IQ-Learn (TD reg.) | Forward KL + TD loss | Interpolates with λ |
References:
- "Imitating Language via Scalable Inverse Reinforcement Learning" (Wulfmeier et al., 2024)
- "Supervised Fine-Tuning as Inverse Reinforcement Learning" (Sun, 2024)
- "Beyond Imitation: Recovering Dense Rewards from Demonstrations" (Li et al., 2 Oct 2025)
- "Inverse Reinforcement Learning Using Just Classification and a Few Regressions" (Laan et al., 25 Sep 2025)