
PeSFT: Perception-Aligned Supervised Fine-Tuning

Updated 4 June 2026
  • PeSFT is a supervised fine-tuning approach that explicitly balances perception and reasoning by reweighting loss contributions in multimodal models.
  • It employs dynamic gradient-norm balancing and feature-space alignment to overcome token imbalance and ensure robust perceptual processing.
  • Experimental results demonstrate significant improvements in perception accuracy and overall performance across visual reasoning, text-to-image synthesis, and embodied planning tasks.

Perception-Aligned Supervised Fine-Tuning (PeSFT) refers to a class of supervised fine-tuning methodologies designed to explicitly address and optimize the perceptual components of multimodal and vision-language models (VLMs). These approaches aim to balance the learning signal between perception (i.e., accurate extraction or alignment of input modality information such as image-to-text grounding, visual layout, or aesthetics) and downstream tasks such as reasoning, planning, or generation. PeSFT seeks to overcome systematic imbalances in standard supervised fine-tuning regimes that leave perception modules under-optimized, creating bottlenecks for overall system performance, especially in end-to-end visual reasoning, text-to-image alignment, and complex multimodal workflows (Wu et al., 28 May 2026, Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).

1. Motivation and Problem Formulation

A recurring empirical finding across recent large-scale vision-language and multimodal generation models is an asymmetry in post-training improvements: while supervised fine-tuning (SFT) or post-training significantly enhances reasoning capabilities, perception components (e.g., extracting or aligning low-level scene information or attributes) show limited gains. This discrepancy constrains overall task performance, notably for tasks requiring tightly coupled perception and reasoning or photorealistic, semantically aligned generation.

In the canonical chain-of-thought (CoT) framework for VLMs, the output sequence can be decomposed into two contiguous segments: a perception segment $p$ (grounded in visual/textual input) and a reasoning segment $r$ (goal-directed or logical deduction). Let $\pi_\theta$ be the trained auto-regressive model, $x$ the multimodal input, and $y = (p, r)$ the CoT sequence. Commonly measured metrics include:

  • Perception Accuracy: $a_p = \mathbb{1}[\hat{p} = p^*]$
  • Reasoning Accuracy: $a_r = \mathrm{Acc}(r' \mid p^*)$ (evaluated by sampling reasoning conditioned on oracle perception)
  • End-to-end Accuracy: $a = \mathrm{Acc}(\hat{r} \mid p^*)$

During conventional SFT, $a_r$ increases rapidly with training while $a_p$ remains near baseline, throttling end-to-end accuracy (Wu et al., 28 May 2026). This pattern is not confined to VLMs; it also affects text-to-image generators and embodied planning models (Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).
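As a minimal sketch of how these three metrics could be tallied over an evaluation set, the snippet below averages per-example correctness flags. The `EvalRecord` class and its field names are hypothetical, introduced only for illustration; how each flag is obtained depends on the benchmark and is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalRecord:
    # Field names are illustrative assumptions, not taken from the cited papers.
    perception_correct: bool        # predicted perception segment equals the reference p*
    oracle_reasoning_correct: bool  # answer correct when reasoning is sampled given p*
    end_to_end_correct: bool        # answer correct in the end-to-end evaluation


def mean(flags: List[bool]) -> float:
    """Fraction of True values in a non-empty list of flags."""
    return sum(flags) / len(flags)


def summarize(records: List[EvalRecord]) -> Dict[str, float]:
    """Return a_p, a_r and a as averages over the evaluation set."""
    return {
        "a_p": mean([r.perception_correct for r in records]),
        "a_r": mean([r.oracle_reasoning_correct for r in records]),
        "a": mean([r.end_to_end_correct for r in records]),
    }
```

Reporting the three numbers side by side is what exposes the asymmetry: a high `a_r` next to a low `a_p` points at perception as the bottleneck.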

2. Root Causes: Token Imbalance and Feature-Level Misalignment

Token Imbalance in Chain-of-Thought Supervision

Standard SFT computes cross-entropy over the full output sequence $y$:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\frac{1}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$

When parsing $y = (p, r)$, the perception segment $p$ typically constitutes only 2–3% of tokens, so its contribution to the total loss and gradient is disproportionately small. This starves perception of optimization signal, especially as $|r| \gg |p|$. Chain-of-thought SFT thus inherently favors reasoning at the expense of perception (Wu et al., 28 May 2026).
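A small worked example makes the imbalance concrete. The sketch below assumes a hypothetical sequence with 3 perception tokens and 97 reasoning tokens, each with the same per-token negative log-likelihood; the numbers are chosen for illustration and are not taken from the cited work.

```python
from typing import Sequence


def perception_loss_share(nll_p: Sequence[float], nll_r: Sequence[float]) -> float:
    """Fraction of the summed sequence NLL that comes from perception tokens."""
    total_p = sum(nll_p)
    total_r = sum(nll_r)
    return total_p / (total_p + total_r)


# Hypothetical sequence: 3 perception tokens and 97 reasoning tokens,
# every token with an NLL of 1.0 nat.
share = perception_loss_share([1.0] * 3, [1.0] * 97)
print(f"perception share of loss: {share:.2%}")  # prints "perception share of loss: 3.00%"
```

With equal per-token losses, the perception segment's share of the loss equals its share of the tokens, so a 3% token share leaves roughly 97% of the training signal to reasoning.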

Inadequate Feature-Level Alignment in Generative Models

For text-to-image diffusion models (MM-DiT, etc.), naïve SFT on curated datasets may increase photorealism but at the expense of text-image alignment or aesthetic controllability. This reflects a lack of dense, semantically grounded guidance at the feature level—pixel/latent-level targets risk overfitting and prior collapse, failing to resolve the trilemma between alignment, realism, and aesthetics (Wang et al., 20 May 2026).

This suggests that both gradient flow (token imbalance) and supervision granularity (missing vision-aligned feature guidance) are central to perception collapse in most current SFT pipelines.

3. Core Methodologies in PeSFT

Dynamic Loss Reweighting for Perception and Reasoning

A central innovation is the explicit rebalancing of loss contributions from perception and reasoning sub-tasks. Several variants are documented:

  • Fixed Reweighting: Introducing a static hyperparameter $\lambda$ to upweight the normalized perception loss:

$$\mathcal{L}_{\mathrm{RW}} = \lambda \, \frac{\mathcal{L}_p}{|p|} + \frac{\mathcal{L}_r}{|r|}$$

where $\mathcal{L}_p$, $\mathcal{L}_r$ are total negative log-likelihoods over each segment.

  • Dynamic Gradient-Norm Balancing (NGDiff): At each mini-batch, compute the $\ell_2$ norm of the perception- and reasoning-segment gradients, and set

$$\lambda_t = \frac{\lVert \nabla_\theta \mathcal{L}_r \rVert_2}{\lVert \nabla_\theta \mathcal{L}_p \rVert_2}$$

yielding an adaptive, tuning-free weighting that equalizes gradient impact and restores perception learning (Wu et al., 28 May 2026).
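The two weighting schemes can be sketched in plain Python on toy gradient vectors. This is an illustration under stated assumptions, not the cited implementation: the ratio-of-norms rule in `balanced_weight` is one simple way to equalize gradient magnitudes, and the exact NGDiff update may differ. All function names and the stabilizer `eps` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_norm(g: Sequence[float]) -> float:
    """Euclidean norm of a flattened gradient."""
    return math.sqrt(sum(v * v for v in g))


def fixed_reweighted_loss(nll_p: float, nll_r: float, n_p: int, n_r: int, lam: float) -> float:
    """lam * (per-token perception NLL) + (per-token reasoning NLL)."""
    return lam * nll_p / n_p + nll_r / n_r


def balanced_weight(grad_p: Sequence[float], grad_r: Sequence[float], eps: float = 1e-12) -> float:
    """Weight that scales the perception gradient to the reasoning gradient's L2 norm.

    Assumption: a plain ratio of norms; eps only guards against division by zero.
    """
    return l2_norm(grad_r) / (l2_norm(grad_p) + eps)


def combined_gradient(grad_p: Sequence[float], grad_r: Sequence[float]) -> Tuple[List[float], float]:
    """Return the balanced update direction and the weight used for this batch."""
    lam = balanced_weight(grad_p, grad_r)
    return [lam * gp + gr for gp, gr in zip(grad_p, grad_r)], lam


# Toy gradients: the perception gradient is 100x smaller than the reasoning gradient.
update, lam = combined_gradient([0.03, 0.04], [3.0, 4.0])
```

In this toy case the weight comes out near 100, so after scaling both segments contribute gradients of equal norm. In a real training loop the two gradients would come from separate backward passes over the perception and reasoning token masks, which roughly doubles the backward cost per step.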

Feature-Space Alignment in Diffusion Models

For text-to-image diffusion transformers, PeSFT integrates a lightweight supervision path using frozen vision foundation models (e.g., SigLIP 2):

  • The text prompt is encoded into multi-granularity features (global semantics, patch-level correspondences, structural/aesthetics).
  • A single shallow image-feature layer is projected to the text feature space via an MLP, and a cosine loss is computed:

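$$\mathcal{L}_{\mathrm{align}} = 1 - \cos\!\big(\mathrm{MLP}(h),\, f_{\mathrm{text}}\big)$$

where $h$ is the shallow-layer image feature and $f_{\mathrm{text}}$ is the text feature from the frozen encoder.

The total training loss is:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{align}} \, \mathcal{L}_{\mathrm{align}}$$

By restricting supervision to a shallow layer and using a low alignment weight $\lambda_{\mathrm{align}}$, the generator preserves prior diversity and avoids overfitting (Wang et al., 20 May 2026).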
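The alignment term can be illustrated with a dependency-free sketch. Everything here is an assumption made for illustration: the two-layer bias-free MLP, the helper names, and the default weight of 0.1 are hypothetical, and a real setup would obtain the text feature from the frozen encoder and the image feature from a shallow generator layer.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def mlp_project(h: Sequence[float], w1: Matrix, w2: Matrix) -> List[float]:
    """Two-layer MLP (no biases) mapping an image feature into the text feature space."""
    return matvec(w2, relu(matvec(w1, h)))


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def alignment_loss(h: Sequence[float], text_feat: Sequence[float], w1: Matrix, w2: Matrix) -> float:
    """Cosine loss: 0 when the projected image feature points along the text feature."""
    return 1.0 - cosine_similarity(mlp_project(h, w1, w2), text_feat)


def total_loss(diffusion_loss: float, align_loss: float, weight: float = 0.1) -> float:
    """Base generative loss plus a lightly weighted alignment term (weight is hypothetical)."""
    return diffusion_loss + weight * align_loss


identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss([1.0, 2.0], [2.0, 4.0], identity, identity)     # parallel features
orthogonal = alignment_loss([1.0, 0.0], [0.0, 1.0], identity, identity)  # orthogonal features
```

Because the cosine loss ignores feature magnitude, it constrains only the direction of the projected feature, which fits the goal of guiding semantics without forcing the generator to match targets exactly.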

Reinforced Supervised Fine-Tuning in Embodied Agents

Perception-aligned objectives in sequential multimodal planning add a reinforcement signal that rewards spatial alignment between generated images and predicted actions:

  • $\mathcal{L}_{\mathrm{RSFT}} = \mathcal{L}_{\mathrm{SFT}} + \mathcal{L}_{\mathrm{RL}}$
  • $\mathcal{L}_{\mathrm{RL}}$ is a policy-gradient loss, with the reward function $R$ quantifying the match in dynamic regions (IoU and MSE) between ground-truth and sampled images (Cai et al., 3 Nov 2025).
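A rough sketch of such a hybrid objective is shown below on flattened toy images. The source does not specify how IoU and MSE are combined into the reward, so the subtraction in `reward`, the REINFORCE-style term in `rsft_loss`, and all names, weights, and the baseline are assumptions for illustration only.

```python
from typing import Sequence


def iou(mask_a: Sequence[int], mask_b: Sequence[int]) -> float:
    """Intersection over union of two binary masks marking changed (dynamic) pixels."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return inter / union if union else 1.0


def masked_mse(img_a: Sequence[float], img_b: Sequence[float], mask: Sequence[int]) -> float:
    """Mean squared error restricted to pixels where the mask is set."""
    diffs = [(a - b) ** 2 for a, b, m in zip(img_a, img_b, mask) if m]
    return sum(diffs) / len(diffs) if diffs else 0.0


def reward(gt_img, gen_img, gt_mask, gen_mask, mse_weight: float = 1.0) -> float:
    """Higher when dynamic regions overlap and pixels inside the reference region agree."""
    return iou(gt_mask, gen_mask) - mse_weight * masked_mse(gt_img, gen_img, gt_mask)


def rsft_loss(sft_nll: float, sample_logprob: float, r: float,
              baseline: float = 0.0, rl_weight: float = 1.0) -> float:
    """Supervised NLL plus a REINFORCE-style term -(r - baseline) * log pi(sample)."""
    return sft_nll + rl_weight * (-(r - baseline) * sample_logprob)


# Toy 4-pixel example: the sampled image changes only one of the two reference pixels.
r = reward([1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [1, 1, 0, 0], [1, 0, 0, 0])
loss = rsft_loss(sft_nll=2.0, sample_logprob=-1.5, r=1.0, baseline=0.5)
```

Keeping the supervised term in the sum anchors the policy to the demonstration data while the reward term adjusts it toward spatially consistent image subgoals.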

4. Algorithmic Implementations

Table: Overview of Representative PeSFT Algorithms

| Context | Loss Balancing Method | Implementation Feature |
|---|---|---|
| Chain-of-thought SFT | Dynamic gradient-norm (NGDiff) | Adaptive $\lambda$ schedules based on instantaneous gradient norms (Wu et al., 28 May 2026) |
| Diffusion transformers | Feature-space alignment | Cross-modal cosine loss at shallow layer using SigLIP 2 features (Wang et al., 20 May 2026) |
| Multimodal planning | Reinforced SFT | Hybrid supervised + RL loss, reward for pixel-space perceptual alignment (Cai et al., 3 Nov 2025) |

In all cases, inference-time costs remain unchanged compared to standard SFT; PeSFT modules are only active during training.

5. Experimental Evidence and Quantitative Results

Vision-Language Reasoning (Synthetic Tasks)

On synthetic tasks (Graph Coloring, Sudoku), PeSFT (gradient-norm dynamic reweighting) yields major end-to-end accuracy improvements for both Qwen3-VL-2B-Instruct and InternVL3.5-2B, with gains up to 18.2 percentage points compared to standard SFT. Under PeSFT, perception accuracy ($a_p$) rises from <5% to 25–30%, with minimal trade-off to reasoning accuracy. The trade-off curve of perception vs. reasoning (obtained by varying the perception weight $\lambda$) is Pareto-like; moderate perception upweighting optimizes overall performance (Wu et al., 28 May 2026).

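End-to-end accuracy (%); GC = Graph Coloring, Loss RW = fixed loss reweighting.

| Task | Model | SFT | +Loss RW | +NGDiff (PeSFT) |
|---|---|---|---|---|
| GC | Qwen | 9.8 | 21.0 | 25.0 |
| GC | InternVL | 10.6 | 20.6 | 20.6 |
| Sudoku | Qwen | 6.4 | 18.4 | 22.0 |
| Sudoku | InternVL | 0.8 | 14.6 | 19.0 |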
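The headline gain can be checked directly against the table with a few lines of arithmetic. The values below are copied from the table above; the dictionary layout and variable names are only for illustration.

```python
# (task, model) -> (SFT, +Loss RW, +NGDiff) accuracies in percent, copied from the table.
results = {
    ("GC", "Qwen"): (9.8, 21.0, 25.0),
    ("GC", "InternVL"): (10.6, 20.6, 20.6),
    ("Sudoku", "Qwen"): (6.4, 18.4, 22.0),
    ("Sudoku", "InternVL"): (0.8, 14.6, 19.0),
}

# Gain of +NGDiff over plain SFT, in percentage points.
gains = {key: round(ngdiff - sft, 1) for key, (sft, _rw, ngdiff) in results.items()}
best = max(gains, key=gains.get)
print(best, gains[best])  # prints "('Sudoku', 'InternVL') 18.2"
```

The largest gain, 18.2 percentage points, is the Sudoku result for InternVL, matching the figure quoted in the text. The same table shows that dynamic balancing matches fixed reweighting on Graph Coloring with InternVL (20.6 in both columns) and exceeds it in the other three settings.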

Text-to-Image Generation

PeSFT, when applied to MM-DiT diffusion backbones on portrait generation, outperforms both the baseline and conventional SFT across photorealism (FID), text-image alignment (CLIPScore/HPS), and aesthetics (ImageReward), with zero inference overhead and preserved generative diversity. The improvements hold jointly rather than trading off: PeSFT reaches FID = 27.40 versus 27.94 for SFT (lower is better) and ImageReward = 21.57% versus 18.60% for SFT. The resulting Pareto frontier is strictly improved across all axes (Wang et al., 20 May 2026).

Embodied Multimodal Planning

Reinforced SFT (RSFT, a PeSFT instantiation for planners) enhances spatially aligned planning: Success Rate on Meeting Preparation increases from 62.2% (SFT) to 67.6% (RSFT). Notably, RL-only training collapses, whereas the hybrid PeSFT objective consistently enhances perceptual alignment in image-based subgoals (Cai et al., 3 Nov 2025).

6. Practical Recommendations and Limitations

  • Chains of thought that allocate few tokens to perception require explicit reweighting, either fixed or, preferably, dynamic (NGDiff), to mitigate perception collapse (Wu et al., 28 May 2026).
  • Feature-level alignment for generative models should be injected at shallow layers and with low weights to avoid overfitting and preserve open-domain generalization (Wang et al., 20 May 2026).
  • Reinforcement-based perception alignment terms must be paired with maximum-likelihood SFT for sample efficiency and robust convergence (Cai et al., 3 Nov 2025).
  • Inference performance and latency are unaffected by PeSFT, as all supervision heads and branches can be ablated post-training.
  • PeSFT applicability is general, requiring only the ability to segment out perception vs. reasoning or to couple vision-aligned features at train time.

Limitations include the need for an explicit perception/reasoning decomposition or access to high-quality feature extractors, challenges in highly complex or compositional generation scenarios, and, for RL-based PeSFT, the possible need for reliable reward proxies. Suggested future work includes extending feature-guided PeSFT to U-Net and latent score backbones, and developing lightweight explicit heads for more granular aesthetic or spatial supervision (Wang et al., 20 May 2026).

7. Connections and Implications for Future Research

PeSFT provides a principled, pragmatic framework for directly addressing long-standing perception-reasoning trade-offs in multimodal supervised fine-tuning. Its formal diagnosis of two mechanisms, token imbalance and insufficient feature-level supervision, enables broad application across reasoning-centric VLMs, text-conditioned image synthesis, and embodied agents.

A plausible implication is that future vision-language architectures may co-design data annotation, output format, and training objectives to maximize token efficiency for perception, or further leverage frozen, richly trained foundation models as cross-modal signal sources. The PeSFT paradigm is likely to serve as a baseline for post-training interventions in any system where perception and higher-order cognitive tasks must be co-optimized with minimal training complexity and maximum practical compatibility (Wu et al., 28 May 2026, Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).
