PeSFT: Perception-Aligned Supervised Fine-Tuning

Updated 4 June 2026

PeSFT is a supervised fine-tuning approach that explicitly balances perception and reasoning by reweighting loss contributions in multimodal models.
It employs dynamic gradient-norm balancing and feature-space alignment to overcome token imbalance and ensure robust perceptual processing.
Experimental results demonstrate significant improvements in perception accuracy and overall performance across visual reasoning, text-to-image synthesis, and embodied planning tasks.

Perception-Aligned Supervised Fine-Tuning (PeSFT) refers to a class of supervised fine-tuning methodologies designed to explicitly address and optimize the perceptual components of multimodal and vision-language models (VLMs). These approaches aim to balance the learning signal between perception (i.e., accurate extraction or alignment of input modality information such as image-to-text grounding, visual layout, or aesthetics) and downstream tasks such as reasoning, planning, or generation. PeSFT seeks to overcome systematic imbalances in standard supervised fine-tuning regimes that leave perception modules under-optimized and make them a bottleneck for overall system performance, especially in end-to-end visual reasoning, text-to-image alignment, and complex multimodal workflows (Wu et al., 28 May 2026, Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).

1. Motivation and Problem Formulation

A recurring empirical finding across recent large-scale vision-language and multimodal generation models is an asymmetry in post-training improvements: while supervised fine-tuning (SFT) or post-training significantly enhances reasoning capabilities, perception components (e.g., extracting or aligning low-level scene information or attributes) show limited gains. This discrepancy constrains overall task performance, notably for tasks requiring tightly coupled perception and reasoning or photorealistic, semantically aligned generation.

In the canonical chain-of-thought (CoT) framework for VLMs, the output sequence can be decomposed into two contiguous segments: a perception segment $p$ (grounded in visual/textual input) and a reasoning segment $r$ (goal-directed or logical deduction). Let $\pi_\theta$ be the trained auto-regressive model, $x$ the multimodal input, and $y=(p, r)$ the CoT sequence. Measured metrics commonly include:

Perception Accuracy: $a_p = \mathbb{1}[\hat{p} = p^*]$
Reasoning Accuracy: $a_r = \mathrm{Acc}(r' | p^*)$ (evaluated by sampling reasoning conditioned on oracle perception)
End-to-end Accuracy: $a = \mathrm{Acc}(\hat{r} \mid \hat{p})$ (evaluated on reasoning conditioned on the model's own predicted perception)
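As a rough illustration of how these three metrics relate, the sketch below computes them from a handful of evaluation records. The record fields and the toy data are hypothetical, not taken from the cited papers. The sketch also assumes that end-to-end accuracy is measured with the model reasoning from its own predicted perception.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    # All field names are illustrative assumptions.
    pred_p: str                 # perception segment produced by the model
    gold_p: str                 # reference perception segment p*
    correct_given_oracle: bool  # answer correct when reasoning is sampled from p*
    correct_end_to_end: bool    # answer correct when the model uses its own perception

def summarize(records):
    """Average the per-example indicators into a_p, a_r, and a."""
    n = len(records)
    return {
        "a_p": sum(r.pred_p == r.gold_p for r in records) / n,
        "a_r": sum(r.correct_given_oracle for r in records) / n,
        "a": sum(r.correct_end_to_end for r in records) / n,
    }

records = [
    EvalRecord("A-B,B-C", "A-B,B-C", True, True),
    EvalRecord("A-B", "A-B,B-C", True, False),
    EvalRecord("A-C", "A-B,B-C", True, False),
    EvalRecord("B-C", "A-B,B-C", False, False),
]
metrics = summarize(records)
print(metrics)
```

In this toy data, reasoning from oracle perception is correct three times out of four, yet end-to-end accuracy is capped by the single example whose perception was right. This is the bottleneck pattern the section describes.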

Observed during conventional SFT: $a_r$ increases rapidly with training, while $a_p$ remains near baseline, choking end-to-end accuracy (Wu et al., 28 May 2026). This is not confined to VLMs, but also affects text-to-image generators and embodied planning models (Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).

2. Root Causes: Token Imbalance and Feature-Level Misalignment

Token Imbalance in Chain-of-Thought Supervision

Standard SFT computes cross-entropy over the full output sequence $y$:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$

When parsing $y = (p, r)$, the perception segment $p$ typically constitutes only 2–3% of tokens, so its contribution to the total loss and gradient is disproportionately small. This starves perception of optimization signal, especially as $|r| \gg |p|$. Chain-of-thought SFT thus inherently favors reasoning at the expense of perception (Wu et al., 28 May 2026).
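The scale of the imbalance is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a summed token-level loss and, for simplicity, the same average negative log-likelihood per token in both segments. The token counts are made up for illustration.

```python
def perception_loss_share(n_perception, n_reasoning, nll_p=1.0, nll_r=1.0):
    """Fraction of the summed token-level loss contributed by perception tokens.

    nll_p and nll_r are assumed average per-token losses for each segment.
    """
    loss_p = n_perception * nll_p
    loss_r = n_reasoning * nll_r
    return loss_p / (loss_p + loss_r)

# Hypothetical chain of thought: 10 perception tokens, 390 reasoning tokens.
share = perception_loss_share(n_perception=10, n_reasoning=390)
print(f"{share:.3f}")  # 0.025
```

With these assumed counts, perception supplies 2.5% of the loss, in line with the 2–3% token share quoted above.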

Inadequate Feature-Level Alignment in Generative Models

For text-to-image diffusion models (MM-DiT, etc.), naïve SFT on curated datasets may increase photorealism but at the expense of text-image alignment or aesthetic controllability. This reflects a lack of dense, semantically grounded guidance at the feature level—pixel/latent-level targets risk overfitting and prior collapse, failing to resolve the trilemma between alignment, realism, and aesthetics (Wang et al., 20 May 2026).

This suggests that both gradient flow (token imbalance) and supervision granularity (missing vision-aligned feature guidance) are central to perception collapse in most current SFT pipelines.

3. Core Methodologies in PeSFT

Dynamic Loss Reweighting for Perception and Reasoning

A central innovation is the explicit rebalancing of loss contributions from perception and reasoning sub-tasks. Several variants are documented:

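Fixed Reweighting: Introducing a static hyperparameter $\lambda$ to upweight the normalized perception loss:

$$\mathcal{L}_{\mathrm{RW}}(\theta) = \lambda \, \frac{\mathcal{L}_p(\theta)}{|p|} + \frac{\mathcal{L}_r(\theta)}{|r|}$$

where $\mathcal{L}_p$, $\mathcal{L}_r$ are total negative log-likelihoods over each segment and $|p|$, $|r|$ are the segment lengths in tokens.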
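One way to read fixed reweighting is sketched below. It assumes that "normalized" means each segment's loss is averaged over its own tokens, and it uses an arbitrary weight of 4.0. Both choices are assumptions for illustration, not settings reported in the cited work.

```python
def reweighted_loss(token_nll, is_perception, lam=4.0):
    """lam * (mean perception NLL) + (mean reasoning NLL).

    token_nll: per-token negative log-likelihoods for one sequence.
    is_perception: booleans marking which tokens belong to the perception segment.
    lam: static perception weight (hypothetical default).
    """
    p = [nll for nll, m in zip(token_nll, is_perception) if m]
    r = [nll for nll, m in zip(token_nll, is_perception) if not m]
    if not p or not r:
        raise ValueError("both segments must be non-empty")
    return lam * sum(p) / len(p) + sum(r) / len(r)

# Toy sequence: 2 poorly fit perception tokens, 8 well fit reasoning tokens.
token_nll = [2.0, 2.0] + [0.5] * 8
mask = [True, True] + [False] * 8

plain = sum(token_nll) / len(token_nll)          # standard token-averaged loss
rw = reweighted_loss(token_nll, mask, lam=4.0)   # perception-upweighted loss
print(plain, rw)  # 0.8 8.5
```

In the plain average the two perception tokens account for half of a small loss. After reweighting they dominate it, which is the intended shift in optimization pressure.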

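Dynamic Gradient-Norm Balancing (NGDiff): At each mini-batch, compute the $\ell_2$ norm of the perception- and reasoning-segment gradients, and set the perception weight at step $t$ to

$$\lambda_t = \frac{\lVert \nabla_\theta \mathcal{L}_r \rVert_2}{\lVert \nabla_\theta \mathcal{L}_p \rVert_2}$$

where $\mathcal{L}_p$ and $\mathcal{L}_r$ are the perception- and reasoning-segment losses, yielding an adaptive, tuning-free weighting that equalizes gradient impact and restores perception learning (Wu et al., 28 May 2026).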
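The balancing step can be sketched on plain gradient vectors. This is a minimal illustration of equalizing gradient norms, not the authors' NGDiff implementation. It assumes the two segment gradients have already been computed separately over the same parameters, and the `eps` guard is an added assumption.

```python
import math

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

def balanced_weight(grad_p, grad_r, eps=1e-12):
    """Perception weight that scales grad_p to the same L2 norm as grad_r."""
    return l2_norm(grad_r) / (l2_norm(grad_p) + eps)

def combine(grad_p, grad_r):
    """Return the combined update direction and the weight used for this batch."""
    lam = balanced_weight(grad_p, grad_r)
    return [lam * gp + gr for gp, gr in zip(grad_p, grad_r)], lam

# Toy gradients: the perception gradient is 100x smaller than the reasoning one.
grad_p = [0.03, 0.04]  # L2 norm 0.05
grad_r = [3.0, 4.0]    # L2 norm 5.0
combined, lam = combine(grad_p, grad_r)
scaled_p_norm = l2_norm([lam * g for g in grad_p])
print(round(lam, 3), round(scaled_p_norm, 3))  # 100.0 5.0
```

Because the weight is recomputed per mini-batch, no perception weight has to be tuned by hand. In a real training loop the two gradients would come from separate backward passes or per-segment gradient hooks.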

Feature-Space Alignment in Diffusion Models

For text-to-image diffusion transformers, PeSFT integrates a lightweight supervision path using frozen vision foundation models (e.g., SigLIP 2):

The text prompt is encoded into multi-granularity features (global semantics, patch-level correspondences, structural/aesthetics).
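A single shallow image-feature layer is projected to the text feature space via an MLP, and a cosine loss is computed:

$$\mathcal{L}_{\mathrm{align}} = 1 - \cos\big(\mathrm{MLP}(h_{\mathrm{img}}),\, f_{\mathrm{text}}\big)$$

where $h_{\mathrm{img}}$ denotes the shallow image features and $f_{\mathrm{text}}$ the frozen text features.

The total training loss is:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{diff}} + \lambda_{\mathrm{align}} \, \mathcal{L}_{\mathrm{align}}$$

where $\mathcal{L}_{\mathrm{diff}}$ is the generator's base training objective.

By restricting supervision to a shallow layer and using low $\lambda_{\mathrm{align}}$, the generator preserves prior diversity and avoids overfitting (Wang et al., 20 May 2026).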
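The alignment term can be sketched in a few lines. The example below is a simplified illustration on small lists rather than real model activations. The two-layer MLP shape, the `align_weight` value of 0.1, and all function names are assumptions. In practice the text features would come from a frozen encoder and the image features from a shallow generator layer.

```python
import math

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def mlp_project(image_feat, w1, b1, w2, b2):
    """Two-layer MLP mapping image features into the text feature space."""
    return linear(relu(linear(image_feat, w1, b1)), w2, b2)

def cosine_loss(a, b, eps=1e-8):
    """1 - cosine similarity; near 0 when aligned, 1 when orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb + eps)

def total_loss(base_loss, image_feat, text_feat, params, align_weight=0.1):
    """Base generator loss plus a lightly weighted alignment term."""
    projected = mlp_project(image_feat, *params)
    return base_loss + align_weight * cosine_loss(projected, text_feat)

# Identity MLP so the projection is easy to follow by hand.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], identity, [0.0, 0.0])
image_feat = [1.0, 2.0]

aligned = total_loss(0.5, image_feat, [2.0, 4.0], params)      # same direction
orthogonal = total_loss(0.5, image_feat, [-2.0, 1.0], params)  # orthogonal
print(round(aligned, 4), round(orthogonal, 4))  # 0.5 0.6
```

Keeping `align_weight` small means the alignment term nudges the shallow features without overriding the base objective, which matches the stated goal of preserving the generator's prior.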

Reinforced Supervised Fine-Tuning in Embodied Agents

Perception-aligned objectives in sequential multimodal planning add a reinforcement signal that rewards spatial alignment between generated images and predicted actions:

$$\mathcal{L}_{\mathrm{RSFT}} = \mathcal{L}_{\mathrm{SFT}} + \mathcal{L}_{\mathrm{RL}}$$

Here, $\mathcal{L}_{\mathrm{RL}}$ is a policy-gradient loss, with the reward function $R$ quantifying the match in dynamic regions (IoU and MSE) between ground-truth and sampled images (Cai et al., 3 Nov 2025).
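A toy version of such a reward and hybrid loss is sketched below. It is a guess at the general shape, not the formulation in the cited paper. Images are flat lists of grayscale values, "dynamic regions" are taken to be pixels that change relative to the current frame, the reward is IoU minus a weighted MSE, and the RL term is a REINFORCE-style surrogate. Every name and default value is hypothetical.

```python
def dynamic_mask(before, after, threshold=0.1):
    """Pixels whose value changes by more than threshold between two frames."""
    return [abs(a - b) > threshold for a, b in zip(after, before)]

def iou(mask_a, mask_b):
    inter = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return inter / union if union else 1.0

def masked_mse(img_a, img_b, mask):
    diffs = [(a - b) ** 2 for a, b, m in zip(img_a, img_b, mask) if m]
    return sum(diffs) / len(diffs) if diffs else 0.0

def reward(current, target, sampled, mse_weight=1.0):
    """IoU of changed regions minus MSE measured inside those regions."""
    gt_mask = dynamic_mask(current, target)
    pred_mask = dynamic_mask(current, sampled)
    region = [a or b for a, b in zip(gt_mask, pred_mask)]
    return iou(gt_mask, pred_mask) - mse_weight * masked_mse(target, sampled, region)

def rsft_loss(sft_nll, sample_logprob, r, baseline=0.0, rl_weight=1.0):
    """Supervised NLL plus a REINFORCE-style term for one sampled image."""
    return sft_nll + rl_weight * (-(r - baseline) * sample_logprob)

current = [0.0, 0.0, 0.0, 0.0]
target = [0.0, 1.0, 1.0, 0.0]   # ground-truth next frame
partial = [0.0, 1.0, 0.0, 0.0]  # sample that changes only one of two pixels

r_perfect = reward(current, target, target)
r_partial = reward(current, target, partial)
loss = rsft_loss(sft_nll=2.0, sample_logprob=-1.5, r=r_perfect, baseline=0.5)
print(r_perfect, r_partial, loss)  # 1.0 0.0 2.75
```

The supervised term stays in the objective throughout, which reflects the section's point that the reinforcement signal supplements maximum-likelihood training rather than replacing it.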

4. Algorithmic Implementations

Table: Overview of Representative PeSFT Algorithms

Context	Loss Balancing Method	Implementation Feature
Chain-of-thought SFT	Dynamic gradient-norm (NGDiff)	Adaptive perception-weight $\lambda$ schedule based on instantaneous gradient norms (Wu et al., 28 May 2026)
Diffusion transformers	Feature-space alignment	Cross-modal cosine loss at shallow layer using SigLIP 2 features (Wang et al., 20 May 2026)
Multimodal planning	Reinforced SFT	Hybrid supervised + RL loss, reward for pixel-space perceptual alignment (Cai et al., 3 Nov 2025)

In all cases, inference-time costs remain unchanged compared to standard SFT; PeSFT modules are only active during training.

5. Experimental Evidence and Quantitative Results

Vision-Language Reasoning (Synthetic Tasks)

On synthetic tasks (Graph Coloring, Sudoku), PeSFT (gradient-norm dynamic reweighting) yields major end-to-end accuracy improvements for both Qwen3-VL-2B-Instruct and InternVL3.5-2B, with gains of up to 18.2 percentage points over standard SFT. Under PeSFT, perception accuracy ($a_p$) rises from <5% to 25–30%, with minimal cost to reasoning accuracy. The trade-off curve of perception vs. reasoning (obtained by varying the perception weight $\lambda$) is Pareto-like; moderate perception upweighting optimizes overall performance (Wu et al., 28 May 2026). The table below reports end-to-end accuracy (%) by task (GC = Graph Coloring) and model:

Task	Model	SFT	+Loss RW	+NGDiff (PeSFT)
GC	Qwen	9.8	21.0	25.0
GC	InternVL	10.6	20.6	20.6
Sudoku	Qwen	6.4	18.4	22.0
Sudoku	InternVL	0.8	14.6	19.0
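The quoted maximum gain can be checked directly against the table. The short script below only restates the tabulated numbers and computes the difference between the +NGDiff and SFT columns. The dictionary layout is an illustrative choice.

```python
# (task, model) -> (SFT, +Loss RW, +NGDiff), copied from the table above.
results = {
    ("GC", "Qwen"): (9.8, 21.0, 25.0),
    ("GC", "InternVL"): (10.6, 20.6, 20.6),
    ("Sudoku", "Qwen"): (6.4, 18.4, 22.0),
    ("Sudoku", "InternVL"): (0.8, 14.6, 19.0),
}

# Gain of +NGDiff over standard SFT, in percentage points.
gains = {key: round(row[2] - row[0], 1) for key, row in results.items()}
best = max(gains, key=gains.get)
print(best, gains[best])  # ('Sudoku', 'InternVL') 18.2
```

The largest gain, 18.2 points, comes from Sudoku with InternVL. Dynamic balancing matches or exceeds fixed reweighting in every row, with a tie on Graph Coloring for InternVL.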

Text-to-Image Generation

PeSFT, when applied to MM-DiT diffusion backbones on portrait generation, outperforms both the baseline and conventional SFT across photorealism (FID), text-image alignment (CLIPScore/HPS), and aesthetics (ImageReward), with zero inference overhead and preserved generative diversity. The gains hold on several axes at once: FID = 27.40 vs. 27.94 for SFT (lower is better), and ImageReward = 21.57% vs. 18.60% for SFT. The resulting Pareto frontier is strictly improved across all axes (Wang et al., 20 May 2026).

Embodied Multimodal Planning

Reinforced SFT (RSFT, a PeSFT instantiation for planners) improves spatially aligned planning: Success Rate on Meeting Preparation increases from 62.2% (SFT) to 67.6% (RSFT). Notably, RL-only training collapses, whereas the hybrid supervised-plus-RL objective consistently enhances perceptual alignment in image-based subgoals (Cai et al., 3 Nov 2025).

6. Practical Recommendations and Limitations

Chain-of-thought formats that allocate few tokens to perception require explicit reweighting, either fixed or (preferably) dynamic via NGDiff, to mitigate perception collapse (Wu et al., 28 May 2026).
Feature-level alignment for generative models should be injected at shallow layers and with low weights to avoid overfitting and preserve open-domain generalization (Wang et al., 20 May 2026).
Reinforcement-based perception alignment terms must be paired with maximum-likelihood SFT for sample efficiency and robust convergence (Cai et al., 3 Nov 2025).
Inference performance and latency are unaffected by PeSFT, as all supervision heads and branches can be ablated post-training.
PeSFT applicability is general, requiring only the ability to segment out perception vs. reasoning or to couple vision-aligned features at train time.
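The segmentation requirement in the last recommendation can be met with very little machinery when the output format marks the perception span explicitly. The sketch below assumes hypothetical `<perception>` and `</perception>` delimiter tokens. The cited work may segment outputs differently.

```python
def perception_mask(tokens, open_tag="<perception>", close_tag="</perception>"):
    """Mark tokens strictly inside the delimiters as perception tokens.

    The delimiter tokens themselves are treated as non-perception in this sketch.
    """
    mask, inside = [], False
    for tok in tokens:
        if tok == open_tag:
            inside = True
            mask.append(False)
        elif tok == close_tag:
            inside = False
            mask.append(False)
        else:
            mask.append(inside)
    return mask

tokens = ["<perception>", "node", "A", "adjacent", "B", "</perception>",
          "so", "color", "A", "red"]
mask = perception_mask(tokens)
print(sum(mask), len(tokens) - sum(mask))  # 4 6
```

A mask like this is all that a per-segment loss or per-segment gradient computation needs as input.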

Limitations include the need for an explicit perception/reasoning decomposition or access to high-quality feature extractors, difficulty in highly complex or compositional generation scenarios, and, for RL-based PeSFT, the possible need for reliable reward proxies. Suggested future work includes extending feature-guided PeSFT to U-Net and latent score backbones and developing lightweight explicit heads for more granular aesthetic or spatial supervision (Wang et al., 20 May 2026).

7. Connections and Implications for Future Research

PeSFT provides a principled, pragmatic framework for directly addressing long-standing perception-reasoning trade-offs in multimodal supervised fine-tuning. Its formal diagnosis of two mechanisms, token imbalance and insufficient feature-level supervision, enables broad application across reasoning-centric VLMs, text-conditioned image synthesis, and embodied agents.

A plausible implication is that future vision-language architectures may co-design data annotation, output format, and training objectives to maximize token efficiency for perception, or further leverage frozen, richly trained foundation models as cross-modal signal sources. The PeSFT paradigm is likely to serve as a baseline for post-training interventions in any system where perception and higher-order cognitive tasks must be co-optimized with minimal training complexity and maximum practical compatibility (Wu et al., 28 May 2026, Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).

References (3)

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training (2026)

Pareto-Enhanced Portrait Generation: Vision-Aligned Text Supervision for Alignment, Realism, and Aesthetics (2026)

EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning (2025)
