PeSFT: Perception-Aligned Supervised Fine-Tuning
- PeSFT is a supervised fine-tuning approach that explicitly balances perception and reasoning by reweighting loss contributions in multimodal models.
- It employs dynamic gradient-norm balancing and feature-space alignment to overcome token imbalance and ensure robust perceptual processing.
- Experimental results demonstrate significant improvements in perception accuracy and overall performance across visual reasoning, text-to-image synthesis, and embodied planning tasks.
Perception-Aligned Supervised Fine-Tuning (PeSFT) refers to a class of supervised fine-tuning methodologies designed to explicitly optimize the perceptual components of multimodal and vision-language models (VLMs). These approaches aim to balance the learning signal between perception (i.e., accurate extraction or alignment of input-modality information such as image-to-text grounding, visual layout, or aesthetics) and downstream tasks such as reasoning, planning, or generation. PeSFT seeks to overcome systematic imbalances in standard supervised fine-tuning regimes that leave perception modules under-optimized and turn them into a bottleneck for overall system performance, especially in end-to-end visual reasoning, text-to-image alignment, and complex multimodal workflows (Wu et al., 28 May 2026, Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).
1. Motivation and Problem Formulation
A recurring empirical finding across recent large-scale vision-language and multimodal generation models is an asymmetry in post-training improvements: while supervised fine-tuning (SFT) or post-training significantly enhances reasoning capabilities, perception components (e.g., extracting or aligning low-level scene information or attributes) show limited gains. This discrepancy constrains overall task performance, notably for tasks requiring tightly coupled perception and reasoning or photorealistic, semantically aligned generation.
In the canonical chain-of-thought (CoT) framework for VLMs, the output sequence can be decomposed into two contiguous segments: a perception segment (grounded in visual/textual input) and a reasoning segment (goal-directed or logical deduction). Let $\pi_\theta$ be the trained auto-regressive model, $x$ the multimodal input, and $y = (y_p, y_r)$ the CoT sequence, with $y_p$ the perception segment and $y_r$ the reasoning segment. Measured metrics commonly include:
- Perception Accuracy: $\mathrm{Acc}_p$, the rate at which the generated perception segment $y_p$ is correct
- Reasoning Accuracy: $\mathrm{Acc}_r$ (evaluated by sampling reasoning conditioned on oracle perception)
- End-to-end Accuracy: $\mathrm{Acc}_{\text{e2e}}$, the rate at which the final answer is correct when the model generates the full sequence itself
Observed during conventional SFT: $\mathrm{Acc}_r$ increases rapidly with training, while $\mathrm{Acc}_p$ remains near baseline, choking end-to-end accuracy (Wu et al., 28 May 2026). This is not confined to VLMs, but also affects text-to-image generators and embodied planning models (Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).
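The three metrics above can be tallied from per-example evaluation records. The following is a minimal sketch, not the evaluation code of the cited work: the `EvalRecord` fields and the `summarize` helper are hypothetical names introduced here, and it assumes each example has already been judged for perception correctness, for answer correctness under oracle perception, and for answer correctness under fully model-generated output.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical per-example judgments (assumed to be computed elsewhere).
    perception_correct: bool        # generated perception segment matches ground truth
    reasoning_correct_oracle: bool  # answer correct when reasoning is conditioned on oracle perception
    answer_correct: bool            # answer correct when the model generates the full CoT itself


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(records):
    """Return perception, reasoning (oracle-conditioned), and end-to-end accuracy."""
    return {
        "perception": accuracy(r.perception_correct for r in records),
        "reasoning": accuracy(r.reasoning_correct_oracle for r in records),
        "end_to_end": accuracy(r.answer_correct for r in records),
    }


# Toy data illustrating the asymmetry: reasoning is strong, perception is weak,
# and end-to-end accuracy is capped by perception.
records = [
    EvalRecord(False, True, False),
    EvalRecord(True, True, True),
    EvalRecord(False, True, False),
    EvalRecord(False, False, False),
]
print(summarize(records))
# {'perception': 0.25, 'reasoning': 0.75, 'end_to_end': 0.25}
```

In this toy sample the model reasons correctly on three of four examples when given oracle perception, yet only one example succeeds end to end because perception fails on the others.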
2. Root Causes: Token Imbalance and Feature-Level Misalignment
Token Imbalance in Chain-of-Thought Supervision
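Standard SFT computes cross-entropy over the full output sequence $y$:
$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{|y|} \log \pi_\theta(y_t \mid x, y_{<t})$$
with $\pi_\theta$ the model, $x$ the multimodal input, and $y_{<t}$ the preceding tokens.
When parsing $y = (y_p, y_r)$, the perception segment $y_p$ typically constitutes only 2–3% of tokens, so its contribution to the total loss and gradient is disproportionately small. This starves perception of optimization signal, especially as the reasoning segment grows ($|y_r| \gg |y_p|$). Chain-of-thought SFT thus inherently favors reasoning at the expense of perception (Wu et al., 28 May 2026).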
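To make the imbalance concrete, the share of the summed cross-entropy that comes from the perception segment can be worked out directly. The sketch below is illustrative arithmetic only: the function name, the token counts, and the per-token loss values are assumptions chosen for the example, not figures from the cited work.

```python
def perception_loss_share(n_perception, n_reasoning,
                          nll_perception=1.0, nll_reasoning=1.0):
    """Fraction of the summed token-level loss contributed by perception tokens.

    n_*   : number of tokens in each segment
    nll_* : assumed mean negative log-likelihood per token in each segment
    """
    perception_total = n_perception * nll_perception
    reasoning_total = n_reasoning * nll_reasoning
    return perception_total / (perception_total + reasoning_total)


# 25 perception tokens in a 1000-token chain of thought (2.5% of tokens).
share_equal = perception_loss_share(25, 975)
# Even if each perception token is twice as hard, its share stays below 5%.
share_harder = perception_loss_share(25, 975, nll_perception=2.0)

print(f"{share_equal:.3f} {share_harder:.3f}")
```

With equal per-token difficulty the perception share equals the token share, so a segment holding 2.5% of the tokens receives about 2.5% of the loss signal under a plain sum over tokens.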
Inadequate Feature-Level Alignment in Generative Models
For text-to-image diffusion models (MM-DiT, etc.), naïve SFT on curated datasets may increase photorealism but at the expense of text-image alignment or aesthetic controllability. This reflects a lack of dense, semantically grounded guidance at the feature level: pixel- or latent-level targets alone risk overfitting and prior collapse, and fail to resolve the trilemma between alignment, realism, and aesthetics (Wang et al., 20 May 2026).
This suggests that both gradient flow (token imbalance) and supervision granularity (missing vision-aligned feature guidance) are central to perception collapse in most current SFT pipelines.
3. Core Methodologies in PeSFT
Dynamic Loss Reweighting for Perception and Reasoning
A central innovation is the explicit rebalancing of loss contributions from perception and reasoning sub-tasks. Several variants are documented:
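- Fixed Reweighting: Introducing a static hyperparameter $\lambda$ to upweight the normalized perception loss, with a weighting of the form
$$\mathcal{L}_{\lambda} = \lambda\,\frac{\mathcal{L}_p}{|y_p|} + (1-\lambda)\,\frac{\mathcal{L}_r}{|y_r|}$$
where $\mathcal{L}_p$, $\mathcal{L}_r$ are total negative log-likelihoods over each segment and $|y_p|$, $|y_r|$ are the segment lengths in tokens.
- Dynamic Gradient-Norm Balancing (NGDiff): At each mini-batch, compute the $\ell_2$ norm of the perception- and reasoning-segment gradients, $\|g_p\|_2$ and $\|g_r\|_2$, and set
$$\lambda = \frac{\|g_r\|_2}{\|g_p\|_2 + \|g_r\|_2}$$
yielding an adaptive, tuning-free weighting that equalizes gradient impact and restores perception learning (Wu et al., 28 May 2026).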
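The two weighting schemes above can be sketched on toy gradient vectors. This is a minimal illustration under stated assumptions, not the implementation from the cited work: it assumes a convex weighting (weight `lam` on perception, `1 - lam` on reasoning) over length-normalized segment losses, and it assumes the dynamic rule picks `lam` so that the two weighted gradient norms match. All function names are hypothetical, and the exact rule in the source may differ.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in vec))


def fixed_weighted_loss(nll_p, nll_r, n_p, n_r, lam):
    """Static reweighting: lam * mean perception NLL + (1 - lam) * mean reasoning NLL."""
    return lam * nll_p / n_p + (1.0 - lam) * nll_r / n_r


def balanced_weight(grad_p, grad_r, eps=1e-12):
    """Assumed dynamic rule: choose lam so that
    lam * ||grad_p|| is (up to eps) equal to (1 - lam) * ||grad_r||."""
    norm_p, norm_r = l2_norm(grad_p), l2_norm(grad_r)
    return norm_r / (norm_p + norm_r + eps)


def combine_gradients(grad_p, grad_r, lam):
    """Weighted sum of the two segment gradients used for the parameter update."""
    return [lam * gp + (1.0 - lam) * gr for gp, gr in zip(grad_p, grad_r)]


# Toy gradients: the perception gradient is 100x smaller than the reasoning gradient.
grad_p = [0.03, 0.04]  # norm 0.05
grad_r = [3.0, 4.0]    # norm 5.0

lam = balanced_weight(grad_p, grad_r)
update = combine_gradients(grad_p, grad_r, lam)
static_loss = fixed_weighted_loss(nll_p=10.0, nll_r=200.0, n_p=20, n_r=800, lam=0.5)

print(round(lam, 4), static_loss)
```

In a real training loop the two gradients would come from separate backward passes (or per-segment gradient accumulation) over the perception and reasoning tokens of the same mini-batch; the toy vectors here only show how the weight reacts to a large norm gap.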
Feature-Space Alignment in Diffusion Models
For text-to-image diffusion transformers, PeSFT integrates a lightweight supervision path using frozen vision foundation models (e.g., SigLIP 2):
- The text prompt is encoded into multi-granularity features (global semantics, patch-level correspondences, structural/aesthetics).
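- A single shallow image-feature layer $h_\ell$ is projected to the text feature space via an MLP $\phi$, and a cosine loss against the text feature $f_{\text{text}}$ is computed:
$$\mathcal{L}_{\text{align}} = 1 - \cos\big(\phi(h_\ell),\, f_{\text{text}}\big)$$
The total training loss is:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{SFT}} + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}}$$
where $\mathcal{L}_{\text{SFT}}$ is the generator's base fine-tuning objective. By restricting supervision to a shallow layer and using low $\lambda_{\text{align}}$, the generator preserves prior diversity and avoids overfitting (Wang et al., 20 May 2026).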
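The feature-space alignment term described above can be sketched with plain Python lists. This is an illustrative sketch under assumptions, not the method's actual code: the two-layer ReLU MLP, the `1 - cosine` form of the loss, the default `align_weight`, and every function name are hypothetical, and in practice the text feature would come from a frozen encoder rather than a hand-written vector.

```python
import math


def matvec(weights, vec):
    """Multiply a row-major weight matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


def mlp_project(image_feat, w1, w2):
    """Two-layer MLP (no biases) mapping an image feature into text-feature space."""
    return matvec(w2, relu(matvec(w1, image_feat)))


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def alignment_loss(image_feat, text_feat, w1, w2):
    """Cosine alignment loss: 0 when projected image and text features are parallel."""
    return 1.0 - cosine_similarity(mlp_project(image_feat, w1, w2), text_feat)


def total_loss(base_loss, align_loss, align_weight=0.1):
    """Base fine-tuning loss plus a lightly weighted alignment term (weight is assumed)."""
    return base_loss + align_weight * align_loss


identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = alignment_loss([1.0, 2.0], [2.0, 4.0], identity, identity)      # parallel features
misaligned = alignment_loss([1.0, 0.0], [0.0, 1.0], identity, identity)   # orthogonal features

print(round(aligned, 6), round(misaligned, 6), total_loss(0.5, misaligned))
```

Only the projection weights (and the generator) would receive gradients from this term; the encoder supplying the text feature stays frozen, which is why the branch can be dropped entirely at inference time.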
Reinforced Supervised Fine-Tuning in Embodied Agents
Perception-aligned objectives in sequential multimodal planning add a reinforcement signal that rewards spatial alignment between generated images and predicted actions:
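- $\mathcal{L}_{\text{RSFT}} = \mathcal{L}_{\text{SFT}} + \beta\,\mathcal{L}_{\text{RL}}$, combining the maximum-likelihood SFT term with a reinforcement term weighted by $\beta$
- $\mathcal{L}_{\text{RL}}$ is a policy-gradient loss, with the reward function $R$ quantifying the match in dynamic regions (IoU and MSE) between ground-truth and sampled images (Cai et al., 3 Nov 2025).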
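A reward of the kind described above can be sketched on flattened grayscale frames. Everything specific here is an assumption made for illustration: dynamic regions are taken to be pixels that differ from a reference frame by more than a threshold, the reward is taken to be IoU minus a weighted MSE, and the policy-gradient term is a plain REINFORCE-style estimator. The cited work may define each of these differently, and all function names are hypothetical.

```python
def dynamic_mask(frame, reference, threshold=0.1):
    """Pixels that changed relative to the reference frame (assumed definition)."""
    return [abs(f - r) > threshold for f, r in zip(frame, reference)]


def iou(mask_a, mask_b):
    """Intersection over union of two boolean masks; 1.0 if both are empty."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    union = sum(a or b for a, b in zip(mask_a, mask_b))
    return intersection / union if union else 1.0


def masked_mse(pred, target, mask):
    """Mean squared error restricted to masked pixels; 0.0 if the mask is empty."""
    diffs = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(diffs) / len(diffs) if diffs else 0.0


def alignment_reward(sampled, ground_truth, reference, mse_weight=1.0):
    """Assumed reward: IoU of dynamic regions minus weighted MSE inside their union."""
    mask_s = dynamic_mask(sampled, reference)
    mask_g = dynamic_mask(ground_truth, reference)
    union = [a or b for a, b in zip(mask_s, mask_g)]
    return iou(mask_s, mask_g) - mse_weight * masked_mse(sampled, ground_truth, union)


def reinforce_loss(log_prob, reward, baseline=0.0):
    """REINFORCE-style surrogate loss for one sampled output."""
    return -(reward - baseline) * log_prob


def rsft_loss(sft_loss, rl_loss, rl_weight=1.0):
    """Hybrid objective: supervised loss plus weighted reinforcement loss."""
    return sft_loss + rl_weight * rl_loss


reference = [0.0, 0.0, 0.0, 0.0]     # frame before the action
ground_truth = [0.0, 1.0, 1.0, 0.0]  # frame after the action
good = alignment_reward([0.0, 1.0, 1.0, 0.0], ground_truth, reference)
bad = alignment_reward([1.0, 0.0, 0.0, 0.0], ground_truth, reference)

print(good, bad, rsft_loss(0.7, reinforce_loss(-2.0, good, baseline=0.5)))
```

A sampled image that changes exactly the right pixels earns the maximum reward, while one that changes the wrong region is penalized on both the overlap and the pixel-error terms.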
4. Algorithmic Implementations
Table: Overview of Representative PeSFT Algorithms
| Context | Loss Balancing Method | Implementation Feature |
|---|---|---|
| Chain-of-thought SFT | Dynamic gradient-norm (NGDiff) | Adaptive loss-weight ($\lambda$) schedules based on instantaneous gradient norms (Wu et al., 28 May 2026) |
| Diffusion transformers | Feature-space alignment | Cross-modal cosine loss at shallow layer using SigLIP 2 features (Wang et al., 20 May 2026) |
| Multimodal planning | Reinforced SFT | Hybrid supervised + RL loss, reward for pixel-space perceptual alignment (Cai et al., 3 Nov 2025) |
In all cases, inference-time costs remain unchanged compared to standard SFT; PeSFT modules are only active during training.
5. Experimental Evidence and Quantitative Results
Vision-Language Reasoning (Synthetic Tasks)
On synthetic tasks (Graph Coloring, Sudoku), PeSFT (gradient-norm dynamic reweighting) yields major end-to-end accuracy improvements for both Qwen3-VL-2B-Instruct and InternVL3.5-2B, with gains of up to 18.2 percentage points over standard SFT. Under PeSFT, perception accuracy ($\mathrm{Acc}_p$) rises from <5% to 25–30%, with minimal cost to reasoning accuracy. The trade-off curve of perception vs. reasoning (obtained by varying the perception weight $\lambda$) is Pareto-like; moderate perception upweighting gives the best overall performance (Wu et al., 28 May 2026).
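Table: End-to-end accuracy (%) on synthetic tasks
| Task | Model | SFT | + Fixed loss reweighting | + NGDiff (PeSFT) |
|---|---|---|---|---|
| Graph Coloring | Qwen3-VL-2B-Instruct | 9.8 | 21.0 | 25.0 |
| Graph Coloring | InternVL3.5-2B | 10.6 | 20.6 | 20.6 |
| Sudoku | Qwen3-VL-2B-Instruct | 6.4 | 18.4 | 22.0 |
| Sudoku | InternVL3.5-2B | 0.8 | 14.6 | 19.0 |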
Text-to-Image Generation
PeSFT, when applied to MM-DiT diffusion backbones on portrait generation, outperforms both the baseline and conventional SFT across photorealism (FID), text-image alignment (CLIPScore/HPS), and aesthetics (ImageReward), with zero inference overhead and preserved generative diversity. The gains are joint rather than traded off against each other: FID is 27.40 versus 27.94 for SFT (lower is better), and ImageReward is 21.57% versus 18.60%. The resulting Pareto frontier is strictly improved across all axes (Wang et al., 20 May 2026).
Embodied Multimodal Planning
Reinforced SFT (a PeSFT instantiation for planners) enhances spatio-aligned planning: Success Rate on Meeting Preparation increases from 62.2% (SFT) to 67.6% (RSFT). Notably, RL-only training collapses, whereas PeSFT hybridization consistently enhances perceptual alignment in image-based subgoals (Cai et al., 3 Nov 2025).
6. Practical Recommendations and Limitations
- Chains of thought that allocate only a small fraction of tokens to perception require explicit reweighting, either fixed or, preferably, dynamic (NGDiff), to mitigate perception collapse (Wu et al., 28 May 2026).
- Feature-level alignment for generative models should be injected at shallow layers and with low weights to avoid overfitting and preserve open-domain generalization (Wang et al., 20 May 2026).
- Reinforcement-based perception alignment terms must be paired with maximum-likelihood SFT for sample efficiency and robust convergence (Cai et al., 3 Nov 2025).
- Inference performance and latency are unaffected by PeSFT, as all supervision heads and branches can be ablated post-training.
- PeSFT applicability is general, requiring only the ability to segment out perception vs. reasoning or to couple vision-aligned features at train time.
Limitations include the need for an explicit perception/reasoning decomposition or access to high-quality feature extractors, difficulty in highly complex or compositional generation scenarios, and, for RL-based PeSFT, the need for credible reward proxies. Suggested future work includes extending feature-guided PeSFT to U-Net and latent score backbones, and developing lightweight explicit heads for more granular aesthetic or spatial supervision (Wang et al., 20 May 2026).
7. Connections and Implications for Future Research
PeSFT provides a principled, pragmatic framework for directly addressing long-standing perception-reasoning trade-offs in multimodal supervised fine-tuning. The formal diagnosis of two mechanisms, token imbalance and insufficient feature-level supervision, enables broad application across reasoning-centric VLMs, text-conditioned image synthesis, and embodied agents.
A plausible implication is that future vision-language architectures may co-design data annotation, output format, and training objectives to maximize token efficiency for perception, or further leverage frozen, richly trained foundation models as cross-modal signal sources. The PeSFT paradigm is likely to serve as a baseline for post-training interventions in any system where perception and higher-order cognitive tasks must be co-optimized with minimal training complexity and maximum practical compatibility (Wu et al., 28 May 2026, Wang et al., 20 May 2026, Cai et al., 3 Nov 2025).