PA-SFT: Activating Perceptual Competence

Updated 4 July 2026

PA-SFT is a framework for rebalancing and activating latent perceptual representations in pretrained models through targeted supervised signals.
It unifies methods like latent-capability transfer, process-aware supervision, and loss reweighting to enhance downstream task performance.
Reported results show gains on tasks such as OCR, image retrieval, and fine-grained classification.

Perception Activation Supervised Fine-Tuning (PA-SFT) is best understood as an interpretive umbrella for post-pretraining adaptation strategies that aim to surface, sharpen, or rebalance latent perceptual competence in pretrained models through supervised signals. The term itself is generally not formalized in the cited literature. Instead, closely related work describes “unleashing fine-grained knowledge” in CLIP-style vision backbones, enforcing perception-before-reasoning output protocols, or correcting the “asymmetric optimization of reasoning and perception” in chain-of-thought training (Jiang et al., 2024; Jiang et al., 14 Nov 2025; Wu et al., 28 May 2026). Under this reading, PA-SFT denotes a family of methods in which supervision is designed not merely to improve end-task labels, but to make perceptual information more reusable, more explicit, and more influential in downstream inference.

1. Conceptual scope and lineage

PA-SFT is not a single standardized algorithm. It is a unifying lens over several recurrent design choices in recent supervised and post-training research. One line of work assumes that pretrained vision or multimodal backbones already contain substantial fine-grained information and that a relatively small amount of object-level supervision can transfer that knowledge into reusable backbone-side adaptations rather than leaving it trapped in task heads (Jiang et al., 2024). A second line treats adaptation as reconfiguration of internal functional units, especially attention heads, and studies how SFT rapidly changes task-specific activation patterns under limited data (Zhao et al., 2024). A third line makes perception an explicit supervised stage before reasoning, forcing the model to first emit visual evidence and only then reason over it (Jiang et al., 14 Nov 2025).

The literature therefore suggests three overlapping meanings of “perception activation.” First, it can mean latent-capability elicitation: pretrained models already encode useful perceptual structure, but SFT makes it more behaviorally available. Second, it can mean process separation: the perceptual stage is externalized as a distinct object of supervision. Third, it can mean optimization rebalancing: the supervised objective is modified so that short perception spans are not numerically dominated by longer reasoning traces (Wu et al., 28 May 2026).

A concise taxonomy is useful.

Family	Representative formulation	PA-SFT relevance
Latent-capability transfer	ViSFT two-stage backbone adaptation	Activates reusable fine-grained visual knowledge
Process-aware supervision	`<observation>...</observation><think>...</think><answer>...</answer>`	Makes perception explicit before reasoning
Segment-aware loss design	Perception/reasoning reweighting and NGDiff	Keeps short perception spans from being dominated by longer reasoning traces
Activation-space alignment	Head-pattern analysis, IA2, SAE diagnostics	Diagnoses or supervises internal activation patterns

This framing also distinguishes PA-SFT from two nearby but different paradigms. It is not identical to region-level pretraining: ViSFT explicitly positions itself as a post hoc supervised stage rather than a scalable region-aware pretraining loss (Jiang et al., 2024). Nor is it identical to generic multitask learning: the core claim is that supervision should alter reusable perceptual representations, internal activation patterns, or explicit perceptual stages, rather than merely improving task-specific decoders.

2. Canonical design patterns

A canonical PA-SFT-style design is the two-stage backbone-transfer procedure of ViSFT. Stage 1 trains task heads independently on top of a frozen pretrained vision transformer, optimizing $L_n(\mathbf{y}, T_n(\mathbf{f}))$ with $\mathbf{f}=M(\mathbf{x})$. Stage 2 then freezes both the pretrained backbone and the trained heads, inserts LoRA adapters into the backbone, and optimizes $L_n'(\mathbf{y}, T_n(\mathbf{f}'))$ with $\mathbf{f}'=M(\mathbf{x};\Delta W)$. The deliberate point of this decomposition is to force supervision into backbone-side LoRA parameters rather than letting the heads absorb the knowledge. In the reported implementation, ViSFT uses COCO detection, instance segmentation, and image captioning; applies LoRA only to the query and value projections; and for EVA-ViT-E uses rank $r=64$ with 29.4M trainable LoRA parameters, roughly comparable to the 36.8M parameters of all task heads (Jiang et al., 2024).
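The reported LoRA budget can be sanity-checked with simple arithmetic. The sketch below counts the parameters of rank-$r$ adapters on the query and value projections of every transformer block. The depth of 64 blocks and hidden width of 1792 are assumed values for EVA-ViT-E that are not stated above, and `lora_param_count` is a hypothetical helper written for this illustration.

```python
def lora_param_count(width: int, depth: int, rank: int, matrices_per_block: int = 2) -> int:
    """Count trainable parameters for LoRA adapters on square projections.

    Each adapted (width x width) projection gets two low-rank factors,
    A (rank x width) and B (width x rank), i.e. 2 * rank * width parameters.
    matrices_per_block=2 corresponds to adapting only query and value.
    """
    per_matrix = rank * width + width * rank
    return per_matrix * matrices_per_block * depth


# Assumed EVA-ViT-E shape (not given in the text): 64 blocks, hidden width 1792.
total = lora_param_count(width=1792, depth=64, rank=64)
print(f"{total:,} LoRA parameters (~{total / 1e6:.1f}M)")
```

Under these assumed dimensions the count comes to about 29.4M, in line with the reported figure; a different depth or width would change the total proportionally.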

A second canonical design is process-aware perception-first supervision. VideoP2R standardizes the target sequence as `<observation>...</observation><think>...</think><answer>...</answer>`, where `<observation>` is the perception process and `<think>` plus `<answer>` are the reasoning process. Its SFT dataset, VideoP2R-CoT-162K, is produced by a three-step pipeline: Qwen2.5-VL-72B-Instruct generates process-aware CoTs; outputs are filtered by task-specific metrics with samples below 0.6 removed; and Claude 3.7 Sonnet verifies whether the observation segment alone is sufficient evidence for the answer. All generated CoTs follow the same template at train and inference time (Jiang et al., 14 Nov 2025). In PA-SFT terms, this is explicit perception-stage supervision rather than latent-only shaping.
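A minimal sketch of how such a template could be enforced when curating data is given below. It is not VideoP2R's actual tooling: `parse_process_aware` and `keep_sample` are hypothetical helpers, the sample text is invented, and the only details taken from the description above are the tag order and the 0.6 cutoff.

```python
import re
from typing import Optional

# Template: <observation>...</observation><think>...</think><answer>...</answer>
_TEMPLATE = re.compile(
    r"\s*<observation>(?P<observation>.*?)</observation>\s*"
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*",
    re.DOTALL,
)


def parse_process_aware(text: str) -> Optional[dict]:
    """Return the three segments, or None if the output breaks the template."""
    match = _TEMPLATE.fullmatch(text)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


def keep_sample(text: str, score: float, threshold: float = 0.6) -> bool:
    """Keep a sample only if it is well-formed and its metric reaches the threshold."""
    return parse_process_aware(text) is not None and score >= threshold


# Invented example output, used only to exercise the helpers.
sample = (
    "<observation>A red car turns left at the junction.</observation>"
    "<think>The car turns left, so option B matches.</think>"
    "<answer>B</answer>"
)
parsed = parse_process_aware(sample)
print(parsed["observation"], "|", parsed["answer"])
print(keep_sample(sample, score=0.72))
print(keep_sample("<think>no observation</think><answer>B</answer>", score=0.9))
```

Separating the observation segment at parse time is also what makes the sufficiency check possible: the observation text can be handed to a verifier without the reasoning or the answer attached.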

A third design corrects supervision imbalance directly in the loss. In the controlled vision-language study on asymmetric optimization, standard token-averaged SFT is decomposed into a perception part $\mathcal{L}_p$ and a reasoning part $\mathcal{L}_r$, then replaced by

$\mathcal{L}_{\text{SFT},\lambda}=\lambda \frac{\mathcal{L}_p}{|\mathbf{p}|} + (1-\lambda)\frac{\mathcal{L}_r}{|\mathbf{r}|}.$

The paper further uses NGDiff, with

$\lambda=\frac{1}{\|\mathbf{g}_p\|}\Big/\left(\frac{1}{\|\mathbf{g}_p\|}+\frac{1}{\|\mathbf{g}_r\|}\right),$

to equalize optimization pressure dynamically across the two segments (Wu et al., 28 May 2026). This makes perception activation an objective-level property rather than only a data-format property.
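The two formulas above can be illustrated with a small numerical sketch. It assumes per-token negative log-likelihoods are already available as plain lists and that the two gradient norms have been measured elsewhere; the function names and toy numbers are illustrative and are not taken from the paper.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def token_averaged_loss(perception_nll: Sequence[float], reasoning_nll: Sequence[float]) -> float:
    """Standard SFT: a single average over all tokens, so longer segments dominate."""
    total = sum(perception_nll) + sum(reasoning_nll)
    return total / (len(perception_nll) + len(reasoning_nll))


def reweighted_loss(perception_nll: Sequence[float], reasoning_nll: Sequence[float], lam: float) -> float:
    """Segment-normalized loss: lam * L_p/|p| + (1 - lam) * L_r/|r|."""
    return lam * _mean(perception_nll) + (1.0 - lam) * _mean(reasoning_nll)


def ngdiff_lambda(grad_norm_p: float, grad_norm_r: float) -> float:
    """Perception weight from inverse gradient norms, following the formula above."""
    inv_p, inv_r = 1.0 / grad_norm_p, 1.0 / grad_norm_r
    return inv_p / (inv_p + inv_r)


# Toy values: 2 perception tokens with high loss, 8 reasoning tokens with low loss.
perception_nll = [2.0, 2.0]
reasoning_nll = [0.5] * 8

standard = token_averaged_loss(perception_nll, reasoning_nll)
balanced = reweighted_loss(perception_nll, reasoning_nll, lam=0.5)
lam = ngdiff_lambda(grad_norm_p=4.0, grad_norm_r=1.0)
print(standard, balanced, lam)
```

In the toy numbers, perception tokens are 2 of 10, so token averaging fixes their share of the average at 20% purely by length, whereas the segment-normalized form lets $\lambda$ alone set the share. With the gradient-norm weight, $\lambda\|\mathbf{g}_p\|$ equals $(1-\lambda)\|\mathbf{g}_r\|$, which is the sense in which optimization pressure is equalized.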

3. Empirical evidence for perception activation

The strongest direct evidence comes from transfer beyond the supervised tasks themselves. In ViSFT, a frozen EVA-ViT-G backbone with a lightweight OCR head improves average OCR accuracy from 44.4 to 46.9 after 5k iterations and to 47.6 after 15k. Grounded Object Identification on M$^3$IT also improves from 52.3 to 52.9 for EVA-ViT-G and from 54.9 to 55.2 for EVA-ViT-E. Zero-shot classification gains are small but consistent, including ImageNet-A 82.1 to 82.4, while few-shot gains are larger on fine-grained datasets such as Caltech101 92.4 to 94.3 and Aircraft 68.1 to 69.7. Multimodal transfer also improves: on BLIP-2 ViT-G, VQAv2 rises from 51.9 to 53.0 and OK-VQA from 31.5 to 32.8, while COCO image retrieval on EVA-CLIP-E improves from 74.9 to 75.2 at 5k and to 76.0 at 50k (Jiang et al., 2024). This pattern is consistent with the claim that targeted supervision changes representational geometry in a reusable way.

Process-aware SFT yields similarly direct evidence that explicit perceptual supervision matters. In VideoP2R, process-aware SFT alone reaches 55.6 average accuracy, compared with 53.5 for process-agnostic SFT and 52.9 for the base Qwen2.5-VL-7B. The same work also shows that when Qwen2.5-VL-7B is given only text questions augmented with VideoP2R-generated observation segments, it reaches 55.5 average accuracy, which exceeds the raw-video baseline of 52.9 (Jiang et al., 14 Nov 2025). The observation segment is therefore not decorative; it functions as information-sufficient perceptual evidence for downstream reasoning.

The asymmetric-optimization study shows that merely reallocating supervised signal toward perception can produce large end-to-end gains without changing architecture or data. Standard SFT yields 9.8 on Graph Coloring with Qwen and 0.8 on Sudoku with InternVL, while fixed loss reweighting improves these to 21.0 and 14.6. NGDiff further lifts them to 25.0 and 19.0, with the largest reported gain being +18.2 on InternVL Sudoku (Wu et al., 28 May 2026). Here perception activation is not inferred from attention or hidden states; it is observed as a consequence of altering the supervised objective.

Adjacent perception-oriented supervised tuning results strengthen the broader picture. In promptable anomaly segmentation, Self-Perception Tuning reaches an average of 68.4/62.8 mIoU/mBIoU, compared with 66.5/61.0 under LoRA and 52.0/47.0 for the zero-shot SAM baseline, while keeping trainable parameters at 0.397% (Yang et al., 2024). In screen-conditioned rationale generation, ordinary multimodal LoRA-SFT on Qwen3-VL-8B-Instruct reaches sem_sim 0.783 on a 661-row held-out slice, while GPT-5.5 zero-shot reaches 0.482 and Claude Opus 4.7 reaches 0.459; however, the same recipe on Gemma-4-26B-A4B-IT reaches only 0.441, showing that perception-grounded SFT can be powerful but architecture-sensitive (Bissa et al., 28 May 2026). ViPER, although primarily RL-based, reports an average gain of 1.7% on seven benchmarks and up to 6.0% on fine-grained perception, providing adjacent evidence that perception-centered post-training can improve fine-grained visual behavior without sacrificing generalization (Zhang et al., 28 Oct 2025).

4. Mechanistic accounts of what changes during PA-SFT

One mechanistic account locates the effect in attention-head usage. The attention-pattern study defines a per-head activation level and interprets task adaptation as reconfiguration of head-level contribution patterns (Zhao et al., 2024). It reports that complex-task activation changes can be modeled as combinations of basic-task changes, both for SGSM as a combination of Code Search Net and GSM8K and for reasoning-plus-programming instructions as a combination of the corresponding single-skill tasks. It also shows that activation patterns shift rapidly in early SFT and that activation-guided prerequisite data selection improves scarce-data adaptation. Under a PA-SFT reading, this suggests that supervised tuning often recombines pretrained functional units rather than learning entirely new circuits.

A second mechanistic account argues that dense hidden geometry understates the true magnitude of SFT-induced change. Using frozen SAEs pretrained on the base model, the SAE-based investigation finds that raw activation cosine remains above 0.960 across tasks and layers, yet latent cosine can fall sharply, for example to 0.557 at Layer 22 for MultiNLI under SAE-262k (Chopra, 12 May 2026). The altered sparse features cluster into semantic groups including Structure, Persona, Reasoning, Safety, Code, Multilingual, and Collateral. This implies that PA-SFT-like interventions should not equate preservation of dense hidden cosine with preservation of perceptual semantics; meaningful change may occur in a sparse latent basis even when dense vectors appear almost unchanged.
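The gap between dense and sparse similarity can be reproduced with a hand-built toy. The sketch below is not a trained SAE and uses no data from the study: the two-feature ReLU encoder, its weights and biases, and both activation vectors are invented purely to show how a small dense shift can switch on a sparse feature.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_sae_encode(x: Sequence[float]) -> List[float]:
    """Hand-built ReLU encoder with two features (a stand-in, not a trained SAE)."""
    weights = [(1.0, 0.0), (0.0, 5.0)]  # feature 0 reads dim 0, feature 1 reads dim 1
    biases = [-0.9, -0.5]               # negative biases act as activation thresholds
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


base_activation = [1.0, 0.0]   # invented "base model" activation
tuned_activation = [1.0, 0.2]  # invented "fine-tuned" activation, slightly shifted

dense_cos = cosine(base_activation, tuned_activation)
latent_cos = cosine(toy_sae_encode(base_activation), toy_sae_encode(tuned_activation))
print(f"dense cosine:  {dense_cos:.3f}")
print(f"latent cosine: {latent_cos:.3f}")
```

Here the dense vectors stay nearly parallel (cosine about 0.98) while the sparse codes diverge (cosine about 0.20), because the small shift in the second dimension crosses the threshold of a feature that was silent for the base activation.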

A third account treats internal activations as direct supervision targets. IA2 aligns student hidden activations with those produced by the same base model under in-context learning through an activation-matching objective, and then applies ordinary SFT (Mishra et al., 26 Sep 2025). The paper first shows that ICL and SFT induce different activation patterns, then reports that IA2 as a priming stage can improve both accuracy and calibration; in one reported SST2 setting with Qwen3-4B, SFT-only yields 65.2 accuracy and 0.22 ECE, whereas IA2 followed by SFT yields 90.4 accuracy and 0.06 ECE. This does not concern perception specifically, but it demonstrates that activation-supervised adaptation is practically viable and can steer the optimization trajectory toward a better functional subspace.
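Since calibration is reported here as expected calibration error (ECE), a short sketch of the usual equal-width-bin estimator may help in reading those numbers. The binning scheme, the bin count of 10, and the toy inputs are assumptions of this illustration; the paper's exact evaluation protocol is not reproduced.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Toy inputs: an overconfident model (90% confidence, 50% accuracy) ...
overconfident = expected_calibration_error([0.9] * 4, [True, True, False, False])
# ... and a calibrated one (75% confidence, 75% accuracy).
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(overconfident, calibrated)
```

Lower is better: a value such as 0.06 means that, averaged over bins, stated confidence and observed accuracy differ by about six percentage points.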

5. Formalizing PA-SFT as supervision design

A natural formal scaffold for PA-SFT is the Q-target view of supervised fine-tuning. That framework rewrites token supervision as a target distribution over the vocabulary and trains the model against it, with standard one-hot SFT as a special case (Xie et al., 9 Jun 2026). The paper’s central claim is that SFT is fundamentally target distribution design: how much trust to place in the observed token, and how to allocate the remaining mass over alternatives. A plausible PA-SFT interpretation is that perception-derived signals could set the trust placed in the observed token, define the distribution over alternatives, or both, thereby turning perceptual reliability into token-level target design rather than only scalar loss reweighting. That mapping is interpretive rather than explicit in the source.
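One way to make the target-design idea concrete is a convex mixture of the one-hot target with a distribution over alternatives. The sketch below is only an illustration consistent with the verbal description: the names `trust` and `alternatives`, the mixture form, and all numbers are assumptions of this example and are not the paper's notation or parameterization.

```python
import math
from typing import List, Sequence


def soft_target(observed: int, alternatives: Sequence[float], trust: float) -> List[float]:
    """Mix a one-hot target with a distribution over alternatives.

    q = trust * one_hot(observed) + (1 - trust) * alternatives
    """
    return [
        trust * (1.0 if i == observed else 0.0) + (1.0 - trust) * alt
        for i, alt in enumerate(alternatives)
    ]


def cross_entropy(target: Sequence[float], model_probs: Sequence[float]) -> float:
    """Cross-entropy of the model distribution against a (possibly soft) target."""
    return -sum(q * math.log(p) for q, p in zip(target, model_probs) if q > 0.0)


model_probs = [0.7, 0.2, 0.1]   # toy model distribution over a 3-token vocabulary
alternatives = [0.0, 0.5, 0.5]  # toy allocation of the remaining mass

one_hot_loss = cross_entropy(soft_target(0, alternatives, trust=1.0), model_probs)
soft_loss = cross_entropy(soft_target(0, alternatives, trust=0.8), model_probs)
print(one_hot_loss, soft_loss)
```

With `trust=1.0` the target collapses to one-hot and the loss reduces to the usual negative log-likelihood of the observed token. In a PA-SFT reading, `trust` could be lowered for tokens whose perceptual grounding is doubtful, which is a design hypothesis rather than a reported method.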

PriFT offers a complementary formalization based on stable token weighting from a frozen pretrained reference. Its generic objective weights each token's loss by a signal computed from the frozen pretrained model, with PriFT-prob using pretrained target-token probability and PriFT-mass using a cumulative-mass threshold at 0.5 (Wang et al., 8 Jun 2026). Across math, code, and medical QA, replacing online weighting signals with pretrained ones consistently improves performance among SFT baselines. For PA-SFT, the transferable principle is that model-derived supervision signals should be reference-anchored and stable rather than entangled with the online optimization trajectory.

These two formalisms expose an important general point. PA-SFT need not be restricted to explicit perception tokens such as <observation> spans. It can also be understood as designing which tokens, segments, or latent states deserve stronger trust because they are better grounded in perception, better supported by a pretrained prior, or more aligned with a desired internal computation. The literature does not yet provide a universally accepted PA-SFT objective, but it does provide multiple compatible mathematical templates for one.

6. Limitations, misconceptions, and open problems

A first limitation is terminological. PA-SFT is not a standardized label in the core papers. ViSFT interprets its effect as “unleashing fine-grained knowledge,” but does not define “perception activation” formally (Jiang et al., 2024). The attention-pattern work studies “alternating attention head activation patterns,” yet provides no explicit masking, head-routing, or head-control optimizer (Zhao et al., 2024). VideoP2R makes perception and reasoning explicit in the output format, but does not provide a separate supervised loss decomposition for the two stages in the SFT phase (Jiang et al., 14 Nov 2025). A common misconception is therefore to treat PA-SFT as if it were already a single settled method. The literature supports it more as a family resemblance than as a canonical algorithm.

A second limitation is the presence of real trade-offs. Process-aware supervision can fail if the perception stage is too compressed: VideoP2R notes that tasks such as VSI-Bench can require long, fine-grained descriptions, and that concise observation regimes may drop critical details (Jiang et al., 14 Nov 2025). Conversely, overemphasizing perception in the loss can weaken reasoning under a fixed budget: the asymmetric-optimization study explicitly reports a perception–reasoning trade-off, with end-to-end accuracy peaking at intermediate weighting rather than maximal perception weight (Wu et al., 28 May 2026). Architecture also matters. In PiSAR, the same managed LoRA recipe yields sem_sim 0.783 on Qwen3-VL-8B-Instruct but only 0.441 on Gemma-4-26B-A4B-IT, suggesting that strong reasoning-tuned priors may resist displacement by ordinary supervised adaptation (Bissa et al., 28 May 2026).

A third limitation is mechanistic uncertainty. SAE-based diagnostics show that semantically meaningful latent change can be large even when dense activations remain almost unchanged, but the method is still post hoc and depends on the faithfulness of the frozen SAE basis (Chopra, 12 May 2026). IA2 demonstrates that hidden-state alignment can help, but it uses ICL-derived activations rather than perceptual activations specifically (Mishra et al., 26 Sep 2025). ViPER shows that reconstruction-driven self-bootstrapping can improve fine-grained perception, but its main training loop is RL rather than SFT (Zhang et al., 28 Oct 2025). This suggests that future PA-SFT research will likely be hybrid: layer-aware, activation-aware, and reconstruction-aware, but also careful about architecture sensitivity, sufficiency of perceptual supervision, and preservation of general capabilities.

Taken together, the literature supports a restrained conclusion. PA-SFT is best treated as a technically meaningful synthesis: supervised or post-pretraining adaptation that explicitly reallocates optimization pressure toward perception, externalizes perceptual states before reasoning, or aligns internal activations and targets so that latent perceptual competence becomes more usable. What remains open is not whether such effects exist, but how to formalize them into stable, architecture-robust, causally grounded training procedures.