Visual Expression SFT Stage
- Visual Expression SFT is a supervised fine-tuning stage that equips multimodal models to map complex visual signals to structured language tokens, image tokens, or latent embeddings.
- It employs curated, high-diversity datasets paired with selective layer fine-tuning and LoRA adapters to enhance image synthesis, region detection, and visual reasoning.
- Empirical studies show marked improvements in text-to-image quality, region-level detection accuracy, and chain-of-thought stability, while highlighting challenges with overfitting.
The Visual Expression SFT (Supervised Fine-Tuning) Stage is a supervised, instruction-following adaptation procedure that equips vision-language, generative, and multimodal models to robustly map between complex visual signals and structured language tokens, high-fidelity image tokens, or latent visual embeddings. The stage is a critical component of contemporary multimodal training pipelines and is variously termed “visual instruction tuning,” “visual-expression SFT,” or “region-aware SFT” in the literature. Its defining goal is to expose models already pretrained on image-text (or vision-language) corpora to dense, curated, task-driven supervision, thereby establishing precise, aligned, and expressive visual outputs or predictions under naturalistic or synthetic instructions.
1. Conceptual Framework and Objectives
Visual Expression SFT aims to transfer and solidify visual-linguistic capabilities for reasoning, generation, or fine-grained understanding. In unified models such as VARGPT-v1.1, it bridges the gap between broad pretraining and reinforcement learning–based preference alignment by training the model, under direct supervision, to emit image tokens (for generative models), visual region signals (for grounded reasoning models), or continuous visual-latent embeddings (for abstract visual reasoning) given free-form prompts, reasoning questions, or contextual instructions (Zhuang et al., 3 Apr 2025, Wang et al., 26 Nov 2025, Wang et al., 13 Jun 2025).
Typical objectives for this stage include:
- Learning to synthesize images per open-ended instructions (text-to-image SFT).
- Acquiring the ability to interleave and fuse visual and linguistic tokens in an autoregressive or blockwise sequence (a minimal interleaving sketch follows this list).
- Predicting explicit region- or bounding-box signals for fine-grained multimodal reasoning or replay.
- Compressing perceptual content into continuous latent blocks to enable “thinking in images” within CoT sequences.
- Mapping or enhancing sub-region attention or region-level discrimination in vision backbones.
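The interleaving and masking pattern implied by these objectives can be illustrated with a small sketch. The snippet below builds a mixed text/image token stream together with a loss mask that supervises only the visual segment; the special-token IDs and the image-token offset are hypothetical placeholders, not the tokenization of any cited model.

```python
# Illustrative layout of an interleaved visual-linguistic SFT sequence.
# The BOI/EOI markers and IMG_BASE offset are hypothetical IDs for this sketch.

BOI, EOI = 151652, 151653      # hypothetical begin-/end-of-image markers
IMG_BASE = 160000              # hypothetical offset of the discrete image-token vocabulary

def build_interleaved_sequence(prompt_ids, image_token_ids):
    """Concatenate a text prompt and an image-token block into one autoregressive stream.

    Returns (input_ids, loss_mask), where loss_mask is 1 only at the positions the
    SFT loss should supervise (here: the image tokens and the closing EOI marker).
    """
    input_ids = list(prompt_ids) + [BOI] + [IMG_BASE + t for t in image_token_ids] + [EOI]
    loss_mask = [0] * (len(prompt_ids) + 1) + [1] * len(image_token_ids) + [1]
    return input_ids, loss_mask

# Example: a three-token prompt followed by a four-token "image".
ids, mask = build_interleaved_sequence(prompt_ids=[11, 42, 7], image_token_ids=[5, 9, 2, 31])
assert len(ids) == len(mask) == 9
```

Variants differ in which segment receives supervision (image tokens for generation; response, region, or latent tokens for reasoning), but the sequence-plus-mask construction is the same.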
2. Training Data Construction and Structure
Data construction for Visual Expression SFT is typically set apart by size, diversity, and the explicit alignment of image–language or image–region–language pairs. For generative and autoregressive models, datasets often contain tens of millions of instruction–image pairs spanning both real and synthetic sources; for reasoning and grounding models, datasets are typically smaller but more structurally rich, including explicit region-level annotations, auxiliary-drawing tasks, or interleaved chain-of-thought with latent visual components.
Example dataset characteristics:
| Model/Framework | Supervised Dataset Size | Core Data Properties |
|---|---|---|
| VARGPT-v1.1 | 8.3M (4.2M real, 4.1M synthetic) | Free-form prompts; diverse text-image synthesis tasks |
| Monet-7B | 125K | Visual/text CoT; mix of real, chart, geometry images |
| VGR | 158.1K | Vision-grounded reasoning; region-bboxes; CoT traces |
| UAV-VL-R1 | 19,187 | Aerial VQA; 8 task types; high-res, filtered instances |
In all cases, data undergoes task-specific preprocessing such as cropping, resizing, region-mapping, or high-quality automated recaptioning (Zhuang et al., 3 Apr 2025, Wang et al., 26 Nov 2025, Guan et al., 15 Aug 2025).
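The structure of such records can be sketched schematically as below. The field names are illustrative rather than the schema of any cited dataset; the optional fields cover the generative, region-level, and latent chain-of-thought variants described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VisualSFTRecord:
    """Illustrative schema for a single Visual Expression SFT example (field names are hypothetical)."""
    instruction: str                               # free-form prompt or question
    image_path: str                                # input image (real or synthetic)
    target_text: Optional[str] = None              # answer, caption, or CoT trace
    target_image_path: Optional[str] = None        # target image for text-to-image SFT
    region_bboxes: List[List[float]] = field(default_factory=list)  # [x1, y1, x2, y2] boxes for grounding
    latent_block_ids: List[int] = field(default_factory=list)       # placeholders for continuous visual latents

record = VisualSFTRecord(
    instruction="Which instrument is the person on the left holding?",
    image_path="images/000123.jpg",
    target_text="<think> Inspect the left region of the image. </think> A violin.",
    region_bboxes=[[34.0, 80.0, 210.0, 420.0]],
)
```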
3. Model Architecture, Parameter Updates, and Adaptation Recipes
The canonical Visual Expression SFT setup employs a partially-frozen model architecture, unlocking only those modules responsible for visual decoding, region prediction, or cross-modal adapters. Notable examples include:
- VARGPT-v1.1: Only the 32-layer visual decoder and bidirectional generation projectors are unfrozen; the language backbone (Qwen2-7B) remains frozen (Zhuang et al., 3 Apr 2025).
- UAV-VL-R1: LoRA adapters are inserted into the text decoder (rank 32, LoRA scaling α), in addition to full visual-encoder fine-tuning; all other backbone weights remain frozen (Guan et al., 15 Aug 2025). A generic sketch of this freezing-plus-LoRA recipe appears below.
- SimpleAR: No architectural changes from pretraining; the full decoder is fine-tuned on the SFT dataset (Wang et al., 15 Apr 2025).
- Monet: Only the “latent block” embeddings in the student model are trainable; a custom transformer attention mask enforces the correct flow of information from image to latent embedding to text token (Wang et al., 26 Nov 2025).
- ViSFT: LoRA adapters are added to the Q/V projections of a frozen vision transformer and jointly trained with COCO instance-segmentation, detection, and image-captioning heads (Jiang et al., 18 Jan 2024).
Certain SFT stages incorporate dedicated region-detection heads (e.g., a 2-layer MLP predicting bounding box coordinates in VGR (Wang et al., 13 Jun 2025)) or latent-alignment mechanisms as part of visual reasoning.
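A minimal sketch of the partial-freezing-plus-adapter recipe, assuming a Hugging Face-style model and the PEFT library, is shown below; the module names (`visual_decoder`, `gen_projector`) and the LoRA scaling value are placeholders, not the configuration of any cited model.

```python
# Minimal partial-freezing + LoRA sketch. Assumes a Hugging Face-style model and
# the `peft` library; module names ("visual_decoder", "gen_projector") and the
# LoRA scaling value are placeholders, not any cited model's actual configuration.
from peft import LoraConfig, get_peft_model

def prepare_for_visual_sft(model, lora_rank: int = 32, lora_alpha: int = 64):
    # 1) Inject LoRA adapters into the attention Q/V projections of the text decoder.
    #    PEFT freezes all non-LoRA parameters by default.
    lora_cfg = LoraConfig(
        r=lora_rank,
        lora_alpha=lora_alpha,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
    )
    model = get_peft_model(model, lora_cfg)

    # 2) Re-enable full fine-tuning for the visual-decoding / projection modules only.
    for name, param in model.named_parameters():
        if "visual_decoder" in name or "gen_projector" in name:
            param.requires_grad = True

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable / 1e6:.1f}M")
    return model
```

Depending on the recipe, the adapter step can be dropped entirely (full-decoder SFT as in SimpleAR) or the unfreezing restricted to dedicated heads such as a bounding-box MLP.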
4. Loss Functions, Optimization, and Training Schedules
Visual Expression SFT universally centers on a supervised, maximum-likelihood or cross-entropy objective applied to tokenized visual, language, or mixed streams. In advanced pipelines, auxiliary objectives are introduced:
- Next-token prediction: Standard in autoregressive generation; the loss applies only to image or visual token positions (see the sketch after this list).
- Observation-token alignment (Monet): Cosine embedding loss aligns student and teacher hidden states at key tokens, backpropagating exclusively through visual-latent blocks.
- Detection loss (VGR): ℓ₁ + GIoU loss for precise box prediction, combined with cross-entropy over text and special tokens.
- Task-weighted joint objectives: In region-level SFT (ViSFT), task sampling distributes gradient updates across detection, segmentation, and captioning at ratios reflecting dataset priors.
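These objectives can be written compactly as follows; tensor shapes, the box-term weighting, and the use of torchvision's `generalized_box_iou_loss` are assumptions of this sketch rather than the exact formulations in the cited papers.

```python
# Sketch of the SFT objectives above: next-token cross-entropy masked to supervised
# positions, an l1 + GIoU box term, and a cosine alignment term. Shapes and weights
# are illustrative assumptions, not the exact formulations of the cited papers.
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def masked_next_token_loss(logits, labels, loss_mask):
    """logits: [B, T, V]; labels, loss_mask: [B, T]; mask is 1 at supervised positions."""
    # Shift so position t predicts token t+1, as in standard causal LM training.
    logits, labels, mask = logits[:, :-1], labels[:, 1:], loss_mask[:, 1:].float()
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1), reduction="none")
    return (ce.reshape(labels.shape) * mask).sum() / mask.sum().clamp(min=1.0)

def box_regression_loss(pred_boxes, gt_boxes, giou_weight: float = 1.0):
    """pred_boxes, gt_boxes: [N, 4] in (x1, y1, x2, y2) format."""
    return F.l1_loss(pred_boxes, gt_boxes) + giou_weight * generalized_box_iou_loss(
        pred_boxes, gt_boxes, reduction="mean"
    )

def observation_alignment_loss(student_h, teacher_h):
    """Cosine-embedding-style alignment between student and teacher hidden states at key tokens."""
    return (1.0 - F.cosine_similarity(student_h, teacher_h, dim=-1)).mean()
```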
Typical hyperparameters include low learning rates; batch sizes of 256–1024 for generation or 1–4 for reasoning; AdamW optimizers; and schedules ranging from a fixed budget of fine-tuning steps to a single epoch for compact datasets (Zhuang et al., 3 Apr 2025, Wang et al., 26 Nov 2025).
5. Empirical Impact and Evaluative Findings
The primary function of the Visual Expression SFT stage is to install high-fidelity, prompt-aligned image synthesis or fine-grained visual grounding into a model while stabilizing its chain-of-thought or output format for downstream RL. Empirical studies provide several findings:
- VARGPT-v1.1: Text-to-image performance on GenEval improves from 0.47 (predecessor) to 0.53 after SFT (+RL); DPG-Bench improves from 74.65 to 78.59; SFT alone yields “high-fidelity images that match prompts,” with further DPO refinement increasing preference scores (Zhuang et al., 3 Apr 2025).
- SimpleAR: SFT at high resolution (1024×1024) alone raises GenEval overall to 0.53, compared to <0.50 after pretraining only (512 resolution); RL yields only modest further improvements (Wang et al., 15 Apr 2025).
- VGR: SFT-initialized models achieve +4.5–9.0 point gains on complex VQA benchmarks over LLaVA-NeXT while using only ~30% of the image tokens (Wang et al., 13 Jun 2025).
- Monet: Strict latent-only backpropagation combined with dual attention and alignment supervision yields a V* score of 82.20% (ablating either component reduces it to 46–75%) and enables strong OOD reasoning (Wang et al., 26 Nov 2025).
- ViSFT: Fine-grained region-level SFT improves OCR word accuracy (+2.5%), adversarial classification (+0.3% on ImageNet-A), and cross-modal retrieval/grounding (Jiang et al., 18 Jan 2024).
6. Limitations, Overfitting Risks, and Interplay with RL
Quantitative and qualitative analyses reveal several limitations inherent to Visual Expression SFT:
- Memorization/Overfitting: SFT consistently increases in-distribution performance but often degrades generalization to out-of-domain images or tasks, with out-of-distribution scores dropping even below those of untuned models (Chu et al., 28 Jan 2025, Chen et al., 10 Apr 2025). Notably, scaling SFT compute exacerbates overfitting.
- Pseudo-Reasoning and Lock-in: In structured reasoning SFT, models imitate “surface grammar” of expert traces, including spurious reasoning steps and hesitations, locking into inflexible modes that impede RL-driven exploration and solution diversity (Chen et al., 10 Apr 2025).
- Weak Visual Features: Vision encoders trained or updated via SFT show diffuse attention, imprecise feature localization, and poorer mask boundary accuracy compared to RL-tuned counterparts (Song et al., 18 Oct 2025).
- Necessity for RL: SFT is essential for stabilizing output format and enabling RL verifiers/evaluators to identify correct chains, but cannot itself induce robust, outcome-aligned behavior without subsequent RL. RL or DPO stages (e.g., PIVOT) substantially improve vision-centric accuracy and generalization (Chu et al., 28 Jan 2025, Song et al., 18 Oct 2025).
7. Design Variants and Best Practices
Recent empirical studies support a series of design recommendations:
- For text-to-image models, employ large, aesthetically filtered, and diversified instruction–image pairs drawn from both real and synthetic sources, and run SFT at multiple resolutions, including high resolution (Zhuang et al., 3 Apr 2025, Wang et al., 15 Apr 2025).
- For task-specific region SFT, leverage region-level task heads and LoRA-style adapters to inject fine-grained spatial awareness without catastrophic forgetting (Jiang et al., 18 Jan 2024).
- In visual reasoning, enforce explicit region token supervision, or, for latent reasoning, dual alignment between teacher and student representations with custom attention flows (Wang et al., 26 Nov 2025, Wang et al., 13 Jun 2025).
- Avoid overfitting by carefully filtering SFT data (e.g., removing “aha” moments in CoT; a minimal filtering sketch follows this list), limiting SFT compute, and initiating RL only after the SFT output format has stabilized (Chu et al., 28 Jan 2025, Chen et al., 10 Apr 2025).
- When optimizing for downstream RL, SFT should focus on output consistency and format rather than maximizing in-distribution accuracy, preserving capacity for subsequent RL-driven generalization.
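As an illustration of the filtering recommendation above, a minimal heuristic filter is sketched below; the reflection markers and length cap are assumptions of this sketch, not the exact rules applied in the cited works.

```python
# Illustrative heuristic for filtering CoT data before SFT. The marker list and
# length cap are assumptions of this sketch, not rules from the cited papers.
REFLECTION_MARKERS = ("wait,", "aha", "let me double-check", "on second thought")

def keep_for_sft(example: dict, max_cot_tokens: int = 1024) -> bool:
    """Return True if a CoT example should be kept for the SFT stage."""
    cot = example.get("target_text", "").lower()
    if any(marker in cot for marker in REFLECTION_MARKERS):
        return False                      # drop imitation-prone "aha"/hesitation traces
    if len(cot.split()) > max_cot_tokens:
        return False                      # cap trace length to limit memorization
    return True

sft_examples = [
    {"target_text": "The region on the left contains a violin."},
    {"target_text": "Wait, let me double-check the chart axis before answering."},
]
filtered = [ex for ex in sft_examples if keep_for_sft(ex)]   # keeps only the first example
```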
References
- VARGPT-v1.1: "VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning" (Zhuang et al., 3 Apr 2025)
- Monet: "Monet: Reasoning in Latent Visual Space Beyond Images and Language" (Wang et al., 26 Nov 2025)
- UAV-VL-R1: "UAV-VL-R1: Generalizing Vision-LLMs via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning" (Guan et al., 15 Aug 2025)
- SimpleAR: "SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL" (Wang et al., 15 Apr 2025)
- ViSFT: "Supervised Fine-tuning in turn Improves Visual Foundation Models" (Jiang et al., 18 Jan 2024)
- VGR: "VGR: Visual Grounded Reasoning" (Wang et al., 13 Jun 2025)
- VLAA-Thinking: "SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-LLMs" (Chen et al., 10 Apr 2025)
- RL vs. SFT: "RL makes MLLMs see better than SFT" (Song et al., 18 Oct 2025)
- SFT Memorizes, RL Generalizes: "SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training" (Chu et al., 28 Jan 2025)