Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

Published 12 May 2026 in cs.CV, cs.AI, and cs.LG | (2605.12374v2)

Abstract: Visual latent reasoning lets a multimodal LLM (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume~\citep{xie2025mhc,li2026siamesenorm,team2026attention}. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose \textbf{GAP}, a \textbf{G}ranular \textbf{A}lignment \textbf{P}aradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.

Abstract PDF Upgrade to Chat

Authors (11)

Summary

The paper introduces GAP, a new alignment strategy to stabilize visual latent reasoning in MLLMs by reconciling feature-space mismatches.
It leverages a three-level approach—data, feature, and model-level alignment—with a lightweight PCA-aligned latent head to improve both interpretability and accuracy.
Experimental results show notable gains in perception and reasoning metrics, confirming that controlled latent feedback enhances multimodal reasoning performance.

Granular Alignment Paradigm for Visual Latent Reasoning in MLLMs

Motivation and Feature-Space Mismatch Analysis

This paper introduces the Granular Alignment Paradigm (GAP) for visual latent reasoning in multimodal LLMs (MLLMs), focusing on the empirical instability and lack of transferability observed in prior latent-token generation schemes. Recent approaches have attempted to interleave visual and linguistic evidence within the token generation pipeline by creating continuous visual latent tokens, but these outputs are often fed back as subsequent latent inputs, creating a fundamental feature-space mismatch and optimization instability. In pre-norm transformer architectures, decoder hidden states exhibit substantially larger norms than the input vision embeddings, as evidenced by profiling Monet-7B, which shows text hidden states reaching $546\times$ the norm of input text embeddings and vision hidden states $8.7\times$ the norm of vision inputs.

Figure 1: Layer-wise hidden-state norm growth demonstrates sharp norm escalation across model depth, indicating feature-space mismatch between output hidden states and input embeddings.

Direct latent feedback via output hidden states disrupts the input calibration, destabilizing the autoregressive feedback loop. The authors demonstrate that simple training-free Exponential Moving Average (EMA) norm matching at inference improves accuracy by $+2.00$ on MathVista, showing the feature-level compatibility as a practical factor for stable latent reuse. However, norm rescaling alone addresses only the symptom, motivating their more holistic alignment strategy.

Three-Level Alignment Design: GAP Methodology

GAP proposes a multi-level alignment approach—data-level, feature-level, and model-level—to reconcile the generated latents with the model input distribution and improve interpretability and training stability.

Figure 2: Conceptual overview of GAP, illustrating context-grounded latent supervision, PCA-aligned latent head, and difficulty-aware supervision as integrated alignment mechanisms.

Data-Level Alignment

Inspectability is achieved by pairing each training query with an auxiliary image whose frozen-ViT embeddings serve as latent supervision targets, without being visible to the model at inference. Responses interleave text reasoning, <latent> visual latent tokens, <parser> textual descriptions, and continued textual reasoning. The curated dataset contains 49,309 examples emphasizing visual chain-of-thought (CoT), chart analysis, multimodal math, geometry, visual search, and counting.

Figure 3: Composition of the 49K curated training set, highlighting Visual CoT, chart, GEOQA, and multimodal math as primary sources for latent-target supervision.

Feature-Level Alignment

Feature-level alignment is realized by a lightweight PCA-aligned latent head, which maps backbone-normalized decoder states into the principal subspace of vision embeddings. Instead of predicting full-dimensional embeddings, the head predicts low-rank PCA coefficients that are reconstructed into the empirical vision input space, returning generated latents to the input-compatible regime before autoregressive feedback. This capacity control both regularizes the latent head and enhances compatibility with the vision encoder's empirical distribution.

Figure 4: Interleaved visual latent inference and latent-token generation, depicting the autoregressive feedback loop and PCA-aligned latent head reconstruction pathway.

Model-Level Alignment

Difficulty-aware latent supervision minimizes unnecessary latent-target assignment by empirically estimating base-model accuracy per training example and assigning latent supervision only when the base model underperforms. This selective supervision reduces noise and prevents degradation on instances already solved by the base model, leading to improved training stability.

Experimental Results and Ablation Studies

The GAP model, trained on Qwen2.5-VL 7B, is evaluated across HRBench4K, MMStar, MME-RealWorld-Lite, MathVista, and WeMath. GAP achieves the best aggregate perception and reasoning scores among supervised variants, with notable improvements:

Perception aggregate (Avg-P): GAP achieves $61.32$, outperforming prior latent baselines (Monet-7B: $59.58$, LVR: $60.75$).
Reasoning aggregate (Avg-R): GAP achieves $53.97$, higher than Monet-7B ($47.99$) and LVR ($47.66$).

Difficulty-aware latent assignment, PCA-aligned latent head, and curated training data each contribute to the numerical gains. Ablations indicate that PCA-based capacity control is advantageous; retaining 95% variance ( $k=629$ components) results in a relative reconstruction error of $8.7\times$ 0 and the highest three-benchmark average (Avg-3: $8.7\times$ 1). Zero-latent checkpoint and noise-infused latent interventions demonstrate that clean latent generation provides task-relevant signal at inference, improving performance from $8.7\times$ 2 to $8.7\times$ 3 in Avg-2 (HRBench4K and MathVista).

Token-budget analysis reveals performance saturation—36 tokens yield optimal results, while excessive tokens can degrade performance without corresponding accuracy gains.

Figure 5: Effect of latent-token budget, showing performance across token count sweeps—visual latent capacity benefits performance but exhibits saturation and over-capacity effects.

Practical and Theoretical Implications

GAP demonstrates that robust visual latent reasoning requires geometric alignment in the latent-token feedback loop, inspection capability in supervision, and selective training signal. This paradigm enables MLLMs to generate interpretable, contentful visual latents within a single model architecture, obviating the need for external tools or vision generator calls that complicate inference and system design. By constraining the latent vector space to the empirical vision-embedding subspace, GAP reduces error propagation due to distribution mismatch and enhances generalization within visual reasoning benchmarks.

The release of a high-quality multimodal latent-supervision dataset further provides resources for future exploration in visual reasoning and latent feedback calibration. Theoretical implications extend to understanding norm accumulation in residual streams, subspace parameterization for capacity control, and the necessity of aligning latent feedback with the input geometry of deep transformer models.

Future Directions

The GAP methodology prompts several future research avenues, including the exploration of random-basis and matched-parameter low-rank alternatives, parser-latent faithfulness verification, adaptation to post-norm architectures, and reinforcement learning-based latent training. Additional work is necessary to confirm universality across diverse MLLM backbones and to address off-distribution hallucination, as evidenced by POPE OOD evaluations. Further scaling and automated latent-token budgeting may advance efficiency/accuracy trade-offs.

Conclusion

GAP systematically addresses feature-space incompatibility and supervision noisiness in visual latent modeling for MLLMs, attaining superior perception and reasoning performance by integrating PCA-aligned latent head, context-grounded supervision, and difficulty-aware assignment. The proposed alignment paradigm validates that content-bearing, input-compatible latent feedback is critical for stable and interpretable visual reasoning, setting robust foundations for future advances in multimodal cognitive architectures (2605.12374).

Markdown Report Issue