Separate QKV Projections for Vision
- Separate QKV projections are dedicated linear transformations that independently process visual tokens in Transformer-based models.
- They enhance visual grounding and cross-modal alignment by tailoring query, key, and value representations to image-specific features.
- Empirical studies show that methods like MHBC and loss-based supervision yield improved accuracy and efficiency in vision-language tasks.
Separate QKV (Query, Key, Value) projections for the vision modality refer to architectural or experimental designs in Transformer-based models where distinct linear transformations generate the Q, K, or V representations specifically for visual tokens. Such separation can be leveraged for improved cross-modal alignment and computational efficiency in multimodal and pure-vision architectures. Contemporary research examines both the benefits and the complexity of separating (or intertwining) the processing of visual and textual tokens at the attention-mechanism level, with the goals of stronger visual grounding, robustness, and efficiency.
1. Background and Standard Multimodal Attention
In standard multimodal Transformer architectures, embeddings for visual and text tokens are concatenated into a single sequence, subsequently processed through multi-head self-attention layers. This pipeline, as described in "Unveiling Visual Perception in LLMs: An Attention Head Analysis Approach," operates with the canonical formulas

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \qquad \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,$$

where $X$ includes both text tokens and visual tokens after fusion/adaptation, and each head applies layer-specific projections $W_Q$, $W_K$, $W_V$ to the whole sequence. The paper explicitly confirms that standard practice is to process all tokens with a shared set of projection weights after the fusion/adaptation stage, unless otherwise specified (Bi et al., 24 Dec 2024).
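A minimal PyTorch sketch of this shared-projection baseline is given below; module and tensor names are illustrative rather than taken from any cited implementation.

```python
import torch
import torch.nn as nn


class SharedQKVSelfAttention(nn.Module):
    """Standard multi-head self-attention: one Q/K/V projection for all tokens."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)  # shared across modalities
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), visual and text tokens already concatenated
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = ((q @ k.transpose(-2, -1)) / self.head_dim ** 0.5).softmax(dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, n, d))


# Usage: concatenate adapted visual tokens with text embeddings, then attend jointly.
visual_tokens = torch.randn(2, 256, 768)  # e.g., output of a vision adapter
text_tokens = torch.randn(2, 32, 768)     # text embeddings
fused = torch.cat([visual_tokens, text_tokens], dim=1)
out = SharedQKVSelfAttention(dim=768, num_heads=12)(fused)
```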
2. Motivations for Separate Vision QKV Projections
Several motivations have driven research into separate QKV projections for visual tokens:
- Alignment across modalities: Visual and textual tokens may have fundamentally different statistical structure or information content. Applying dedicated projections can tailor the representational space of Q, K, and V to the distinct characteristics of images versus text (a minimal sketch of such a per-modality split follows this list).
- Enhancing visual grounding: As demonstrated in "Direct Visual Grounding by Directing Attention of Visual Tokens," undifferentiated attention in late LLM layers can lead to answer tokens attending minimally to visual tokens, impairing grounding and leading to hallucinations. The introduction of loss terms or modifications that emphasize visual token attention often implies, or can benefit from, specialized processing (Esmaeilkhani et al., 16 Nov 2025).
- Computational efficiency: Vision transformers with high-resolution inputs encounter prohibitive quadratic scaling with unified token processing. Designs such as LookupViT, with a multi-head bidirectional cross-attention (MHBC) module, explicitly separate "lookup" (high-res) and "compressed" (low-res, often vision) token streams and process these with separate attention flows (Koner et al., 17 Jul 2024).
- Controlling positional bias: In video-LLMs (e.g., Vista-LLaMA) the use of separate distance handling (e.g., omitting relative position encoding between text and visual tokens) functionally acts as a separation of modalities at the attention parameterization level (Ma et al., 2023).
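One way such a separation could be instantiated, with distinct Q/K/V projection matrices for visual versus text tokens inside a single joint attention layer, is sketched below. This is an illustrative design under assumed names; it is not the exact parameterization of any model cited above.

```python
import torch
import torch.nn as nn


class ModalitySplitQKVAttention(nn.Module):
    """Joint self-attention with separate Q/K/V projections per modality
    (illustrative; most cited models share projections across modalities)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv_vis = nn.Linear(dim, 3 * dim)  # dedicated to visual tokens
        self.qkv_txt = nn.Linear(dim, 3 * dim)  # dedicated to text tokens
        self.out = nn.Linear(dim, dim)

    def _heads(self, t: torch.Tensor) -> torch.Tensor:
        b = t.shape[0]
        return t.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Project each modality with its own weights, then attend jointly.
        qv, kv, vv = (self._heads(t) for t in self.qkv_vis(vis).chunk(3, dim=-1))
        qt, kt, vt = (self._heads(t) for t in self.qkv_txt(txt).chunk(3, dim=-1))
        q, k, v = (torch.cat(p, dim=2) for p in ((qv, qt), (kv, kt), (vv, vt)))
        attn = ((q @ k.transpose(-2, -1)) / self.head_dim ** 0.5).softmax(dim=-1)
        b, n = vis.shape[0], vis.shape[1] + txt.shape[1]
        return self.out((attn @ v).transpose(1, 2).reshape(b, n, -1))
```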
3. Implementation Strategies in Recent Literature
Unimodal vs. Multimodal Attention Layer Designs
Most standard LLM-based MLLMs apply the same QKV projections to all tokens after linear fusion or adaptation. However, variants exist:
- LookupViT MHBC design: High-resolution visual (lookup) tokens and low-resolution compressed tokens are projected independently before two cross-attention steps: information gathering from lookup to compressed tokens (each stream using its own $W_Q$, $W_K$, $W_V$), and global context infusion from compressed back to lookup tokens (Koner et al., 17 Jul 2024). A simplified sketch of this bidirectional flow follows this list.
- Vista-LLaMA’s EDVT attention: Applies rotary positional encodings (RoPE) only to text-text pairs; for all queries involving visual keys, projections are performed without positional components, effecting a form of projection specialization. Thus, while not distinct matrices per modality, there is differential functional handling (Ma et al., 2023).
- Proposed attention supervision: "Direct Visual Grounding by Directing Attention of Visual Tokens" reports significant gains not by architectural separation, but by supervision that encourages stronger cross-modal attention, showing gains even when projections are shared (Esmaeilkhani et al., 16 Nov 2025). This suggests that explicit architectural separation is not strictly necessary for grounded attention, but may still be advantageous.
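Below is the simplified sketch of LookupViT-style bidirectional exchange referenced above: compressed tokens first gather information from high-resolution lookup tokens, then lookup tokens receive global context back, with each direction using its own projections. The real MHBC module includes additional components (normalization, MLPs, attention reuse) omitted here; names are illustrative.

```python
import torch
import torch.nn as nn


def cross_attention(q_tok, kv_tok, wq, wk, wv, num_heads):
    """One cross-attention step with stream-specific projections."""
    b, nq, d = q_tok.shape
    hd = d // num_heads

    def heads(t):
        return t.view(b, -1, num_heads, hd).transpose(1, 2)

    q, k, v = heads(wq(q_tok)), heads(wk(kv_tok)), heads(wv(kv_tok))
    attn = ((q @ k.transpose(-2, -1)) / hd ** 0.5).softmax(dim=-1)
    return (attn @ v).transpose(1, 2).reshape(b, nq, d)


class BidirectionalCrossAttention(nn.Module):
    """LookupViT-style exchange between lookup and compressed token streams
    (greatly simplified)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        # Separate projections for each direction / token stream.
        self.q_c, self.k_l, self.v_l = (nn.Linear(dim, dim) for _ in range(3))
        self.q_l, self.k_c, self.v_c = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, lookup: torch.Tensor, compressed: torch.Tensor):
        # 1) Information gathering: compressed tokens attend to high-res lookup tokens.
        compressed = compressed + cross_attention(
            compressed, lookup, self.q_c, self.k_l, self.v_l, self.num_heads)
        # 2) Global context infusion: lookup tokens attend back to compressed tokens.
        lookup = lookup + cross_attention(
            lookup, compressed, self.q_l, self.k_c, self.v_c, self.num_heads)
        return lookup, compressed
```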
Separate Parameter Adaptation
Some vision-LLMs employ per-modality adapters or fusion blocks (e.g., Q-Former, MLP adapters) ahead of the attention stack. For instance, Q-Former cross-attends a set of learnable queries to patch tokens; when combined with attention heads that functionally specialize (as observed in Bi et al., 24 Dec 2024), this can act as an implicit separation in QKV projection behavior.
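A minimal sketch of the learnable-query idea behind such adapters is shown below. It is heavily simplified relative to the actual Q-Former, which interleaves self-attention, cross-attention, and text interaction; names are illustrative.

```python
import torch
import torch.nn as nn


class LearnableQueryAdapter(nn.Module):
    """Q-Former-like adapter (simplified): a fixed number of learnable queries
    cross-attend to image patch tokens, producing a compact visual prefix."""

    def __init__(self, dim: int, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # maps into the LLM embedding space

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim) from a frozen vision encoder
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        out, _ = self.cross_attn(q, patch_tokens, patch_tokens)
        return self.proj(out)  # (batch, num_queries, dim) visual tokens for the LLM
```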
4. Empirical Outcomes and Analyses
The empirical literature shows both the promise and limitations of separate or specialized QKV processing for vision tokens:
| Model/Paper | Mechanism | Vision QKV Separation | Key Results |
|---|---|---|---|
| LookupViT (Koner et al., 17 Jul 2024) | MHBC module | Yes (blockwise) | 2-3× FLOPs reduction at iso-accuracy; ablating bidirectional cross-attn drops accuracy significantly |
| Vista-LLaMA (Ma et al., 2023) | EDVT Attention | Functional (distance) | Maintains visual influence over long text, reduces hallucination, +6 pp VideoQA accuracy |
| Direct Visual Grounding (Esmaeilkhani et al., 16 Nov 2025) | KL Attention Loss | Shared QKV, extra loss | +8–15% gains on geometric/pointing; “true two-way bridge” in grounded attention |
A plausible implication is that, while explicit architectural QKV separation for visual tokens can yield computational or modeling benefits, the main axis of performance improvement comes from either bidirectional attention pathways (as in LookupViT), loss-based supervision (as in KLAL), or modulation of attention mechanics (as in EDVT).
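As a rough illustration of what loss-based attention supervision in the spirit of KLAL can look like, the sketch below adds a KL term that pulls the attention answer tokens place on visual tokens toward a target distribution (e.g., derived from annotated regions). The function name, indexing scheme, and target construction are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def visual_attention_kl_loss(attn_weights: torch.Tensor,
                             target_dist: torch.Tensor,
                             answer_idx: torch.Tensor,
                             visual_idx: torch.Tensor) -> torch.Tensor:
    """KL divergence between the attention answer tokens place on visual tokens
    and a target distribution over visual tokens (illustrative sketch).

    attn_weights: (batch, heads, seq, seq), softmax-normalized, from one layer.
    target_dist:  (batch, |vis|), sums to 1 per sample (e.g., from annotated regions).
    answer_idx:   1-D LongTensor, positions of answer tokens in the sequence.
    visual_idx:   1-D LongTensor, positions of visual tokens in the sequence.
    """
    # Attention mass from answer tokens to visual tokens, averaged over heads/answers.
    a2v = attn_weights[:, :, answer_idx][:, :, :, visual_idx]  # (b, h, |ans|, |vis|)
    a2v = a2v.mean(dim=(1, 2))                                 # (b, |vis|)
    a2v = a2v / a2v.sum(dim=-1, keepdim=True).clamp_min(1e-8)  # renormalize
    return F.kl_div(a2v.clamp_min(1e-8).log(), target_dist, reduction="batchmean")


# Added to the usual language-modeling objective:
#   total_loss = lm_loss + lambda_attn * visual_attention_kl_loss(...)
```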
5. Analysis of Specialized Visual Heads and Their Roles
Detailed head-level analysis ("Unveiling Visual Perception in LLMs") identifies a small subset of attention heads that specialize in processing visual patches. Metrics such as the total attention weight placed on visual tokens and the concentration of that attention correlate strongly with visual reasoning benchmark accuracy. These "visual heads" often reside in early-to-middle layers and, when masked, cause disproportionate drops in visual performance, highlighting the functional modularity that can coexist even within shared-projection architectures (Bi et al., 24 Dec 2024). This suggests that head-level specialization can arise without explicit per-modality QKV, but it may benefit from, or be enabled by, separate projection pipelines or strong attention supervision.
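Such head-level statistics can be approximated directly from a layer's attention maps. The sketch below computes, per head, the attention mass placed on visual tokens and an entropy-based concentration score; the exact definitions in the paper may differ, and the visual-token positions are assumed to be known.

```python
import torch


def visual_head_metrics(attn: torch.Tensor, visual_idx: torch.Tensor):
    """Per-head statistics for one layer.

    attn:       (batch, heads, seq, seq), rows softmax-normalized.
    visual_idx: 1-D LongTensor with positions of visual tokens.
    Returns per-head mean attention mass on visual tokens and a
    concentration score (1 - normalized entropy over visual positions).
    """
    vis = attn[..., visual_idx]                           # (b, h, seq, |vis|)
    total_visual = vis.sum(dim=-1).mean(dim=(0, 2))       # (heads,)
    p = vis / vis.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=-1)  # (b, h, seq)
    max_entropy = torch.log(torch.tensor(float(visual_idx.numel())))
    concentration = 1.0 - (entropy / max_entropy).mean(dim=(0, 2))
    return total_visual, concentration  # higher values suggest a "visual" head
```

Heads scoring highly on both metrics are natural candidates for the masking ablation described above: zeroing their output and measuring the resulting drop on visual benchmarks.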
6. Implications for Multimodal Transformer Architecture Design
The existing evidence motivates selective, structure-aware QKV separation for vision tasks:
- Early and middle layers can benefit from separate or strongly visual-focused projections and/or heads for robust vision-language grounding.
- Bidirectionality in attention, as implemented in LookupViT and in loss supervision approaches, is pivotal for maintaining relevance of visual information. The MHBC module in LookupViT, for example, demonstrates that gathering and diffusing information between heterogeneous token streams via separate (and recycled) attention steps bridges the computational/representational gap between modalities (Koner et al., 17 Jul 2024).
- Positional treatment (as in Vista-LLaMA) should be modality-aware, further supporting the design of non-uniform, modality-sensitive attention parameterization (Ma et al., 2023).
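A schematic of such modality-aware positional treatment, in the spirit of Vista-LLaMA's EDVT attention, is sketched below: rotary position embeddings enter the attention scores only for text keys, while pairs involving visual keys use unrotated projections so that text-to-visual attention does not decay with distance. The helper signature and masking scheme are illustrative assumptions, not the exact implementation.

```python
import torch


def edvt_style_scores(q: torch.Tensor, k: torch.Tensor,
                      q_rot: torch.Tensor, k_rot: torch.Tensor,
                      visual_mask: torch.Tensor) -> torch.Tensor:
    """Mix attention scores so text-text pairs use RoPE-rotated Q/K, while any
    pair whose key is a visual token uses the unrotated (distance-free) projections.

    q, k:         (batch, heads, seq, head_dim), raw projections.
    q_rot, k_rot: the same tensors after applying rotary position embeddings.
    visual_mask:  (seq,) bool, True where the key position is a visual token.
    """
    d = q.shape[-1]
    scores_pos = (q_rot @ k_rot.transpose(-2, -1)) / d ** 0.5  # positional decay
    scores_nopos = (q @ k.transpose(-2, -1)) / d ** 0.5        # distance-free
    # Select per key column: visual keys get the position-free scores.
    return torch.where(visual_mask.view(1, 1, 1, -1), scores_nopos, scores_pos)
```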
7. Limitations and Future Directions
Direct empirical comparisons between explicit QKV separation and strong head/loss-level supervision remain limited. Most state-of-the-art MLLMs adopt shared QKV projections after intermediate adapters, relying on emergent specialization in attention heads. Research continues into architectures with separate, possibly parameter-tied, projections for vision and text modalities at various stages of the model, as well as into scalable methods for enforcing or diagnosing cross-modal specialization without architectural overhead (Bi et al., 24 Dec 2024, Esmaeilkhani et al., 16 Nov 2025). Future work also targets fully bidirectional grounding, dynamic attention routing, and modality-adaptive mechanisms for emerging 3D, temporal, and reasoning tasks.
References:
- "Unveiling Visual Perception in LLMs: An Attention Head Analysis Approach" (Bi et al., 24 Dec 2024)
- "v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning" (Chung et al., 24 May 2025)
- "LookupViT: Compressing visual information to a limited number of tokens" (Koner et al., 17 Jul 2024)
- "Direct Visual Grounding by Directing Attention of Visual Tokens" (Esmaeilkhani et al., 16 Nov 2025)
- "Vista-LLaMA: Reducing Hallucination in Video LLMs via Equal Distance to Visual Tokens" (Ma et al., 2023)