Bidirectional Attention on Visual Tokens
- Bidirectional attention on visual tokens is a mechanism in transformer architectures that enables iterative, learnable exchange of information between distinct sets of visual tokens, such as high-resolution lookup tokens and compressed tokens, supporting efficient visual and cross-modal processing.
- This approach leverages gather and scatter operations to aggregate local and global visual features, dramatically reducing computational costs while maintaining spatial resolution.
- Implemented in models like LookupViT and multimodal LLMs, it demonstrates improved accuracy, robustness, and interpretability across various vision and multimodal tasks.
Bidirectional attention on visual tokens refers to mechanisms within Transformer-based architectures that enable learnable and iterative exchange of information between distinct sets of tokens representing visual inputs. This concept appears in state-of-the-art vision and multimodal networks, facilitating both computational efficiency and enhanced cross-modal reasoning. Two principal instantiations are (1) bidirectional cross-attention in vision transformers for computational reduction via compressed tokens (Koner et al., 17 Jul 2024), and (2) specialized attention heads in multimodal LLMs (MLLMs) that mediate attention between text and visual tokens, supporting emergent visual reasoning (Bi et al., 24 Dec 2024).
1. Architectural Foundations of Bidirectional Attention on Visual Tokens
Bidirectional attention on visual tokens commonly involves two streams of representations—high-resolution “lookup tokens” encoding local visual features, and a smaller set of “compressed tokens” that aggregate and process global information. In architectures such as LookupViT, these streams are linked by a sequence of cross-attention operations. The process first gathers information from lookup to compressed tokens (gather), refines the compressed tokens (refine), and then redistributes refined information back to lookup tokens (scatter), forming a bidirectional flow.
In multimodal LLMs, the concept manifests as attention heads that explicitly direct attention mass from text tokens to visual tokens, and reciprocally from visual tokens to text tokens, within a unified attention matrix, supporting emergent cross-modal integration.
2. Mathematical Frameworks for Bidirectional Attention
Vision Transformers (LookupViT)
Let $X_\ell \in \mathbb{R}^{N \times d}$ denote the lookup tokens and $X_p \in \mathbb{R}^{M \times d}$ the compressed tokens, with $M \ll N$. Bidirectional attention comprises two blocks:
- Lookup → Compressed Cross-attention (gathering): For each head $i$, compute $A^i = \mathrm{softmax}\!\big(Q_p^i (K_\ell^i)^\top / \sqrt{d_h}\big)$ and $O_p^i = A^i V_\ell^i$, where $Q_p^i$ is projected from the compressed tokens and $K_\ell^i, V_\ell^i$ from the lookup tokens. Concatenate heads and apply an output projection.
- Compressed → Lookup Cross-attention (Scattering, reusing $A^i$): $O_\ell^i = (A^i)^\top V'^i_p$, with $V'^i_p$ projected from the refined compressed tokens. Again, concatenate heads and apply an output projection.
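Below is a minimal PyTorch sketch of how these formulas compose into a single gather/refine/scatter block. The class name, tensor shapes, and the heavy/light MLP ratios are illustrative assumptions rather than the reference LookupViT implementation; the key points mirrored from the formulas are that queries come from the compressed tokens and that the gather attention matrix is cached and reused (transposed) in the scatter step.

```python
# Minimal sketch of a bidirectional cross-attention block (gather -> refine -> scatter).
# Names, shapes, and ratios are illustrative assumptions, not the official LookupViT code.
import torch
import torch.nn as nn


class LookupStyleBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, heavy_ratio: int = 4, light_ratio: float = 0.5):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.dh = num_heads, dim // num_heads
        # Gather: queries from compressed tokens, keys/values from lookup tokens.
        self.q_p, self.k_l, self.v_l = (nn.Linear(dim, dim) for _ in range(3))
        # Scatter: reuses A^T, but draws values from the refined compressed tokens.
        self.v_p = nn.Linear(dim, dim)
        self.proj_p, self.proj_l = nn.Linear(dim, dim), nn.Linear(dim, dim)
        # Heavy MLP on the few compressed tokens, lightweight MLP on the lookup tokens.
        self.heavy_mlp = nn.Sequential(nn.Linear(dim, heavy_ratio * dim), nn.GELU(),
                                       nn.Linear(heavy_ratio * dim, dim))
        self.light_mlp = nn.Sequential(nn.Linear(dim, int(light_ratio * dim)), nn.GELU(),
                                       nn.Linear(int(light_ratio * dim), dim))

    def _split(self, x):  # (B, n, dim) -> (B, h, n, dh)
        B, n, _ = x.shape
        return x.view(B, n, self.h, self.dh).transpose(1, 2)

    def _merge(self, x):  # (B, h, n, dh) -> (B, n, dim)
        B, h, n, dh = x.shape
        return x.transpose(1, 2).reshape(B, n, h * dh)

    def forward(self, x_l: torch.Tensor, x_p: torch.Tensor):
        # x_l: (B, N, dim) lookup tokens; x_p: (B, M, dim) compressed tokens, M << N.
        q = self._split(self.q_p(x_p))                                     # (B, h, M, dh)
        k, v = self._split(self.k_l(x_l)), self._split(self.v_l(x_l))      # (B, h, N, dh)

        # Gather: A^i = softmax(Q_p^i (K_l^i)^T / sqrt(d_h)),  O_p^i = A^i V_l^i.
        attn = ((q @ k.transpose(-2, -1)) / self.dh ** 0.5).softmax(dim=-1)  # (B, h, M, N)
        x_p = x_p + self.proj_p(self._merge(attn @ v))

        # Refine: heavy MLP applied only to the M compressed tokens.
        x_p = x_p + self.heavy_mlp(x_p)

        # Scatter: O_l^i = (A^i)^T V'_p^i, reusing the cached attention matrix.
        v_ref = self._split(self.v_p(x_p))                                  # (B, h, M, dh)
        x_l = x_l + self.proj_l(self._merge(attn.transpose(-2, -1) @ v_ref))

        # Lightweight MLP on the N lookup tokens.
        return x_l + self.light_mlp(x_l), x_p


# Illustrative usage with made-up sizes (N = 196 lookup tokens, M = 16 compressed).
block = LookupStyleBlock(dim=256, num_heads=8)
x_l, x_p = torch.randn(2, 196, 256), torch.randn(2, 16, 256)
x_l, x_p = block(x_l, x_p)
print(x_l.shape, x_p.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 16, 256])
```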
Multimodal LLMs
For attention matrix $A^{(l,h)}$ in layer $l$, head $h$:
- Text → Vision: $\alpha_{t \to v}^{(l,h)} = \sum_{j \in \mathcal{V}} A^{(l,h)}_{t,j}$,
where $t$ indexes a textual output token and $\mathcal{V}$ is the set of visual token indices.
- Vision → Text: $\alpha_{v \to t}^{(l,h)} = \sum_{j \in \mathcal{T}} A^{(l,h)}_{v,j}$,
where $v$ indexes a visual token and $\mathcal{T}$ is the set of text token indices.
These measures quantify bidirectional cross-modal attention within standard self-attention matrices (Bi et al., 24 Dec 2024).
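As a concrete illustration, the snippet below computes these attention-mass measures from a single row-stochastic attention matrix. The function name, tensor layout, and the `visual_idx` argument are assumptions made for the sketch, not the paper's tooling.

```python
# Sketch (assumed interface): cross-modal attention mass for one layer/head attention
# matrix A of shape (seq_len, seq_len), with `visual_idx` marking visual-token positions.
import torch


def cross_modal_attention_mass(A: torch.Tensor, visual_idx: torch.Tensor):
    seq_len = A.shape[0]
    is_visual = torch.zeros(seq_len, dtype=torch.bool)
    is_visual[visual_idx] = True

    # Text -> Vision: for each text query row, sum the attention placed on visual columns.
    text_to_vision = A[~is_visual][:, is_visual].sum(dim=-1)   # (num_text,)
    # Vision -> Text: for each visual query row, sum the attention placed on text columns.
    vision_to_text = A[is_visual][:, ~is_visual].sum(dim=-1)   # (num_visual,)
    return text_to_vision, vision_to_text


# Illustrative usage: a random row-stochastic attention matrix over 12 tokens,
# of which positions 2..7 are visual tokens.
A = torch.rand(12, 12).softmax(dim=-1)
t2v, v2t = cross_modal_attention_mass(A, torch.arange(2, 8))
print(t2v.mean().item(), v2t.mean().item())
```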
3. Implementation Strategies and Efficiency Considerations
Efficient implementation of bidirectional attention exploits parallelization and standard attention kernels on modern hardware. In LookupViT, attention is computed between the $N$ lookup tokens and $M$ compressed tokens with $M \ll N$, dramatically reducing computational cost relative to full self-attention over all $N$ tokens. The attention matrix $A$ computed in the gather phase is cached for re-use in the scatter pass, incurring only modest additional memory overhead. The heavy MLP is applied only to compressed tokens, while a lightweight MLP is used for lookup tokens.
For MLLMs, no explicit architectural separation exists—bidirectional attention is emergent within certain attention heads, which can be identified and quantified via metrics such as total visual attention weight and concentration (entropy-based measures). These heads predominantly operate in early and late model layers, mediating substantive bidirectional flow between textual and visual tokens (Bi et al., 24 Dec 2024).
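A hedged sketch of how such heads could be flagged from averaged attention matrices is given below; the thresholds, the entropy-based concentration formula, and the tensor layout are illustrative assumptions rather than the exact criteria used by Bi et al.

```python
# Hedged sketch: flag candidate "visual heads" from per-head attention statistics
# (mean text->vision attention mass and an entropy-based concentration score).
# Thresholds and tensor layout are illustrative assumptions, not values from the paper.
import torch


def visual_head_mask(attn: torch.Tensor, visual_idx: torch.Tensor,
                     mass_thresh: float = 0.2, conc_thresh: float = 0.5):
    # attn: (layers, heads, seq_len, seq_len) averaged attention matrices.
    L, H, S, _ = attn.shape
    is_visual = torch.zeros(S, dtype=torch.bool)
    is_visual[visual_idx] = True

    tv = attn[:, :, ~is_visual][:, :, :, is_visual]       # (L, H, num_text, num_visual)
    # Mean attention mass that text queries place on visual tokens, per (layer, head).
    mass = tv.sum(-1).mean(-1)                             # (L, H)
    # Concentration: 1 - normalized entropy of the text->vision attention distribution.
    p = tv / tv.sum(-1, keepdim=True).clamp_min(1e-8)
    entropy = -(p.clamp_min(1e-8).log() * p).sum(-1).mean(-1)
    concentration = 1.0 - entropy / torch.log(torch.tensor(float(is_visual.sum())))

    return (mass > mass_thresh) & (concentration > conc_thresh)   # (L, H) bool


# Illustrative usage with random attention over 12 tokens (positions 2..7 visual).
attn = torch.rand(4, 8, 12, 12).softmax(dim=-1)
print(visual_head_mask(attn, torch.arange(2, 8)).sum().item(), "heads flagged")
```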
4. Empirical Behavior and Specialized Attention Patterns
In LookupViT, cross-attention proceeds in two clearly delineated phases: first, the compressed tokens aggregate salient information from all lookup tokens; second, after internal refinement, information is selectively redistributed to update the lookup tokens.
In MLLMs, attention behaviors exhibit structured layer-wise dynamics. Visual-token-focused heads are identified by high $\alpha_{t \to v}$ and high concentration. Empirical results show that in early layers (up to roughly layer 5), many heads route a large portion of their attention mass from text to visual tokens. Mid-layer heads (e.g., around layer 24) show a resurgence of high visual focus, with the pattern being more distinct in larger models (e.g., 13B vs. 7B). Symmetry of attention matrices ensures substantial return flow from vision to text, which stabilizes in the mid layers (Bi et al., 24 Dec 2024).
5. Performance, Optimization, and Hyperparameters
The core computational advantage in LookupViT stems from the reduction in quadratic FLOP costs relative to standard ViT blocks. For small $M$, total FLOPs are reduced substantially, with the quadratic attention terms diminished by a factor of roughly 3–4. Empirical benchmarks confirm that this reduction is achieved without accuracy degradation and frequently with improved robustness on corrupted and out-of-distribution visual data (ImageNet-C/R/A/O) relative to ViT (Koner et al., 17 Jul 2024).
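To make the scaling argument concrete, the back-of-the-envelope calculation below counts only the attention score and value matmuls for full self-attention over $N$ tokens versus the gather and scatter passes with $M$ compressed tokens. The token counts and width are illustrative choices, not the configurations reported in the paper, and MLP FLOPs are ignored.

```python
# Rough FLOP comparison (attention matmuls only): full self-attention over N tokens
# vs. LookupViT-style gather/scatter with M compressed tokens. Sizes are illustrative.
def attn_flops_self(N: int, d: int) -> int:
    # QK^T and AV each cost ~2*N*N*d FLOPs (counting a multiply-add as 2 FLOPs).
    return 2 * (2 * N * N * d)

def attn_flops_bidirectional(N: int, M: int, d: int) -> int:
    gather = 2 * M * N * d + 2 * M * N * d   # Q_p K_l^T scores and A V_l values
    scatter = 2 * N * M * d                  # A^T V'_p values (A is reused, not recomputed)
    return gather + scatter

N, M, d = 196, 16, 768
full, bidir = attn_flops_self(N, d), attn_flops_bidirectional(N, M, d)
print(f"full self-attention : {full / 1e6:.1f} MFLOPs")
print(f"bidirectional (M={M}): {bidir / 1e6:.1f} MFLOPs  (~{full / bidir:.1f}x fewer)")
```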
Key tunables include the number of compressed tokens $M$, the number of attention heads, the hidden dimension, and the MLP expansion/reduction factors for the compressed and lookup streams. LookupViT models can be trained with a uniform random sweep over $M$ for multi-resolution robustness and then deployed with a chosen $M$ for compute-adaptive inference.
For MLLMs, the degree and distribution of bidirectional attention, as captured by the $\alpha_{t \to v}$ and concentration metrics, correlate strongly (by Pearson's $r$) with downstream task performance within model families, but only weakly across architectures. Heads in the upper-right quadrant of the visual-attention-weight versus concentration plane correspond to high-performing heads, and selective pruning of late-stage (non-visual) heads is empirically found to be feasible without loss of accuracy (Bi et al., 24 Dec 2024).
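As a sketch of what such pruning could look like mechanically, the snippet below zeroes the contribution of selected heads before the attention output projection; the tensor layout and masking interface are generic assumptions, not the internals of any particular MLLM implementation.

```python
# Hedged sketch of head pruning by masking: zero the per-head outputs of dropped heads
# before the output projection. Layout is a generic assumption, not a specific MLLM API.
import torch

def mask_heads(per_head_out: torch.Tensor, keep: torch.Tensor) -> torch.Tensor:
    # per_head_out: (batch, heads, seq, head_dim); keep: (heads,) boolean mask,
    # e.g. True for visual heads identified by the metrics above.
    return per_head_out * keep.view(1, -1, 1, 1).to(per_head_out.dtype)

# Illustrative usage: drop heads 5..7 of an 8-head layer.
out = torch.randn(2, 8, 12, 64)
keep = torch.tensor([True] * 5 + [False] * 3)
print(mask_heads(out, keep).abs().sum(dim=(0, 2, 3)))  # zero mass in the pruned heads
```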
6. Theoretical and Practical Implications
Bidirectional attention on visual tokens introduces architectures capable of flexibly concentrating computational resources where information density is highest—compressed tokens—while maintaining high spatial resolution via lookup tokens. Practically, this design generalizes to diverse vision and multimodal tasks and is compatible with hardware acceleration due to reliance on standard operator primitives. In the multimodal LLM context, bidirectionally attentive heads reveal emergent circuits that perform coarse-to-fine cross-modal binding, with evidence suggesting that high-capacity models develop a two-stage integration mechanism: early layers for broad association, and later layers for precise, focused integration (Bi et al., 24 Dec 2024).
A plausible implication is that explicit architectural allocation of cross-modal heads, pruning of non-contributing heads, and targeted deployment of adapter modules at concentration peaks could yield further FLOP savings and model interpretability.
7. Limitations and Open Research Questions
Current methods for visual-token bidirectional attention face several open challenges. LookupViT’s effectiveness is validated primarily on industry-standard image and video classification tasks; generalization to more complex scenarios or generative modeling remains to be established. For MLLMs, existing analysis centers on single-token output scenarios (e.g., PointQA); whether similar head patterns emerge in multi-token or grounding tasks is an unresolved question. Direct profiling of pure vision-to-text attention in LLMs has not been conducted—current conclusions on bidirectionality are implied by symmetry arguments. Weaknesses include the use of heuristically chosen thresholds for “visual head” detection and the reliance on frozen vision encoders; the effect of joint fine-tuning on attention patterns is unknown (Bi et al., 24 Dec 2024).
These findings collectively underscore the central role of bidirectional attention over visual tokens in modern efficient and interpretable visual and multimodal transformer architectures, while delineating a rich landscape for further empirical and theoretical inquiry.