Visual Reference Tokens in Multimodal Models

Updated 12 April 2026

Visual Reference Tokens (VRTs) are discrete or contextualized tokens that explicitly link visual signals to text, forming a unified basis for interpretable multimodal reasoning.
Methodological approaches for VRTs include contextual neighbor search, discrete tokenization with unified vocabularies, pointer-based retrieval, and dynamic token scaling for adaptive inference.
Practical applications of VRTs span image retrieval, object grounding, and enhanced multimodal reasoning, with demonstrated improvements in interpretability, segmentation accuracy, and retrieval performance.

Visual Reference Tokens (VRTs) are a central abstraction in modern vision-language and multimodal models, denoting discrete or contextualized tokens that serve as explicit, interpretable links between visual signals and linguistic representations. VRTs emerge in multiple methodological frameworks: as contextualized text fragments tightly describing a visual latent, as explicit patch or region indices within a shared vocabulary, or as dynamically constructed grounding entities accessed or emitted during reasoning. VRTs are foundational for bridging the gap between vision and language in neural network architectures, underpinning both model interpretability and fine-grained multimodal reasoning.

1. Mathematical Definitions and Instantiations

VRTs admit several formal definitions, each reflecting the modeling paradigm:

LatentLens contextual alignment: For a vision-LLM with vision connector $E_{vis} : \mathbb{R}^{d_{vis}} \to \mathbb{R}^d$ (typically a shallow MLP), let $v_i$ be the visual embedding for patch $i$, $h_i^0 = E_{vis}(v_i)$ its LLM-space projection, and $h_i^{\ell}$ its representation at transformer layer $\ell$. Using a large text corpus $\mathcal{C}$, each contextualized token embedding $r_{j,k}^{\ell}$ (sentence $j$, position $k$) contributes to a bank $\mathcal{R}$ of textual representations. The VRTs for visual token $v_i$ at layer $\ell$ are defined as

$$\mathrm{VRT}_K(v_i, \ell) = \left\{\, t_{j,k} \;:\; r_{j,k}^{\ell} \in \operatorname*{top\text{-}K}_{r \in \mathcal{R}} \; \mathrm{sim}\!\left(h_i^{\ell}, r\right) \right\}$$

where $t_{j,k}$ is the textual surface token at position $k$ of sentence $j$, $\mathrm{sim}$ is a similarity measure such as cosine similarity, and $K$ is the number of neighbors retained (Krojer et al., 31 Jan 2026).

Patch, quantized, and pointer-based formulations: In autoregressive and tokenized architectures, VRTs are discrete tokens (e.g., drawn from a codebook of 16 384 entries) corresponding to quantized image regions or patch embeddings, possibly with spatial indices or offsets that place them in a joint vocabulary (Ma et al., 2024, Su et al., 2 Oct 2025, Chung et al., 24 May 2025). In pointer architectures, VRTs are symbolic pointer tokens $\langle \mathrm{ptr}_i \rangle$ referencing the $i$-th vision patch embedding (Chung et al., 24 May 2025).
Semantically meaningful and object-grounded tokens: In compositional or graph-based models, VRTs are embeddings of objects (tangible tokens) or relations/actions (intangible tokens), typically extracted via segmentation (SEEM) or relationship detection (RAM) and aligned to textual descriptions through contrastive objectives (Kalibhat et al., 2024).
Dynamic and iterative VRTs: In models with iterative visual tool invocation, the VRT set is dynamically grown or contracted during inference, with each VRT corresponding to an explicit “visual module” invocation (crop, depth, OCR, etc.) producing a new visual reference embedding (Bai et al., 8 Jun 2025).
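The contextual-neighbor definition above can be illustrated with a minimal sketch. It assumes cosine similarity as the ranking function and a tiny in-memory bank of (surface token, embedding) pairs; the function names `cosine` and `top_k_vrts` and all numeric values are hypothetical and are not taken from the LatentLens implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_vrts(h_visual, text_bank, k=5):
    """Return the surface tokens of the k bank entries most similar to h_visual.

    text_bank: list of (surface_token, contextual_embedding) pairs.
    Cosine similarity is an assumption of this sketch.
    """
    ranked = sorted(text_bank, key=lambda entry: cosine(h_visual, entry[1]), reverse=True)
    return [token for token, _ in ranked[:k]]

# Toy bank with made-up 3-dimensional embeddings.
bank = [
    ("dog", [0.9, 0.1, 0.0]),
    ("puppy", [0.8, 0.2, 0.1]),
    ("car", [0.0, 0.1, 0.9]),
    ("road", [0.1, 0.0, 0.8]),
]
h = [1.0, 0.1, 0.0]  # hypothetical hidden state of one visual token
neighbors = top_k_vrts(h, bank, k=2)
print(neighbors)
```

A real bank would hold millions of contextual embeddings, so an approximate nearest-neighbor index would likely replace the full sort used here.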

2. Methodological Approaches for VRT Construction and Retrieval

Several key methodological paradigms underlie VRT construction:

Contextual alignment via neighbor search: LatentLens compares each visual token at each model layer to a large bank of textual token hidden states, returning the top-$k$ contextual neighbors as VRTs, enabling fine-grained layerwise interpretability (Krojer et al., 31 Jan 2026).
Discrete tokenization and vocabulary merging: ClawMachine and PaDT quantize vision backbone outputs to a discrete codebook, merging visual and language tokens into a unified vocabulary (e.g., 16 384 visual + 32 000 language tokens in LaVIT-7B), enabling direct region grounding and token-by-token referencing via the same auto-regressive head (Ma et al., 2024, Su et al., 2 Oct 2025).
Pointer/point-and-copy retrievers: In the “point-and-copy” approach, VRTs are generated as pointers over vision patch indices. When a pointer token is emitted, the model copies the associated vision patch embedding into the token stream, enabling simultaneous autoregressive reasoning and visual re-grounding (Chung et al., 24 May 2025).
Module-based visual token scaling: Visual token scaling frameworks define VRTs as outputs of user- or reasoner-invoked visual modules (grounding, segment, OCR), with each new visual reference augmenting the active token set during a multi-step MDP-modeled reasoning trajectory (Bai et al., 8 Jun 2025).
Semantic/concrete phrase detection: For translation and compositional tasks, VRTs are defined as linguistically concrete, visually grounded words or phrases, detected via NLP (WordNet concretization), object detection (MDETR), or joint verification. High-confidence noun/noun-phrase tokens are selected and masked during training to promote cross-modal context usage (Bowen et al., 2024).
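As a rough illustration of discrete tokenization with a merged vocabulary, the sketch below assigns each patch embedding to its nearest codebook entry and shifts the resulting index past the text vocabulary. Nearest-neighbor lookup in squared Euclidean distance and the "visual ids follow text ids" offset are assumptions made for the example, not a description of how ClawMachine or PaDT lay out their vocabularies; `nearest_code` and `to_unified_ids` are hypothetical names.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(code):
        return sum((x - y) ** 2 for x, y in zip(vec, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def to_unified_ids(patch_embeddings, codebook, text_vocab_size):
    """Map patch embeddings to token ids placed after the text vocabulary."""
    return [text_vocab_size + nearest_code(p, codebook) for p in patch_embeddings]

TEXT_VOCAB_SIZE = 32000  # size of the language vocabulary in this toy setup
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # toy 3-entry codebook
patches = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.0]]      # toy patch embeddings
ids = to_unified_ids(patches, codebook, TEXT_VOCAB_SIZE)
print(ids)
```

With a shared id space of this kind, a single autoregressive head can emit either a word or a visual token at each step.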

3. Architectural Integration and Token Usage

VRTs are integrated into model architectures via diverse mechanisms:

Frozen LLM integration: Visual tokens projected into LLM embedding space are fed into frozen LLMs as first-class “words”; VRTs are then interpretable as the nearest contextualized text fragments (top-$k$ neighbors) (Krojer et al., 31 Jan 2026).
Joint embedding and dynamic vocabulary: Models such as ClawMachine and PaDT interleave visual and textual tokens within the same embedding space and vocabulary, with visual tokens indexed and sorted spatially or by occurrence, permitting both region referring (describe a region given VRTs) and region grounding (emit the corresponding VRTs for a region described in text) (Ma et al., 2024, Su et al., 2 Oct 2025).
Pointer augmentation of decoding logits: In the v1 model, pointer logits over visual patches are concatenated with the textual logits, and a single softmax is applied over the merged space. When a pointer token is chosen, the continuous patch embedding is copied into the decoding context, sustaining multi-step visual access during chain-of-thought (Chung et al., 24 May 2025).
Self-attention with graph-structured biasing: VRTs extracted as object/relation instances are injected into transformers with additive attention weights imposed by scene graphs or spatial adjacency. This biases model attention to reflect actual semantic and structural relationships (Kalibhat et al., 2024).
Inference-time expansion: With dynamic visual token scaling, the active set of VRTs can expand, e.g., by executing new visual modules as directed by a learned reasoning policy, providing dynamic and context-driven grounding (Bai et al., 8 Jun 2025).
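The pointer-augmented decoding step described above can be sketched as follows. This is a simplified, greedy illustration under stated assumptions: the logits are given as plain lists, the text logits come first in the merged space, and `decode_step` is a hypothetical helper rather than the interface of any published model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(text_logits, pointer_logits, text_embeddings, patch_embeddings):
    """Greedy choice over the concatenated [text | pointer] logit space.

    Returns (kind, index, embedding), where embedding is what would be
    appended to the decoding context for the next step.
    """
    probs = softmax(text_logits + pointer_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    n_text = len(text_logits)
    if best < n_text:
        return "text", best, text_embeddings[best]
    patch_idx = best - n_text
    # Point-and-copy: the continuous patch embedding re-enters the context.
    return "pointer", patch_idx, patch_embeddings[patch_idx]

# Toy inputs: 2 text tokens and 3 image patches.
text_emb = [[0.5, 0.5], [0.2, 0.7]]
patch_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = decode_step([0.1, 0.3], [0.2, 2.0, 0.0], text_emb, patch_emb)
print(result)
```

Because both kinds of token compete in one distribution, the model can decide at every step whether to continue in language or to look back at a specific patch.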

4. Quantitative and Qualitative Evaluation

Interpretability and functional utility of VRTs are assessed through human and automated protocols:

Interpretability metrics (LatentLens): Across a diverse suite of VLMs and layers, interpretability is measured as the fraction of visual tokens for which at least one of the top-5 VRTs is judged (by GPT-4o) as concrete, abstract, or global. LatentLens yields ~72% interpretable tokens across all layers and models, compared to 30% (EmbeddingLens) and 23% (LogitLens) (Krojer et al., 31 Jan 2026). A notable finding is that visual token projections align with mid-layer language representations rather than with the input embedding space.
Compositionality and retrieval benchmarks: Patch- and object-based VRTs yield significant performance improvements on canonical tasks. Tangible/intangible VRTs coupled with relational attention raise text-to-image retrieval R@1 by +47% and compositional benchmarks (ARO, Winoground) by +18% and +10% over standard ViTs (Kalibhat et al., 2024).
Grounded reasoning gains: Integrating point-and-copy VRTs into multimodal reasoning chains improves arithmetic, logic, and visual math benchmarks by up to 10.9 absolute points (MathVision-mini) compared to text-only or static representations, and matches the performance of much larger models when sufficient supervised grounding data is available (Chung et al., 24 May 2025).
Referring and grounding accuracy: Discrete VRT-based models such as ClawMachine and PaDT achieve high region comprehension and segmentation scores. For example, PaDT (3B) achieves 90.0% REC on RefCOCO/+/g and 73.4% cIoU, outperforming prior LLM-based methods and matching or exceeding much larger baselines (e.g., InternVL3, 78B) (Su et al., 2 Oct 2025, Ma et al., 2024).
Machine translation improvements: Masking visually grounded tokens (VRTs) and training models to recover them increases BLEU (up to +1.2) and disambiguation score (CoMMuTE +0.12) over standard pipelines, with the highest gains for NLP-based token detection strategies (Bowen et al., 2024).
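The interpretability score used in the first item of this list reduces to a simple count once the judge labels are available. The sketch below assumes the judge's verdicts have already been collected as booleans, one per retrieved VRT; the function name and the toy verdicts are hypothetical.

```python
def interpretable_fraction(judgements):
    """Fraction of visual tokens with at least one VRT judged interpretable.

    judgements: one list of booleans per visual token, one boolean per
    retrieved VRT (e.g. five entries for a top-5 retrieval).
    """
    if not judgements:
        return 0.0
    hits = sum(1 for verdicts in judgements if any(verdicts))
    return hits / len(judgements)

# Toy verdicts for four visual tokens with top-5 retrieval each.
toy = [
    [False, True, False, False, False],
    [False, False, False, False, False],
    [True, True, False, False, False],
    [False, False, False, False, True],
]
score = interpretable_fraction(toy)
print(score)  # 3 of 4 tokens have at least one accepted VRT
```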

5. Practical Applications and Downstream Utility

VRTs are leveraged for a range of downstream tasks:

Application Area	VRT Role	Example Papers
Interpretability	Surface latent visual semantics as language	(Krojer et al., 31 Jan 2026)
Image/text retrieval	Semantically meaningful token alignment	(Kalibhat et al., 2024)
Referring/grounding	Emit/recover VRT sequences for object regions	(Ma et al., 2024, Su et al., 2 Oct 2025)
Multimodal reasoning	Chain-of-thought with persistent visual grounding	(Chung et al., 24 May 2025, Bai et al., 8 Jun 2025)
Hallucination detection	Check for mapping to concrete/abstract VRTs	(Krojer et al., 31 Jan 2026)
Machine translation	Mask/recover visually grounded words (VRTs)	(Bowen et al., 2024)

A key implication is that by interleaving or selectively emitting VRTs alongside or within language, models can maintain visual grounding throughout complex reasoning, reduce hallucination, and produce more robust, interpretable output on tasks ranging from segmentation to multimodal chain-of-thought.
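For the machine translation use case in the table above, the masking step can be sketched as below. The example assumes that the set of visually grounded words has already been produced by some detector; whitespace tokenization, the `[MASK]` string, and the helper name `mask_grounded_tokens` are simplifications chosen for illustration.

```python
def mask_grounded_tokens(tokens, grounded, mask_token="[MASK]"):
    """Replace tokens whose lowercase form is in `grounded` with mask_token.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model should recover.
    """
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok.lower() in grounded:
            masked.append(mask_token)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

sentence = "A dog chases the red ball".split()
masked, targets = mask_grounded_tokens(sentence, {"dog", "ball"})
print(masked)
print(targets)
```

Hiding exactly the words that can be seen in the image gives the model a reason to consult the visual input when reconstructing them.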

6. Limitations, Ablations, and Open Issues

Ablation studies and reported constraints point to several outstanding limitations:

Connector architecture: In LatentLens, switching from a 3-layer MLP to a linear connector for vision-to-language projections yields only marginal interpretability changes (~0.8%), but alters the qualitative overlap of the top-$k$ VRTs (~2/5 overlap) (Krojer et al., 31 Jan 2026).
Corpus design: Single-sentence vs. multi-sentence textual reference corpora and the choice of captions vs. synthetic text moderately impact VRT coverage and interpretability.
Training objectives: Unfreezing the LLM backbone during joint vision-language training yields clear interpretability improvements (+6.4%), while switching to non-linguistic objectives sharply reduces VRT informativeness (–30%) (Krojer et al., 31 Jan 2026).
Ablation of relational attention: Removing additive attention mechanisms degrades compositional reasoning and order-sensitive tasks, indicating the importance of graph-structured biases in meaningful VRT systems (Kalibhat et al., 2024).
Dynamic scaling: Removing the verifier or the dynamic VRT scaling component in iterative frameworks causes 5–7 point drops in reasoning accuracy, illustrating the necessity of agency and selective attention (Bai et al., 8 Jun 2025).
Codebook, patch, and mixing strategies: VRT-based models are sensitive to codebook size, patch selection, and the training schedule (continuous vs. discrete), with best results from hybrid (mixed) schedules and select-and-merge redundancy removal (Ma et al., 2024).
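To make the relational-attention ablation above more concrete, the sketch below shows one way an additive, graph-derived bias can enter attention: a constant is added to the attention logits of token pairs that are linked in a scene graph before the row-wise softmax. The bias value, the binary adjacency matrix, and the function name are assumptions for illustration, not the formulation used by Kalibhat et al.

```python
import math

def biased_attention(scores, adjacency, bias=1.0):
    """Row-wise softmax over scores[i][j] + bias * adjacency[i][j]."""
    weights = []
    for score_row, adj_row in zip(scores, adjacency):
        logits = [s + bias * a for s, a in zip(score_row, adj_row)]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return weights

# Three object tokens; token 1 is linked to tokens 0 and 2 in a toy scene graph.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # uniform raw attention logits
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
attn = biased_attention(scores, adjacency, bias=1.0)
for row in attn:
    print([round(w, 3) for w in row])
```

With uniform raw scores, all of the difference between attention weights comes from the graph, which is the effect that disappears when the bias is ablated.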

7. Broader Implications and Future Directions

The continued evolution of VRT methodologies signals a shift from opaque, globally patchified vision-language representations to transparent, actionable, and highly structured visual reasoning:

Unified multimodal vocabularies: By embedding vision tokens as “words,” models approach a natively multimodal representation, enabling vision tasks to be handled under the same auto-regressive paradigm as text (Ma et al., 2024, Su et al., 2 Oct 2025, Chung et al., 24 May 2025).
Generalization beyond vision: The VRT neighbor search pipeline in LatentLens provides a generic recipe for surfacing semantics of soft prompts, speech embeddings, or arbitrary latent vectors by mapping to contextual neighbors, broadening the scope for cross-modal interpretability (Krojer et al., 31 Jan 2026).
Dynamic token/action scaling: Explicitly modeling token scaling as an MDP, with a verifier to adaptively decide when to acquire more visual evidence, supports fine-grained reasoning aligned with human perception (Bai et al., 8 Jun 2025).
Structured relational reasoning: The injection of graph-structured relations into self-attention via VRT-derived attention biases approximates GNN-like relational reasoning within transformers, boosting compositional and relational benchmarks (Kalibhat et al., 2024).
End-to-end and real-time scenarios: Open research avenues include scaling VRT extraction to very large datasets, integrating end-to-end trainable extraction modules, optimizing efficiency for real-time deployment, expanding the suite of visual modules, and refining isomorphisms between vision and language latent spaces.
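The dynamic token scaling idea in the list above can be sketched as a small control loop in which a policy chooses the next visual module and a verifier decides when enough evidence has been gathered. All callables here are toy stand-ins with hypothetical names; a real system would use learned models for both the policy and the verifier.

```python
def scale_visual_tokens(question, modules, policy, verifier, max_steps=5):
    """Invoke visual modules one at a time until the verifier accepts.

    modules:  dict mapping a module name to a callable(question) that
              returns a new visual reference (any object).
    policy:   callable(question, evidence) -> name of the next module.
    verifier: callable(question, evidence) -> True when evidence suffices.
    """
    evidence = []
    for _ in range(max_steps):
        if verifier(question, evidence):
            break
        name = policy(question, evidence)
        evidence.append((name, modules[name](question)))
    return evidence

# Toy stand-ins for the visual modules, policy, and verifier.
modules = {
    "crop": lambda q: "crop-of-sign",
    "ocr": lambda q: "STOP",
}
policy = lambda q, ev: "crop" if not ev else "ocr"
verifier = lambda q, ev: any(name == "ocr" for name, _ in ev)

trace = scale_visual_tokens("What does the sign say?", modules, policy, verifier)
print(trace)
```

The `max_steps` cap bounds the number of added visual references, so the cost of inference stays predictable even if the verifier never accepts.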

Taken together, Visual Reference Tokens provide a unifying abstraction for interpretable, grounded, and compositional vision-language reasoning, with broad applications across interpretability analytics, retrieval, description, captioning, segmentation, and multi-step cognitive tasks. They serve as an extensible interface for dynamic, context-aware visual grounding, substantially closing the gap between static image patching and rich, actionable perceptual representation (Krojer et al., 31 Jan 2026, Kalibhat et al., 2024, Ma et al., 2024, Su et al., 2 Oct 2025, Chung et al., 24 May 2025, Bai et al., 8 Jun 2025, Bowen et al., 2024).