Evidence Tokens in Multimodal Reasoning
- Evidence tokens are decoder choices that reference specific image patch embeddings, enabling direct perceptual grounding in multimodal reasoning.
- They work by copying designated patch embeddings from flattened images, blending visual evidence into textual inferences via a differentiable pointer mechanism.
- Empirical evaluations show that evidence tokens substantially boost performance, with gains of over 10 percentage points on key multimodal mathematical reasoning benchmarks.
An evidence token, in the context of multimodal LLMs (MLLMs), is a decoder choice that facilitates dynamic, differentiable referencing of visual content during autoregressive reasoning. In the v1 architecture, an evidence token points to a specific patch embedding from the input image, which is then re-injected into the model’s reasoning stream. This mechanism directly integrates perceptual evidence—semantic embeddings of image regions—into ongoing chain-of-thought, supporting grounded multimodal mathematical reasoning and enabling flexible revisiting of relevant visual information throughout the inference process (Chung et al., 24 May 2025).
1. Formalization and Operational Role
Formally, given an input image processed into patch embeddings $\{c_k\}_{k=1}^{K}$, each $c_k \in \mathbb{R}^{D}$, the decoder at each step $t$ chooses to output either (a) a vocabulary token $v \in V$ or (b) a pointer token $\langle \text{ptr}{:}c_k \rangle$, termed an evidence token. Selecting the latter causes the patch embedding $c_k$ to be injected as the subsequent input embedding, effectively merging visual evidence into textual reasoning. This provides direct perceptual grounding at each inference step, in contrast to conventional approaches where visual information is accessed only once at the beginning.
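Stated compactly (a reconstruction for clarity, consistent with the decoding loop shown in Section 2 below; the token-embedding lookup $E(\cdot)$ is notation introduced here), the next input embedding $x_t$ depends on the selected index $i^{\ast}$ over the joint vocabulary-plus-patch space:

$$
x_t \;=\;
\begin{cases}
E\!\left(v_{i^{\ast}}\right), & 1 \le i^{\ast} \le |V| \quad \text{(vocabulary token)}\\[4pt]
c_{\,i^{\ast} - |V|}, & |V| < i^{\ast} \le |V| + K \quad \text{(evidence token: patch embedding copied back in)}
\end{cases}
$$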
2. Model Architecture and Pointing Mechanism
The v1 extension utilizes two lightweight linear heads atop the transformer decoder’s last hidden state $h_t$. One head ($W_{\text{gen}}$) produces generation logits over the textual vocabulary $V$, while the pointing head produces logits over the set of $K$ image patches. Input images are flattened into patch embeddings $c_1, \dots, c_K$ and processed through a key cache ($L_k c_k$) and a value cache ($c_k$). During decoding, these logits are concatenated into a $(|V| + K)$-way vector, followed by a softmax selection. Choosing a pointer index $k$ corresponds to the emission of an evidence token $\langle \text{ptr}{:}c_k \rangle$, which copies the selected patch embedding into the reasoning stream.
for t = 1…T do:
h_t ← Transformer.decode_step(x_{t-1}, KV_cache)
gen_logits ← W_gen·h_t # size |V|
ptr_logits ← (L_q·h_t) · [L_k(c_k)]ᵀ / sqrt(D) # size K
joint_logits ← concat(gen_logits, ptr_logits) # size |V|+K
    i* ← argmax or sample from softmax(joint_logits)
if i* ≤ |V| then
x_t ← vocab_token(i*)
x_t_embedding ← embed(x_t)
else
k ← i* – |V|
x_t ← ⟨ptr:c_k⟩ # evidence token selected
x_t_embedding ← c_k # copy patch embedding back in
end if
Append x_t_embedding to input sequence for next step
end for
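To make the loop above concrete, here is a minimal, self-contained PyTorch sketch of a single point-and-copy decoding step. The module and parameter names (`PointAndCopyHead`, `W_gen`, `L_q`, `L_k`) mirror the pseudocode but are assumptions for illustration; the sketch omits the transformer decoder and KV caching entirely and is not the released v1 implementation.

```python
# Minimal PyTorch sketch of one point-and-copy decoding step, following the
# pseudocode above. Shapes, module names, and greedy selection are illustrative
# assumptions, not the released v1 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointAndCopyHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.W_gen = nn.Linear(hidden_dim, vocab_size, bias=False)  # generation head
        self.L_q = nn.Linear(hidden_dim, hidden_dim, bias=False)    # pointer query projection
        self.L_k = nn.Linear(hidden_dim, hidden_dim, bias=False)    # pointer key projection
        self.embed = nn.Embedding(vocab_size, hidden_dim)           # token embedding table
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

    def step(self, h_t: torch.Tensor, patch_embeds: torch.Tensor):
        """h_t: (hidden_dim,); patch_embeds: (K, hidden_dim). Returns (index, next input embedding)."""
        gen_logits = self.W_gen(h_t)                                 # size |V|
        keys = self.L_k(patch_embeds)                                # (K, hidden_dim); no key cache in this sketch
        ptr_logits = keys @ self.L_q(h_t) / self.hidden_dim ** 0.5   # size K
        joint_logits = torch.cat([gen_logits, ptr_logits])           # size |V| + K
        i_star = int(torch.argmax(F.softmax(joint_logits, dim=-1)))  # greedy selection
        if i_star < self.vocab_size:  # zero-indexed; the pseudocode above is one-indexed
            # Ordinary vocabulary token: embed it as the next input.
            return i_star, self.embed(torch.tensor(i_star))
        # Evidence token: copy the selected patch embedding back into the stream.
        return i_star, patch_embeds[i_star - self.vocab_size]

# Toy usage with random tensors.
head = PointAndCopyHead(hidden_dim=64, vocab_size=1000)
h_t = torch.randn(64)
patches = torch.randn(196, 64)  # e.g., a 14x14 grid of patch embeddings
idx, next_embedding = head.step(h_t, patches)
print("selected index:", idx, "is evidence token:", idx >= 1000)
```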
3. Mathematical Formulation
The pointing mechanism is expressed via a semantic key–value lookup. Given the hidden state $h_t$ at step $t$, with $W_{\text{gen}}$, $L_q$, and $L_k$ denoting trainable parameters:
- Vocabulary logits: $z^{\text{gen}}_t = W_{\text{gen}}\, h_t \in \mathbb{R}^{|V|}$
- Pointing logits for each patch $k \in \{1, \dots, K\}$: $z^{\text{ptr}}_{t,k} = \dfrac{(L_q h_t)^{\top} (L_k c_k)}{\sqrt{D}}$
- Joint distribution: $p_t = \operatorname{softmax}\!\big(\big[z^{\text{gen}}_t ; z^{\text{ptr}}_t\big]\big)$ over the $|V| + K$ output choices

Supervision combines a cross-entropy loss over this augmented output space, $\mathcal{L}_{\text{CE}} = -\sum_t \log p_t\!\left(y^{\ast}_t\right)$, with an additional regularization loss.
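As a toy illustration of the objective just described, the NumPy snippet below computes the cross-entropy over the augmented $|V| + K$ output space when one supervised target is a pointer index. The shapes, the offset convention (pointer targets stored as $|V| + k$), and the random logits are assumptions for illustration only.

```python
# Toy illustration (NumPy): cross-entropy over the augmented |V| + K output space,
# where targets >= |V| supervise pointer (evidence-token) emissions.
import numpy as np

V, K, T = 1000, 196, 3                      # vocab size, number of patches, sequence length
rng = np.random.default_rng(0)
joint_logits = rng.normal(size=(T, V + K))  # per-step logits over vocabulary + patches

# Supervised targets: two ordinary tokens and one evidence token pointing at patch 7.
targets = np.array([42, V + 7, 5])

log_probs = joint_logits - np.log(np.exp(joint_logits).sum(axis=1, keepdims=True))
ce_loss = -log_probs[np.arange(T), targets].mean()
print("cross-entropy over augmented vocabulary:", float(ce_loss))
```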
4. Dataset Construction and Annotation (v1g)
To support learning of evidence token usage, the v1g dataset was constructed, comprising 300,000 multimodal reasoning traces with explicit grounding annotations. Generation proceeds in three stages:
- Seed trace generation: Chain-of-thought answers are sampled from TVC-7B across nine domains (Charts, Documents, Geometry, IQ Tests, Medical, Natural Scenes, Science Diagrams, Synthetic, Tables).
- Visual-query decomposition: Steps referencing images are rewritten with explicit `detect(query, objects=[…])` calls and symbolic identifiers.
- Attention-based grounding: Cross-attention maps from a frozen Qwen2.5-VL backbone compute bounding boxes via ratio-of-marginal-to-conditional-attention masks; filtering yields ~82% retention.
In training, each detect call is treated as a supervised pointer target, ensuring models emit evidence tokens at contextually appropriate moments.
| Statistic | Value |
|---|---|
| Total traces | 300,000 |
| Avg. reasoning length | 6.2 steps |
| Avg. visual references | 2.1 per trace |
| Domains | 9 (∼33k traces each) |
| Patches pointed per trace | 4.8 |
| Bounding boxes per example | Mean 3.5, σ=1.2 |
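To make the annotation format described above concrete, here is a hypothetical sketch of what a single grounded v1g trace record might look like. The field names, coordinate convention, and patch-index mapping are assumptions for illustration and do not reproduce the released v1g schema.

```python
# Hypothetical sketch of a single grounded reasoning trace in a v1g-like format.
# Field names, the normalized bounding-box convention, and the patch indices are
# illustrative assumptions, not the released v1g schema.
example_trace = {
    "image": "charts/bar_chart_0412.png",              # hypothetical path
    "question": "Which category has the largest value?",
    "steps": [
        {
            "text": "Locate the tallest bar.",
            "detect": {"query": "tallest bar", "objects": ["bar_3"]},
            "bbox": [0.62, 0.10, 0.78, 0.85],           # normalized [x1, y1, x2, y2]
            "patch_indices": [52, 53, 66, 67],           # supervised pointer targets
        },
        {
            "text": "Read its label and value from the axis.",
            "detect": {"query": "x-axis label under bar_3", "objects": ["label_3"]},
            "bbox": [0.62, 0.86, 0.78, 0.95],
            "patch_indices": [178, 179],
        },
    ],
    "answer": "Category C",
}
```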
5. Empirical Performance and Ablations
On three multimodal mathematical reasoning benchmarks, v1’s evidence tokens delivered robust accuracy improvements over the Qwen2.5-VL backbone (7B parameters) and ablated variants. Table 1 summarizes accuracy across benchmarks, highlighting the contribution of point-and-copy:
| Model | MathVista | MathVision (mini) | MathVision (full) | MathVerse |
|---|---|---|---|---|
| Qwen2.5-VL (7B, base) | 67.8 | 23.6 | — | 44.5 |
| v1 (7B, full pointing) | 68.6 | 34.5 | 28.1 | 48.6 |
| v1 (no pointing at inference) | 60.0 | 25.3 | 23.7 | 33.6 |
Notably, v1’s accuracy on MathVision mini rises from 23.6% to 34.5% (+10.9 percentage points), a gain attributable directly to evidence token usage. Suppressing evidence tokens during inference produces a marked drop in accuracy, falling below even the un-finetuned backbone on MathVista and MathVerse. A coordinate-only variant (supervised on box coordinates without patch copying) achieves 31.9%, indicating the superiority of patch-level semantic copying over mere coordinate supervision.
Qualitatively, v1 outputs chain-of-thought traces interleaving textual reasoning and explicit pointers (e.g., “detect(‘hexagon B’, …)… <ptr₇>…”), with downstream steps showing heightened attention to copied embeddings.
6. Contextual Significance and Implications
The evidence token mechanism in v1 provides a lightweight, end-to-end differentiable solution for continuous multimodal reasoning, allowing MLLMs to re-access and directly integrate perceptually grounded visual features. This approach substantially mitigates the drift typical in chain-of-thought reasoning over lengthy sequences, where conventional models lose focus on relevant image regions. A plausible implication is that evidence tokens may generalize as a practical risk-reduction strategy for continual grounding in multimodal settings, potentially benefiting future MLLM architectures that require dynamic visual referencing. The validation of patch-level semantic copy as superior to call-by-coordinate further suggests that careful semantic integration is crucial for effective grounded reasoning in complex inference tasks.