
Evidence Tokens in Multimodal Reasoning

Updated 6 January 2026
  • Evidence tokens are decoder choices that reference specific image patch embeddings, enabling direct perceptual grounding in multimodal reasoning.
  • They work by copying designated patch embeddings from flattened images, blending visual evidence into textual inferences via a differentiable pointer mechanism.
  • Empirical evaluations show that evidence tokens deliver substantial gains, including an improvement of more than 10 percentage points on the MathVision mini benchmark.

An evidence token, in the context of multimodal LLMs (MLLMs), is a decoder choice that facilitates dynamic, differentiable referencing of visual content during autoregressive reasoning. In the v1 architecture, an evidence token points to a specific patch embedding from the input image, which is then re-injected into the model’s reasoning stream. This mechanism directly integrates perceptual evidence—semantic embeddings of image regions—into ongoing chain-of-thought, supporting grounded multimodal mathematical reasoning and enabling flexible revisiting of relevant visual information throughout the inference process (Chung et al., 24 May 2025).

1. Formalization and Operational Role

Formally, given an input image processed into $K$ patch embeddings $C = \{c_1, \dots, c_K\}$, each $c_k \in \mathbb{R}^D$, the decoder at each step $t$ chooses to output either (a) a vocabulary token $w \in V$ or (b) a pointer token $\langle\text{ptr}:c_k\rangle$, termed an evidence token. Selecting the latter causes the patch embedding $c_k$ to be injected as the subsequent input embedding, effectively merging visual evidence into textual reasoning. This provides direct perceptual grounding at each inference step, in contrast to conventional approaches where visual information is accessed only once at the beginning.
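To make the decoder's choice concrete, the short sketch below (illustrative only; the 0-based indexing and example sizes are assumptions, not taken from the paper) shows how a single index over the augmented output space maps to either a vocabulary token or an evidence token:

# Minimal sketch: interpreting an index over the joint |V|+K output space.
# 0-based indexing and the example sizes are assumptions for illustration.
def interpret_choice(i: int, vocab_size: int):
    """Return ('vocab', token_id) or ('ptr', patch_index) for joint index i."""
    if i < vocab_size:
        return ("vocab", i)            # ordinary text token w in V
    return ("ptr", i - vocab_size)     # evidence token pointing at patch c_k

# e.g. with vocab_size = 50_000 and K = 576 patches,
# joint index 50_007 selects the evidence token for patch k = 7:
print(interpret_choice(50_007, 50_000))  # ('ptr', 7)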

2. Model Architecture and Pointing Mechanism

The v1 extension utilizes two lightweight linear heads atop the transformer decoder's last hidden state $h_t \in \mathbb{R}^D$. One head produces generation logits over the textual vocabulary $V$, while the pointing head produces logits over the set of $K$ image patches. Input images are flattened into patch embeddings and processed through a key cache ($L_k(c_k)$) and a value cache ($c_k$). During decoding, these logits are concatenated into a $(|V|+K)$-way vector, followed by a softmax selection. Choosing a pointer index corresponds to the emission of an evidence token, which copies the selected patch embedding into the reasoning stream.

for t = 1…T do:
    h_t ← Transformer.decode_step(x_{t-1}, KV_cache)
    gen_logits ← W_gen·h_t            # size |V|
    ptr_logits ← (L_q·h_t) · [L_k(c_k)]ᵀ / sqrt(D)   # size K
    joint_logits ← concat(gen_logits, ptr_logits)  # size |V|+K
    i* ← argmax or sample from softmax(joint_logits)
    if i* ≤ |V| then
        x_t ← vocab_token(i*)
        x_t_embedding ← embed(x_t)
    else
        k ← i* – |V|
        x_t ← ⟨ptr:c_k⟩           # evidence token selected
        x_t_embedding ← c_k      # copy patch embedding back in
    end if
    Append x_t_embedding to input sequence for next step
end for
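
As a minimal PyTorch sketch of the mechanism above (module names, tensor shapes, and the batch-size-1 selection helper are assumptions, not the released v1 implementation):

import math
import torch
import torch.nn as nn

class JointHead(nn.Module):
    """Illustrative joint generation/pointing head over a |V|+K output space."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.w_gen = nn.Linear(d_model, vocab_size, bias=False)  # W_gen
        self.l_q = nn.Linear(d_model, d_model, bias=False)       # L_q
        self.l_k = nn.Linear(d_model, d_model, bias=False)       # L_k
        self.d_model = d_model

    def forward(self, h_t: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
        # h_t: (B, D) last hidden state; patches: (B, K, D) patch embeddings c_k
        gen_logits = self.w_gen(h_t)                                  # (B, |V|)
        keys = self.l_k(patches)                                      # (B, K, D)
        ptr_logits = torch.einsum("bd,bkd->bk", self.l_q(h_t), keys)  # (B, K)
        ptr_logits = ptr_logits / math.sqrt(self.d_model)             # scale by sqrt(D)
        return torch.cat([gen_logits, ptr_logits], dim=-1)            # (B, |V|+K)

def next_input_embedding(joint_logits, patches, token_embedding, vocab_size):
    """Greedy selection for batch size 1: returns the embedding fed at the next step."""
    i = int(joint_logits.argmax(dim=-1))           # or sample from softmax(joint_logits)
    if i < vocab_size:
        return token_embedding(torch.tensor([i]))  # ordinary vocabulary token
    return patches[:, i - vocab_size]              # evidence token: copy patch c_k back in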

3. Mathematical Formulation

The pointing mechanism is expressed via semantic key–value lookup. Given hidden state $h_t$ at step $t$, and letting $W_{\mathrm{gen}}$, $L_q$, and $L_k$ denote trainable parameters:

  • Vocabulary logits:

\mathrm{logit}_{\mathrm{gen}} = W_{\mathrm{gen}} h_t \in \mathbb{R}^{|V|}

  • Pointing logits for each patch:

\mathrm{logit}_{\mathrm{ptr}}^{(k)} = \frac{L_q(h_t)\,(L_k(c_k))^\top}{\sqrt D}, \quad \forall\,k = 1 \ldots K

  • Joint distribution:

\mathrm{logit}_t = [\mathrm{logit}_{\mathrm{gen}}\;\|\;\mathrm{logit}_{\mathrm{ptr}}] \in \mathbb{R}^{|V|+K}

p(x_t = i) = \frac{\exp(\mathrm{logit}_t^{(i)})}{\sum_j \exp(\mathrm{logit}_t^{(j)})}

Supervision combines cross-entropy loss on the augmented vocabulary with a $z$-loss regularizer:

\mathcal{L}_{\mathrm{CE}} = -\sum_t \log p(x_t^\star) + \lambda_{\mathrm{z}} (\log Z)^2, \quad Z = \sum_j \exp(\mathrm{logit}_t^{(j)}), \quad \lambda_{\mathrm{z}} = 10^{-5}.
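
As a hedged sketch (not the released training code), the supervision objective above could be implemented as follows, reading the $(\log Z)^2$ term as a per-step penalty summed over the sequence:

import torch
import torch.nn.functional as F

def joint_ce_with_z_loss(joint_logits: torch.Tensor,
                         targets: torch.Tensor,
                         z_coeff: float = 1e-5) -> torch.Tensor:
    """Cross-entropy over the augmented |V|+K output space plus a z-loss term.
    joint_logits: (T, |V|+K) per-step logits; targets: (T,) gold indices, where
    indices >= |V| are evidence-token (pointer) targets. Illustrative only."""
    ce = F.cross_entropy(joint_logits, targets, reduction="sum")   # -sum_t log p(x_t*)
    log_z = torch.logsumexp(joint_logits, dim=-1)                  # log Z per step
    return ce + z_coeff * (log_z ** 2).sum()                       # lambda_z = 1e-5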

4. Dataset Construction and Annotation (v1g)

To support learning of evidence token usage, the v1g dataset was constructed, comprising 300,000 multimodal reasoning traces with explicit grounding annotations. Generation proceeds in three stages:

  • Seed trace generation: Chain-of-thought answers are sampled from TVC-7B across nine domains (Charts, Documents, Geometry, IQ Tests, Medical, Natural Scenes, Science Diagrams, Synthetic, Tables).
  • Visual-query decomposition: Steps referencing images are rewritten with explicit detect(query, objects=[…]) calls and symbolic identifiers.
  • Attention-based grounding: Bounding boxes are computed from the cross-attention maps of a frozen Qwen2.5-VL backbone using ratio-of-marginal-to-conditional-attention masks; filtering yields ~82% retention.

In training, each detect call is treated as a supervised pointer target, ensuring models emit evidence tokens at contextually appropriate moments.
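
For illustration, a single annotated step in such a trace might look like the sketch below; the field names and values are assumptions for exposition, not the released v1g schema:

# Hypothetical shape of one grounded reasoning step (not the actual v1g format).
example_step = {
    "text": "Locate the target region: detect('hexagon B', objects=['obj_1'])",
    "pointer_targets": [137, 138],             # patch indices supervised as evidence tokens
    "bounding_box": [0.42, 0.18, 0.61, 0.35],  # normalized box from attention-based grounding
}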

| Statistic | Value |
|---|---|
| Total traces | 300,000 |
| Avg. reasoning length | 6.2 steps |
| Avg. visual references | 2.1 per trace |
| Domains | 9 (∼33k traces each) |
| Patches pointed per trace | 4.8 |
| Bounding boxes per example | mean 3.5, σ = 1.2 |

5. Empirical Performance and Ablations

On three multimodal mathematical reasoning benchmarks, v1's evidence tokens delivered robust accuracy improvements over the Qwen2.5-VL backbone (7B parameters) and over ablated variants. The table below reports accuracy (%) on each benchmark, highlighting the contribution of point-and-copy:

| Model | MathVista | MathVision mini | MathVision full | MathVerse |
|---|---|---|---|---|
| Qwen2.5-VL (7B, base) | 67.8 | 23.6 | — | 44.5 |
| v1 (7B, full pointing) | 68.6 | 34.5 | 28.1 | 48.6 |
| v1 (7B), no pointing at inference | 60.0 | 25.3 | 23.7 | 33.6 |

Notably, v1's accuracy on MathVision mini rises from 23.6% to 34.5% (+10.9 percentage points), a gain attributable directly to evidence token usage. Suppressing evidence tokens during inference produces a distinct drop in accuracy, below even the backbone without fine-tuning. A coordinate-only variant (supervised on box coordinates without patch copying) achieves 31.9%, indicating that patch-level semantic copying outperforms mere coordinate supervision.

Qualitatively, v1 outputs chain-of-thought traces interleaving textual reasoning and explicit pointers (e.g., “detect(‘hexagon B’, …)… <ptr₇>…”), with downstream steps showing heightened attention to copied embeddings.

6. Contextual Significance and Implications

The evidence token mechanism in v1 provides a lightweight, end-to-end differentiable solution for continuous multimodal reasoning, allowing MLLMs to re-access and directly integrate perceptually grounded visual features. This approach substantially mitigates the drift typical in chain-of-thought reasoning over lengthy sequences, where conventional models lose focus on relevant image regions. A plausible implication is that evidence tokens may generalize as a practical risk-reduction strategy for continual grounding in multimodal settings, potentially benefiting future MLLM architectures that require dynamic visual referencing. The validation of patch-level semantic copy as superior to call-by-coordinate further suggests that careful semantic integration is crucial for effective grounded reasoning in complex inference tasks.
