Evidence Tokens in Multimodal Reasoning
- Evidence tokens are decoder choices that reference specific image patch embeddings, enabling direct perceptual grounding in multimodal reasoning.
- They work by copying designated patch embeddings from flattened images, blending visual evidence into textual inferences via a differentiable pointer mechanism.
- Empirical evaluations show that evidence tokens substantially boost performance, with gains of over 10 percentage points on key multimodal mathematical reasoning benchmarks.
An evidence token, in the context of multimodal LLMs (MLLMs), is a decoder choice that facilitates dynamic, differentiable referencing of visual content during autoregressive reasoning. In the v1 architecture, an evidence token points to a specific patch embedding from the input image, which is then re-injected into the model’s reasoning stream. This mechanism directly integrates perceptual evidence—semantic embeddings of image regions—into ongoing chain-of-thought, supporting grounded multimodal mathematical reasoning and enabling flexible revisiting of relevant visual information throughout the inference process (Chung et al., 24 May 2025).
1. Formalization and Operational Role
Formally, given an input image processed into patch embeddings $\{c_k\}_{k=1}^{K}$, each $c_k \in \mathbb{R}^{D}$, the decoder at each step $t$ chooses to output either (a) a vocabulary token $v \in V$ or (b) a pointer token $\langle \text{ptr}{:}c_k \rangle$, termed an evidence token. Selecting the latter causes the patch embedding $c_k$ to be injected as the subsequent input embedding, effectively merging visual evidence into textual reasoning. This provides direct perceptual grounding at each inference step, in contrast to conventional approaches where visual information is accessed only once at the beginning.
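Stated compactly (a reconstruction for clarity, consistent with the decoding loop shown in Section 2 below; the token-embedding lookup $E(\cdot)$ is notation introduced here), the next input embedding $x_t$ depends on the selected index $i^{\ast}$ over the joint vocabulary-plus-patch space:

$$
x_t \;=\;
\begin{cases}
E\!\left(v_{i^{\ast}}\right), & 1 \le i^{\ast} \le |V| \quad \text{(vocabulary token)}\\[4pt]
c_{\,i^{\ast} - |V|}, & |V| < i^{\ast} \le |V| + K \quad \text{(evidence token: patch embedding copied back in)}
\end{cases}
$$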
2. Model Architecture and Pointing Mechanism
The v1 extension utilizes two lightweight linear heads atop the transformer decoder’s last hidden state $h_t$. One head ($W_{\text{gen}}$) produces generation logits over the textual vocabulary $V$, while the pointing head produces logits over the set of $K$ image patches. Input images are flattened into patch embeddings $c_1, \dots, c_K$ and processed through a key cache ($L_k c_k$) and a value cache ($c_k$). During decoding, these logits are concatenated into a $(|V| + K)$-way vector, followed by a softmax selection. Choosing a pointer index $k$ corresponds to the emission of an evidence token $\langle \text{ptr}{:}c_k \rangle$, which copies the selected patch embedding into the reasoning stream.
for t = 1…T do:
h_t ← Transformer.decode_step(x_{t-1}, KV_cache)
gen_logits ← W_gen·h_t # size |V|
ptr_logits ← (L_q·h_t) · [L_k(c_k)]ᵀ / sqrt(D) # size K
joint_logits ← concat(gen_logits, ptr_logits) # size |V|+K
    i* ← argmax or sample from softmax(joint_logits)
if i* ≤ |V| then
x_t ← vocab_token(i*)
x_t_embedding ← embed(x_t)
else
k ← i* – |V|
x_t ← ⟨ptr:c_k⟩ # evidence token selected
x_t_embedding ← c_k # copy patch embedding back in
end if
Append x_t_embedding to input sequence for next step
end for
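To make the loop above concrete, here is a minimal, self-contained PyTorch sketch of a single point-and-copy decoding step. The module and parameter names (`PointAndCopyHead`, `W_gen`, `L_q`, `L_k`) mirror the pseudocode but are assumptions for illustration; the sketch omits the transformer decoder and KV caching entirely and is not the released v1 implementation.

```python
# Minimal PyTorch sketch of one point-and-copy decoding step, following the
# pseudocode above. Shapes, module names, and greedy selection are illustrative
# assumptions, not the released v1 implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointAndCopyHead(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.W_gen = nn.Linear(hidden_dim, vocab_size, bias=False)  # generation head
        self.L_q = nn.Linear(hidden_dim, hidden_dim, bias=False)    # pointer query projection
        self.L_k = nn.Linear(hidden_dim, hidden_dim, bias=False)    # pointer key projection
        self.embed = nn.Embedding(vocab_size, hidden_dim)           # token embedding table
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim

    def step(self, h_t: torch.Tensor, patch_embeds: torch.Tensor):
        """h_t: (hidden_dim,); patch_embeds: (K, hidden_dim). Returns (index, next input embedding)."""
        gen_logits = self.W_gen(h_t)                                 # size |V|
        keys = self.L_k(patch_embeds)                                # (K, hidden_dim); no key cache in this sketch
        ptr_logits = keys @ self.L_q(h_t) / self.hidden_dim ** 0.5   # size K
        joint_logits = torch.cat([gen_logits, ptr_logits])           # size |V| + K
        i_star = int(torch.argmax(F.softmax(joint_logits, dim=-1)))  # greedy selection
        if i_star < self.vocab_size:  # zero-indexed; the pseudocode above is one-indexed
            # Ordinary vocabulary token: embed it as the next input.
            return i_star, self.embed(torch.tensor(i_star))
        # Evidence token: copy the selected patch embedding back into the stream.
        return i_star, patch_embeds[i_star - self.vocab_size]

# Toy usage with random tensors.
head = PointAndCopyHead(hidden_dim=64, vocab_size=1000)
h_t = torch.randn(64)
patches = torch.randn(196, 64)  # e.g., a 14x14 grid of patch embeddings
idx, next_embedding = head.step(h_t, patches)
print("selected index:", idx, "is evidence token:", idx >= 1000)
```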
3. Mathematical Formulation
The pointing mechanism is expressed via a semantic key–value lookup. Given the hidden state $h_t$ at step $t$, with $W_{\text{gen}}$, $L_q$, and $L_k$ denoting trainable parameters:
- Vocabulary logits: $z^{\text{gen}}_t = W_{\text{gen}}\, h_t \in \mathbb{R}^{|V|}$
- Pointing logits for each patch $k \in \{1, \dots, K\}$: $z^{\text{ptr}}_{t,k} = \dfrac{(L_q h_t)^{\top} (L_k c_k)}{\sqrt{D}}$
- Joint distribution: $p_t = \operatorname{softmax}\!\big(\big[z^{\text{gen}}_t ; z^{\text{ptr}}_t\big]\big)$ over the $|V| + K$ output choices

Supervision combines a cross-entropy loss over this augmented output space, $\mathcal{L}_{\text{CE}} = -\sum_t \log p_t\!\left(y^{\ast}_t\right)$, with an additional regularization loss.
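As a toy illustration of the objective just described, the NumPy snippet below computes the cross-entropy over the augmented $|V| + K$ output space when one supervised target is a pointer index. The shapes, the offset convention (pointer targets stored as $|V| + k$), and the random logits are assumptions for illustration only.

```python
# Toy illustration (NumPy): cross-entropy over the augmented |V| + K output space,
# where targets >= |V| supervise pointer (evidence-token) emissions.
import numpy as np

V, K, T = 1000, 196, 3                      # vocab size, number of patches, sequence length
rng = np.random.default_rng(0)
joint_logits = rng.normal(size=(T, V + K))  # per-step logits over vocabulary + patches

# Supervised targets: two ordinary tokens and one evidence token pointing at patch 7.
targets = np.array([42, V + 7, 5])

log_probs = joint_logits - np.log(np.exp(joint_logits).sum(axis=1, keepdims=True))
ce_loss = -log_probs[np.arange(T), targets].mean()
print("cross-entropy over augmented vocabulary:", float(ce_loss))
```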
4. Dataset Construction and Annotation (v1g)
To support learning of evidence token usage, the v1g dataset was constructed, comprising 300,000 multimodal reasoning traces with explicit grounding annotations. Generation proceeds in three stages:
- Seed trace generation: Chain-of-thought answers are sampled from TVC-7B across nine domains (Charts, Documents, Geometry, IQ Tests, Medical, Natural Scenes, Science Diagrams, Synthetic, Tables).
- Visual-query decomposition: Steps referencing images are rewritten with explicit `detect(query, objects=[…])` calls and symbolic identifiers.
- Attention-based grounding: Cross-attention maps from a frozen Qwen2.5-VL backbone compute bounding boxes via ratio-of-marginal-to-conditional-attention masks; filtering yields ~82% retention.
In training, each detect call is treated as a supervised pointer target, ensuring models emit evidence tokens at contextually appropriate moments.
| Statistic | Value |
|---|---|
| Total traces | 300,000 |
| Avg. reasoning length | 6.2 steps |
| Avg. visual references | 2.1 per trace |
| Domains | 9 (∼33k traces each) |
| Patches pointed per trace | 4.8 |
| Bounding boxes per example | Mean 3.5, σ=1.2 |
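To make the annotation format described above concrete, here is a hypothetical sketch of what a single grounded v1g trace record might look like. The field names, coordinate convention, and patch-index mapping are assumptions for illustration and do not reproduce the released v1g schema.

```python
# Hypothetical sketch of a single grounded reasoning trace in a v1g-like format.
# Field names, the normalized bounding-box convention, and the patch indices are
# illustrative assumptions, not the released v1g schema.
example_trace = {
    "image": "charts/bar_chart_0412.png",              # hypothetical path
    "question": "Which category has the largest value?",
    "steps": [
        {
            "text": "Locate the tallest bar.",
            "detect": {"query": "tallest bar", "objects": ["bar_3"]},
            "bbox": [0.62, 0.10, 0.78, 0.85],           # normalized [x1, y1, x2, y2]
            "patch_indices": [52, 53, 66, 67],           # supervised pointer targets
        },
        {
            "text": "Read its label and value from the axis.",
            "detect": {"query": "x-axis label under bar_3", "objects": ["label_3"]},
            "bbox": [0.62, 0.86, 0.78, 0.95],
            "patch_indices": [178, 179],
        },
    ],
    "answer": "Category C",
}
```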
5. Empirical Performance and Ablations
On three multimodal mathematical reasoning benchmarks, v1’s evidence tokens delivered robust accuracy improvements over the Qwen2.5-VL backbone (7B parameters) and ablated variants. Table 1 summarizes accuracy across benchmarks, highlighting the contribution of point-and-copy:
| Model | MathVista | MathVision (mini) | MathVision (full) | MathVerse |
|---|---|---|---|---|
| Qwen2.5-VL (7B, base) | 67.8 | 23.6 | — | 44.5 |
| v1 (7B, full pointing) | 68.6 | 34.5 | 28.1 | 48.6 |
| v1 (no pointing at inference) | 60.0 | 25.3 | 23.7 | 33.6 |
Notably, v1’s accuracy on MathVision mini rises from 23.6% to 34.5% (+10.9 percentage points), a gain attributable directly to evidence token usage. Suppressing evidence tokens during inference produces a marked drop in accuracy, falling below even the un-finetuned backbone on MathVista and MathVerse. A coordinate-only variant (supervised on box coordinates without patch copying) achieves 31.9%, indicating the superiority of patch-level semantic copying over mere coordinate supervision.
Qualitatively, v1 outputs chain-of-thought traces interleaving textual reasoning and explicit pointers (e.g., “detect(‘hexagon B’, …)… <ptr₇>…”), with downstream steps showing heightened attention to copied embeddings.
6. Contextual Significance and Implications
The evidence token mechanism in v1 provides a lightweight, end-to-end differentiable solution for continuous multimodal reasoning, allowing MLLMs to re-access and directly integrate perceptually grounded visual features. This approach substantially mitigates the drift typical in chain-of-thought reasoning over lengthy sequences, where conventional models lose focus on relevant image regions. A plausible implication is that evidence tokens may generalize as a practical risk-reduction strategy for continual grounding in multimodal settings, potentially benefiting future MLLM architectures that require dynamic visual referencing. The validation of patch-level semantic copy as superior to call-by-coordinate further suggests that careful semantic integration is crucial for effective grounded reasoning in complex inference tasks.