
ClueTracer: Suppressing Hallucinations in MLRMs

Updated 9 February 2026
  • ClueTracer is a plugin that suppresses hallucinations by localizing key visual evidence to counteract reasoning drift in multimodal models.
  • It employs a training-free attention tracing method using token variance, trace scores, and DBSCAN clustering to confine focus to task-relevant regions.
  • The approach improves both reasoning and non-reasoning models, yielding significant gains on benchmarks like HallusionBench and VMCBench without modifying model parameters.

ClueTracer is a training-free, parameter-free, architecture-agnostic plugin for suppressing hallucination in multimodal reasoning models by tracing task-relevant clues through question, output, and visual attention pathways. Developed to address the phenomenon of "reasoning drift"—wherein multimodal large reasoning models (MLRMs) lose grounding during multi-step inference—ClueTracer localizes decisive visual evidence and constrains model attention to suppress the generation of unsupported content. It improves both reasoning and non-reasoning MLLMs on a variety of established benchmarks, without modifying model parameters or requiring any additional training (Xi et al., 2 Feb 2026).

1. Reasoning Drift and the Motivation for ClueTracer

Reasoning drift denotes a failure mode unique to long-chain inference in MLRMs: during progressive reasoning, attention diffuses from a concise set of task-relevant visual regions $\mathcal{R}_{\mathrm{rel}}$ toward a broader set of irrelevant patches $\mathcal{R}_{\mathrm{irr}}$, causing the model's answers to decouple from the true visual grounding. Mathematically, for cross-attention weights $A_{t,l,i}$ of token $i$ at decoding step $t$ and layer $l$, the instantaneous attention on relevant and irrelevant regions at the selected layer $L^*$ is:

$$g(t) = \sum_{v\in \mathcal{V}_{\mathrm{rel}}} A_{t,L^*,v}, \quad d(t) = \sum_{v\in \mathcal{V}_{\mathrm{irr}}} A_{t,L^*,v}$$

Reasoning drift manifests as a decrease in the mean relevant attention $\frac{1}{T}\sum_{t=1}^T g(t)$ and an increase in the mean irrelevant attention $\frac{1}{T}\sum_{t=1}^T d(t)$ as reasoning progresses. This failure mode is particularly acute for tasks where the inference chain involves multiple intermediate steps, each potentially introducing spurious attention to non-evidential regions (Xi et al., 2 Feb 2026).
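As a concrete illustration, $g(t)$ and $d(t)$ can be read directly off a layer's cross-attention matrix. The following sketch (Python/NumPy; the tensor shape, index sets, and toy numbers are illustrative assumptions, not the paper's implementation) computes both quantities per decoding step:

```python
import numpy as np

def attention_mass(A, rel_idx, irr_idx):
    """Split per-step attention over visual tokens into relevant/irrelevant mass.

    A       : [T, V] cross-attention weights at the chosen layer L*
              (row t = attention of decoding step t over V visual tokens).
    rel_idx : indices of visual tokens inside the relevant regions R_rel.
    irr_idx : indices of the remaining (irrelevant) visual tokens R_irr.
    Returns g, d with g[t] = sum over rel_idx of A[t, v], d[t] likewise.
    """
    g = A[:, rel_idx].sum(axis=1)
    d = A[:, irr_idx].sum(axis=1)
    return g, d

# Toy example of drift: attention shifts from relevant tokens (0-1)
# to irrelevant tokens (2-3) as decoding proceeds.
A = np.array([
    [0.6, 0.3, 0.05, 0.05],   # early step: well grounded
    [0.4, 0.2, 0.2,  0.2 ],
    [0.2, 0.1, 0.35, 0.35],   # late step: drifted
])
g, d = attention_mass(A, rel_idx=[0, 1], irr_idx=[2, 3])
# g decreases and d increases across steps: the signature of reasoning drift.
```

Tracking the means $\frac{1}{T}\sum_t g(t)$ and $\frac{1}{T}\sum_t d(t)$ over a full generation is then a one-line reduction on these vectors.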

2. ClueRecall: Measuring Visual Clue Retrieval

To quantify a model’s retrieval of question-relevant evidence, the ClueRecall metric is introduced. For a dataset $\mathcal{X}_{\mathrm{perc}}$ of perception-labeled instances with known ground-truth bounding boxes (visual patches) and object categories, ClueRecall at layer $l$ is defined as:

$$\mathrm{ClueRecall}(l) = \frac{1}{|\mathcal{X}_{\mathrm{perc}}|} \sum_{(\mathbf{X}_c,\mathbf{X}_v,\mathbf{X}_{\mathrm{bbox}},\mathrm{cat})} \frac{\left|\operatorname{TopK}_{|\mathbf{X}_{\mathrm{bbox}}|}\left(\sum_{t:y_t=\mathrm{cat}}\mathbf{A}_{t,l,\mathcal{V}}\right) \cap \mathbf{X}_{\mathrm{bbox}}\right|}{|\mathbf{X}_{\mathrm{bbox}}|}$$

The maximally informative layer $L_{\max} = \arg\max_l \mathrm{ClueRecall}(l)$ is selected for all subsequent attention tracing. This provides an empirical method for identifying the attention layer at which evidence localization is most accurate (Xi et al., 2 Feb 2026).
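The per-instance term inside the sum can be sketched as follows (Python/NumPy; the function name and the single-instance simplification are illustrative assumptions — the dataset-level score is the mean of this quantity over $\mathcal{X}_{\mathrm{perc}}$, and $L_{\max}$ is its argmax over layers):

```python
import numpy as np

def clue_recall_single(attn_over_visual, bbox_idx):
    """ClueRecall for one instance at one layer (sketch of the metric).

    attn_over_visual : [V] attention over visual tokens, pre-summed over the
                       output steps t whose token y_t matches the category
                       (i.e., sum_{t: y_t = cat} A_{t,l,V}).
    bbox_idx         : visual-token indices covered by the ground-truth box.
    Returns |TopK_{|bbox|}(attn) ∩ bbox| / |bbox|.
    """
    k = len(bbox_idx)
    topk = set(np.argsort(attn_over_visual)[-k:])  # k most-attended tokens
    return len(topk & set(bbox_idx)) / k

attn = np.array([0.05, 0.40, 0.30, 0.05, 0.20])
print(clue_recall_single(attn, bbox_idx={1, 2}))  # → 1.0 (top-2 = {1, 2})
```

Because the box has $|\mathbf{X}_{\mathrm{bbox}}| = 2$ tokens, the metric checks whether the two most-attended visual tokens coincide with the box.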

3. The ClueTracer Algorithm and Plugin Workflow

ClueTracer operates entirely at inference time and requires only access to model cross-attention tensors. Its workflow is summarized as follows:

  • Initialization: The model is run on the input image and question; attention maps at $L_{\max}$ are collected.
  • Question Token Selection: For every question token, the variance of its attention along the output axis is computed. Tokens with variance above a threshold $\tau_q$ are selected as influencing question tokens $\mathcal{X}_q^\star$.
  • Tracing Visual Tokens: For each visual token $v$, the trace score $T(v)$ is computed by aggregating the flow of attention from $\mathcal{X}_q^\star$ through each output token to $v$.
  • Region Localization: Visual tokens with trace scores above a threshold $\tau_v$ are spatially clustered (DBSCAN) into tight regions, which are then mapped to rectangular evidence patches $\mathcal{R}$.
  • Attention Suppression: All non-selected visual tokens are masked in cross-attention; attention is renormalized so that subsequent model outputs are grounded primarily in $\mathcal{R}$.

This procedure is realized in Algorithm 1 (as presented verbatim in (Xi et al., 2 Feb 2026)), and is modular for integration into any decoder-style MLRM.
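A minimal sketch of that workflow, assuming simplified tensor shapes and illustrative thresholds (the real plugin hooks live decoder attention; the function name, arguments, and the omission of the cluster-to-rectangle mapping are assumptions here, not the paper's API):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluetracer_step(A_q, A_v, coords, tau_q=0.0, tau_v=0.0, eps=1.5):
    """One inference-time pass of the ClueTracer workflow (illustrative).

    A_q    : [Q, T] attention from each question token to each output step.
    A_v    : [T, V] attention from each output step to each visual token.
    coords : [V, 2] patch-grid (row, col) position of each visual token.
    """
    # 1. Question-token selection: keep tokens whose attention varies
    #    strongly across output steps (variance above tau_q).
    q_star = np.var(A_q, axis=1) > tau_q

    # 2. Trace scores: route attention from selected question tokens through
    #    every output step to each visual token:
    #    T(v) = sum_q sum_t A_q[q, t] * A_v[t, v].
    trace = A_q[q_star].sum(axis=0) @ A_v          # [V]

    # 3. Region localization: spatially cluster high-score patches with
    #    DBSCAN (mapping clusters to rectangular patches R is omitted here).
    keep = np.where(trace > tau_v)[0]
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(coords[keep])

    # 4. Attention suppression: mask non-selected visual tokens in
    #    cross-attention and renormalize each row.
    mask = np.zeros(A_v.shape[1])
    mask[keep] = 1.0
    A_masked = A_v * mask
    A_masked = A_masked / (A_masked.sum(axis=1, keepdims=True) + 1e-12)
    return keep, labels, A_masked
```

After suppression, each decoding step's attention sums to one over the traced evidence tokens only, which is what keeps subsequent outputs grounded in $\mathcal{R}$.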

4. Practical Characteristics: Training-Free, Parameter-Free, Architecture-Agnostic

ClueTracer imposes no additional parameterization or fine-tuning requirements. It leverages only native model attention activations and standard interfaces (e.g., attention hooks, region selection). The plugin can be directly applied to a range of architectures, including but not limited to R1-OneVision, Ocean-R1, MM-Eureka, LLaVA-1.6-Mistral, and Qwen2.5-VL. Because attention masks and region crops are universal mechanisms, ClueTracer is compatible with both reasoning-focused and non-reasoning MLLMs (Xi et al., 2 Feb 2026).

5. Quantitative and Qualitative Impact

On reasoning benchmarks (HallusionBench, VMCBench), ClueTracer provides an average $1.21\times$ gain in answer accuracy. For example, R1-OneVision's accuracy on HallusionBench increases from 35.4% to 58.9%, and on VMCBench from 53.0% to 73.9%. Non-reasoning models also see notable improvements (e.g., LLaVA-1.6 from 25.0% to 44.5% on MMVP).

| Model | HallusionBench aAcc (w/o CT) | HallusionBench aAcc (w/ CT) | VMCBench Overall (w/o CT) | VMCBench Overall (w/ CT) | Avg Δ (pp) |
|---|---|---|---|---|---|
| R1-OneVision | 35.4% | 58.9% | 53.0% | 73.9% | +22.2 |
| Ocean-R1 | 48.9% | 63.5% | 73.8% | 80.3% | +10.6 |
| MM-Eureka | 58.5% | 65.1% | 72.4% | 80.3% | +7.3 |
| Orsta-R1 | 55.5% | 60.1% | 68.4% | 78.4% | +7.3 |

Case studies further illustrate the mechanism: in tasks such as identifying if a batter is wearing a helmet or detecting a mouse on a desk, ClueTracer eliminates reasoning drift by pinning attention to the relevant image patches, correcting hallucinated responses (Xi et al., 2 Feb 2026).

6. Limitations and Potential Extensions

ClueTracer’s effectiveness is conditioned on sufficiently grounded model attention maps. If the model's native attention is diffuse or inaccurate, evidence patches may not reliably correspond to the true context. The method is also sensitive to the thresholds $\tau_q$ and $\tau_v$ and to the clustering hyperparameters, which govern the precision-recall tradeoff in visual patch selection. For abstract queries lacking direct correspondence to visual tokens, evidence localization may fail. Potential enhancements include adaptive (e.g., learned) thresholding, tracing across multiple heads or layers, end-to-end region proposal integration, and extension to video or multi-view domains (Xi et al., 2 Feb 2026).

7. Conceptual Parallels to Textual Reasoning: The DetectBench Connection

A parallel exists between ClueTracer’s visual clue tracing and the textual detective reasoning pipelines developed for DetectBench and the Detective Thinking Framework. Both approaches decompose reasoning into sequential phases: (1) evidence extraction, (2) relation inference, (3) answer synthesis, and (4) evidence justification. In the textual case, this involves extracting and chaining context spans (Gu et al., 2023); in the multimodal case, ClueTracer routes attention through token-output-visual layers to anchor predictions in canonical evidence patches. Both demonstrate that the principal bottleneck is reliable identification and grounding of key clues. This suggests cross-modal transferability of clue aggregation and reasoning strategies (Gu et al., 2023, Xi et al., 2 Feb 2026).
