ClueTracer: Suppressing Hallucinations in MLRMs
- ClueTracer is a plugin that suppresses hallucinations by localizing key visual evidence to counteract reasoning drift in multimodal models.
- It employs a training-free attention tracing method using token variance, trace scores, and DBSCAN clustering to confine focus to task-relevant regions.
- The approach improves both reasoning and non-reasoning models, yielding significant gains on benchmarks like HallusionBench and VMCBench without modifying model parameters.
ClueTracer is a training-free, parameter-free, architecture-agnostic plugin for suppressing hallucination in multimodal reasoning models by tracing task-relevant clues through question, output, and visual attention pathways. Developed to address the phenomenon of "reasoning drift"—wherein large multimodal language and reasoning models (MLRMs) lose grounding during multi-step inference—ClueTracer localizes decisive visual evidence and constrains model attention to suppress the generation of unsupported content. It improves both reasoning and non-reasoning MLLMs on a variety of established benchmarks, without modifying model parameters or requiring any additional training (Xi et al., 2 Feb 2026).
1. Reasoning Drift and the Motivation for ClueTracer
Reasoning drift denotes a failure mode unique to long-chain inference in MLRMs: during progressive reasoning, attention diffuses from a concise set of task-relevant visual regions toward a broader set of irrelevant patches, causing the model's answers to decouple from the true visual grounding. Mathematically, let $\alpha^{(t,\ell)}_{v}$ denote the cross-attention weight on visual token $v$ at decoding step $t$ and layer $\ell$, and let $\mathcal{R}$ and $\mathcal{I}$ denote the relevant and irrelevant visual regions. The instantaneous attention on each is

$$A^{(t,\ell)}_{\mathrm{rel}} = \sum_{v \in \mathcal{R}} \alpha^{(t,\ell)}_{v}, \qquad A^{(t,\ell)}_{\mathrm{irr}} = \sum_{v \in \mathcal{I}} \alpha^{(t,\ell)}_{v}.$$

Reasoning drift is characterized by a monotonic decrease in $A^{(t,\ell)}_{\mathrm{rel}}$ and an increase in $A^{(t,\ell)}_{\mathrm{irr}}$ as reasoning progresses (i.e., as $t$ grows). This failure mode is particularly acute for tasks where the inference chain involves multiple intermediate steps, each potentially introducing spurious attention to non-evidential regions (Xi et al., 2 Feb 2026).
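The drift statistics above can be illustrated on a synthetic attention tensor. The following sketch (not the paper's code; all shapes and the decay schedule are illustrative assumptions) constructs per-step attention whose relevant mass decays, then computes the $A_{\mathrm{rel}}$ and $A_{\mathrm{irr}}$ trajectories:

```python
import numpy as np

# Illustrative sketch of the reasoning-drift statistics.
# Shapes and the decay schedule are assumptions for demonstration only.
T, V = 8, 16                        # decoding steps, visual tokens
relevant = np.zeros(V, dtype=bool)
relevant[:4] = True                 # assume the first 4 patches are the true region

# Synthetic attention that drifts: relevant columns decay over steps.
attn = np.ones((T, V))
attn[:, relevant] *= np.linspace(2.0, 0.5, T)[:, None]
attn /= attn.sum(axis=1, keepdims=True)   # renormalize per step

A_rel = attn[:, relevant].sum(axis=1)     # relevant attention mass per step
A_irr = attn[:, ~relevant].sum(axis=1)    # irrelevant attention mass per step

# Drift: the relevant share decreases monotonically while the
# irrelevant share increases, and the two always sum to one.
print(A_rel.round(3))
print(A_irr.round(3))
```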
2. ClueRecall: Measuring Visual Clue Retrieval
To quantify a model’s retrieval of question-relevant evidence, the ClueRecall metric is introduced. For a dataset of $N$ perception-labeled instances with known ground-truth bounding boxes (visual patches) $\mathcal{G}_i$ and object categories, ClueRecall at layer $\ell$ measures the average fraction of ground-truth evidence patches recovered among the model's most-attended visual tokens:

$$\mathrm{ClueRecall}(\ell) = \frac{1}{N}\sum_{i=1}^{N} \frac{|\mathcal{P}^{(\ell)}_i \cap \mathcal{G}_i|}{|\mathcal{G}_i|},$$

where $\mathcal{P}^{(\ell)}_i$ denotes the top-attended visual patches at layer $\ell$ for instance $i$. The maximally informative layer $\ell^{*} = \arg\max_{\ell} \mathrm{ClueRecall}(\ell)$ is selected for all subsequent attention tracing. This provides an empirical method for identifying the attention layer at which evidence localization is most accurate (Xi et al., 2 Feb 2026).
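A minimal sketch of this layer-selection procedure, assuming per-layer attention maps over visual patches (the shapes and top-$k$ selection rule are illustrative assumptions, not the paper's exact protocol):

```python
import numpy as np

def clue_recall(attn_by_layer, gt_patches, k):
    """ClueRecall per layer: |top-k attended patches ∩ ground truth| / |ground truth|.

    attn_by_layer: (L, V) array of attention over V visual patches per layer.
    gt_patches: indices of ground-truth evidence patches.
    """
    gt = set(gt_patches)
    recalls = []
    for layer_attn in attn_by_layer:
        topk = set(np.argsort(layer_attn)[-k:])   # k most-attended patches
        recalls.append(len(topk & gt) / len(gt))
    return np.array(recalls)

# Toy example: 3 layers, 8 patches, ground-truth evidence = patches {1, 5}.
attn = np.array([
    [0.30, 0.10, 0.20, 0.10, 0.10, 0.05, 0.10, 0.05],  # layer 0: misses both
    [0.05, 0.40, 0.05, 0.05, 0.05, 0.30, 0.05, 0.05],  # layer 1: hits both
    [0.10, 0.35, 0.05, 0.10, 0.10, 0.08, 0.12, 0.10],  # layer 2: hits one
])
recalls = clue_recall(attn, [1, 5], k=2)
best_layer = int(np.argmax(recalls))   # the layer used for all tracing
print(recalls, best_layer)
```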
3. The ClueTracer Algorithm and Plugin Workflow
ClueTracer operates entirely at inference time and requires only access to model cross-attention tensors. Its workflow is summarized as follows:
- Initialization: The model is run on the input image and question; attention maps at the selected layer $\ell^{*}$ are collected.
- Question Token Selection: For every question token, the variance of its attention along the output axis is computed. Tokens with variance above a threshold $\tau_q$ are selected as influencing question tokens $\mathcal{Q}^{*}$.
- Tracing Visual Tokens: For each visual token $v$, a "trace score" $s(v)$ is computed by aggregating the flow of attention from each $q \in \mathcal{Q}^{*}$ through each output token to $v$.
- Region Localization: Visual tokens with trace scores above a threshold $\tau_v$ are spatially clustered (DBSCAN) into tight regions, which are then mapped to rectangular evidence patches $\mathcal{R}$.
- Attention Suppression: All non-selected visual tokens are masked in cross-attention; attention is renormalized so that future model outputs are primarily grounded in $\mathcal{R}$.
This procedure is presented as Algorithm 1 in (Xi et al., 2 Feb 2026) and is modular enough to integrate into any decoder-style MLRM.
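The steps above can be sketched end-to-end on synthetic attention tensors. Everything here is an illustrative assumption: the shapes, the thresholds, and the tiny union-find grouping that stands in for a full DBSCAN; the actual plugin reads real cross-attention via hooks.

```python
import numpy as np

def select_question_tokens(q2o_attn, tau_q):
    """Step 1: keep question tokens whose attention varies strongly
    across output tokens (variance along the output axis > tau_q)."""
    return np.where(q2o_attn.var(axis=1) > tau_q)[0]

def trace_scores(q2o_attn, o2v_attn, q_sel):
    """Step 2: aggregate attention flow question -> output -> visual:
    s(v) = sum over selected q, over outputs o, of A[q,o] * A[o,v]."""
    return q2o_attn[q_sel].sum(axis=0) @ o2v_attn   # shape (V,)

def cluster_patches(points, eps):
    """Step 3 (simplified stand-in for DBSCAN): union-find grouping of
    patch coordinates whose pairwise distance is at most eps."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

def suppress(attn_row, keep):
    """Step 4: mask non-evidence visual tokens and renormalize."""
    masked = np.where(np.isin(np.arange(len(attn_row)), keep), attn_row, 0.0)
    return masked / masked.sum()

# Toy run: 3 question tokens, 2 output tokens, a 2x3 grid of visual tokens.
q2o = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])     # (Q=3, O=2)
o2v = np.array([[0.40, 0.30, 0.10, 0.10, 0.05, 0.05],    # (O=2, V=6)
                [0.35, 0.35, 0.10, 0.10, 0.05, 0.05]])
q_sel = select_question_tokens(q2o, tau_q=0.05)          # drops the flat token
scores = trace_scores(q2o, o2v, q_sel)                   # per-visual-token trace
keep = np.where(scores > 0.25)[0]                        # trace-score threshold
coords = np.array([(i // 3, i % 3) for i in keep])       # grid coordinates
labels = cluster_patches(coords, eps=1.0)                # one evidence region
new_attn = suppress(o2v[0], keep)                        # renormalized attention
print(q_sel, keep, new_attn.round(2))
```

The design point the sketch makes explicit is that every step consumes only attention tensors: no weights are touched, which is why the method is training-free and parameter-free.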
4. Practical Characteristics: Training-Free, Parameter-Free, Architecture-Agnostic
ClueTracer imposes no additional parameterization or fine-tuning requirements. It leverages only native model attention activations and standard interfaces (e.g., attention hooks, region selection). The plugin can be directly applied to a range of architectures, including but not limited to R1-OneVision, Ocean-R1, MM-Eureka, LLaVA-1.6-Mistral, and Qwen2.5-VL. Because attention masks and region crops are universal mechanisms, ClueTracer is compatible with both reasoning-focused and non-reasoning MLLMs (Xi et al., 2 Feb 2026).
5. Quantitative and Qualitative Impact
On reasoning benchmarks (HallusionBench, VMCBench), ClueTracer provides consistent gains in answer accuracy. For example, R1-OneVision's accuracy on HallusionBench increases from 35.4% to 58.9%, and on VMCBench from 53.0% to 73.9%. Non-reasoning models also see notable improvements (e.g., LLaVA-1.6 from 25.0% to 44.5% on MMVP).
| Model | HallusionBench (aAcc) w/o CT | HallusionBench (aAcc) w/ CT | VMCBench Overall w/o CT | VMCBench Overall w/ CT | Avg Δ (%) |
|---|---|---|---|---|---|
| R1-OneVision | 35.4% | 58.9% | 53.0% | 73.9% | +22.2 |
| Ocean-R1 | 48.9% | 63.5% | 73.8% | 80.3% | +10.6 |
| MM-Eureka | 58.5% | 65.1% | 72.4% | 80.3% | +7.3 |
| Orsta-R1 | 55.5% | 60.1% | 68.4% | 78.4% | +7.3 |
Case studies further illustrate the mechanism: in tasks such as identifying if a batter is wearing a helmet or detecting a mouse on a desk, ClueTracer eliminates reasoning drift by pinning attention to the relevant image patches, correcting hallucinated responses (Xi et al., 2 Feb 2026).
6. Limitations and Potential Extensions
ClueTracer’s effectiveness is conditioned on sufficiently grounded model attention maps. If the model's native attention is diffuse or inaccurate, evidence patches may not reliably correspond to the true context. The method is also sensitive to the variance threshold $\tau_q$, the trace-score threshold $\tau_v$, and the clustering hyperparameters; these influence the precision-recall tradeoff in visual patch selection. For abstract queries lacking direct correspondence to visual tokens, evidence localization may fail. Potential enhancements include adaptive (e.g., learned) thresholding, tracing across multiple heads or layers, end-to-end region proposal integration, and extension to video or multi-view domains (Xi et al., 2 Feb 2026).
7. Conceptual Parallels to Textual Reasoning: The DetectBench Connection
A parallel exists between ClueTracer’s visual clue tracing and the textual detective reasoning pipelines developed for DetectBench and the Detective Thinking Framework. Both approaches decompose reasoning into sequential phases: (1) evidence extraction, (2) relation inference, (3) answer synthesis, and (4) evidence justification. In the textual case, this involves extracting and chaining context spans (Gu et al., 2023); in the multimodal case, ClueTracer routes attention through token-output-visual layers to anchor predictions in canonical evidence patches. Both demonstrate that the principal bottleneck is reliable identification and grounding of key clues. This suggests cross-modal transferability of clue aggregation and reasoning strategies (Gu et al., 2023, Xi et al., 2 Feb 2026).