- The paper presents DailyClue, a benchmark designed to isolate and evaluate visual clue-driven reasoning capabilities in MLLMs across everyday scenarios.
- It demonstrates that even state-of-the-art models rarely surpass 60% accuracy, highlighting deficiencies in evidence extraction and chain-of-thought reasoning.
- The study emphasizes the necessity for clue-grounded training and reveals the current limitations of agentic tool integration in advancing multimodal reasoning.
Seek-and-Solve: Benchmarking Visual Clue-Driven Reasoning in Multimodal LLMs
Motivation and Benchmark Design
The paper presents DailyClue, a benchmark aimed at isolating and rigorously evaluating the visual clue-driven reasoning capabilities of Multimodal LLMs (MLLMs). Existing VQA and multimodal benchmarks primarily assess perception and factual retrieval, with reasoning tasks often confounded by reliance on memorized priors or direct textual cues. DailyClue directly targets the bottleneck of visual reasoning: models must actively filter rich, noisy daily-life scenes to identify decisive visual evidence and employ it within a chain-of-thought (CoT) process.
The benchmark features four core domains—location identification, spatial relationship reasoning, daily commonsense reasoning, and scientific commonsense reasoning—spanning sixteen subtasks and 666 curated question-clue-answer triplets. Generation is model-assisted but strictly human-verified, and filtering employs consensus rejection among multiple top-tier MLLMs and rigorous manual review. The difficulty is calibrated such that all questions require deduction via visual clues rather than superficial inspection or world knowledge.
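To make the triplet structure concrete, the sketch below shows one plausible way a DailyClue record could be represented. The field names (image_path, clues, etc.) are illustrative assumptions for exposition, not the paper's released schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DailyClueItem:
    """One question-clue-answer triplet (illustrative schema; field names are assumed)."""
    image_path: str      # daily-life scene the question is grounded in
    domain: str          # e.g. "location", "spatial", "daily_commonsense", "scientific_commonsense"
    subtask: str         # one of the sixteen subtasks under the four domains
    question: str        # requires deduction from visual clues, not direct lookup
    clues: List[str]     # ground-truth visual clues that decide the answer
    answer: str          # human-verified final answer
```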
Figure 1: DailyClue's hierarchical taxonomy, scenario distribution, and data sources, illustrating full coverage across daily reasoning subtasks.
Construction Pipeline and Task Examples
The pipeline comprises three stages: (1) image collection from diverse sources, prioritizing visual richness and common daily scenarios; (2) candidate triplet synthesis with GPT-5 and Gemini-2.5-Pro, where prompt engineering enforces the need for indirect reasoning; and (3) filtering by peer-model consensus and expert annotators, which rejects trivial and hallucinated samples. A sketch of the consensus-rejection step follows.
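One plausible reading of the consensus-rejection stage is sketched below, assuming a candidate is discarded when most top-tier reviewer models already answer it correctly without the clue (i.e., the question is too trivial); hallucinated clues are caught separately by manual review. The interface and threshold are assumptions, not the paper's exact procedure.

```python
from typing import Callable, List

def consensus_reject(item: "DailyClueItem",
                     reviewers: List[Callable[[str, str], str]],
                     reject_threshold: float = 0.5) -> bool:
    """Return True if a candidate triplet should be discarded as too trivial.

    Each `reviewer` is assumed to map (image_path, question) -> answer string
    without access to the ground-truth clue. If a majority of strong reviewer
    models answer correctly anyway, the question does not require clue-driven
    reasoning and is rejected.
    """
    correct = sum(
        reviewer(item.image_path, item.question).strip().lower()
        == item.answer.strip().lower()
        for reviewer in reviewers
    )
    return correct / max(len(reviewers), 1) >= reject_threshold
```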
In Spatial Relationship Reasoning, the system and user prompts are crafted to pose indirect questions about occlusion and motion. For Scientific Commonsense, questions are built around scenarios where external domain knowledge alone is insufficient and deduction requires precise extraction of visual semantics. A set of qualitative examples underscores that even tasks trivial for humans pose significant reasoning challenges for MLLMs, owing to the need for clue localization and integration.
Figure 2: DailyClue construction pipeline detailing staged data sourcing and filtering.
Figure 3: DailyClue scenario examples highlighting non-trivial reasoning tasks and visual complexity.
Experimental Results and Analysis
Twenty-five MLLMs (open-source, closed-source, and agentic/tool-based) are benchmarked alongside a human baseline (undergraduate annotators), under both the General and Rigorous Evaluation Protocols. The latter demands not only answer correctness but also an explicit intersection between model-extracted and ground-truth visual clues.
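A minimal sketch of how the two protocols could be scored is given below, assuming ground-truth clues are strings and clue comparison is delegated to a judge function (e.g., an LLM-based semantic match); the exact matching criterion is an assumption, not the paper's implementation.

```python
from typing import Callable, List

def general_score(pred_answer: str, gold_answer: str) -> bool:
    """General Protocol: answer correctness only (illustrative exact match)."""
    return pred_answer.strip().lower() == gold_answer.strip().lower()

def rigorous_score(pred_answer: str,
                   gold_answer: str,
                   pred_clues: List[str],
                   gold_clues: List[str],
                   clue_match: Callable[[str, str], bool]) -> bool:
    """Rigorous Protocol: the answer must be correct AND at least one
    model-extracted clue must intersect the ground-truth clue set.
    `clue_match` is an assumed judge comparing two clue descriptions."""
    if not general_score(pred_answer, gold_answer):
        return False
    return any(clue_match(p, g) for p in pred_clues for g in gold_clues)
```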
Key findings: even state-of-the-art models rarely surpass 60% accuracy, with deficiencies concentrated in evidence extraction and chain-of-thought reasoning; the analyses below detail the clue-injection, CoT-ablation, and rigorous-evaluation results.
Visual Clue Injection and Chain-of-Thought Ablations
Explicit injection of high-quality visual clues markedly improves reasoning accuracy across all models. Conditioning Gemini-2.5-Pro and Claude-3.7 on ground-truth clues raises accuracy by 1.65% and 14.86%, respectively. However, conditioning on inferior clues (e.g., those extracted by Qwen2.5-VL-72B) degrades performance by roughly 4%, demonstrating textual bias and visual sycophancy: MLLMs favor injected textual context over their own visual perception, which can induce incorrect reasoning trajectories.
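A hypothetical prompt-assembly sketch of the clue-injection ablation is shown below: the same question is posed with no clues, with ground-truth clues, or with clues extracted by a weaker model. The wording of the instructions is assumed, not taken from the paper.

```python
from typing import List, Optional

def build_prompt(question: str, injected_clues: Optional[List[str]] = None) -> str:
    """Assemble the text prompt paired with the image for one ablation condition.

    injected_clues=None        -> baseline: the model must find clues itself
    injected_clues=gold_clues  -> ground-truth clue injection (upper bound)
    injected_clues=weak_clues  -> clues from a weaker model (probes sycophancy)
    """
    parts = []
    if injected_clues:
        parts.append("The following visual clues were observed in the image:")
        parts.extend(f"- {clue}" for clue in injected_clues)
    parts.append(question)
    parts.append("Reason step by step from the visual evidence, then state your final answer.")
    return "\n".join(parts)
```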
Chain-of-thought ablation reveals that enforcing active clue extraction within reasoning—rather than vanilla CoT or direct answering—consistently boosts accuracy. Clue-guided CoT acts as an anchoring constraint, mitigating reasoning drift and hallucination.
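The three answering modes compared in this ablation can be illustrated with the prompt templates below; the exact instructions used in the paper are not reproduced here, so the wording is an assumption.

```python
# Illustrative instruction templates for the CoT ablation conditions.
PROMPT_TEMPLATES = {
    "direct": (
        "{question}\nAnswer with the final answer only."
    ),
    "vanilla_cot": (
        "{question}\nThink step by step, then give the final answer."
    ),
    "clue_guided_cot": (
        "{question}\n"
        "First list the visual clues in the image that are relevant to the question, "
        "then reason over those clues step by step, and finally give the answer."
    ),
}

def render(mode: str, question: str) -> str:
    """Fill the chosen template with the benchmark question."""
    return PROMPT_TEMPLATES[mode].format(question=question)
```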
Figure 5: Answer generation under different clue contexts, illustrating the impact of clue source quality.
Figure 6: Visual clue-guided CoT reasoning consistently improves accuracy across MLLMs.
Rigorous Evaluation Findings
Rigorous Evaluation underlines the instability of reasoning pathways: models often arrive at correct answers via illusory or irrelevant clues, i.e., they are "right for the wrong reason." Accuracy drops are modest for top models (Gemini-2.5-Pro: -0.44%) but larger for others (Qwen2.5-VL-72B: -1.95%), indicating residual reliance on superficial or confounded evidence. Qualitative visualization reaffirms that Rigorous Evaluation is essential for distinguishing genuine reasoning from random guessing.
Figure 7: Visualizations of clues used by MLLMs under the rigorous protocol, exposing illusory and useless reasoning anchors.
Implications and Future Directions
DailyClue exposes both strengths and persistent deficiencies in multimodal reasoning. Accurate visual clue extraction is presently the primary bottleneck, with even large proprietary models (e.g., Gemini-2.5-Pro, GPT-5) failing to saturate benchmark accuracy. The findings emphasize:
- Necessity for Clue-Grounded Training: Further pretraining and fine-tuning explicitly targeting clue grounding, and architectural innovations that reduce textual bias and sycophancy, are required.
- Agentic Integration Limits: Current tool-use augmentation does not sufficiently overcome clue extraction weaknesses; future multimodal agents must improve active visual exploration, multi-step evidence aggregation, and adaptive tool employment in complex daily scenarios.
- Benchmark Utility: DailyClue's moderate scale, rich scenario complexity, and rigorous construction make it a discriminative platform for cross-family and cross-paradigm evaluation.
Future research should expand scenario diversity and data volume, and incorporate dynamic video-based reasoning. Addressing textual bias, evidence hallucination, and compositional reasoning fidelity will be pivotal for next-generation multimodal agentic intelligence.
Conclusion
DailyClue establishes a new standard for clue-driven visual reasoning evaluation in daily scenarios, revealing substantial gaps in current MLLM performance, particularly in evidence localization. Benchmark results clearly demonstrate that robust reasoning in complex environments is contingent upon accurate clue identification and grounded CoT processes; improving these mechanisms should be a central focus for forthcoming multimodal model research and development.