RealX-Bench for Agentic Multimodal Reasoning

Updated 10 November 2025
  • RealX-Bench is a comprehensive benchmark that assesses agentic multimodal reasoning by integrating visual perception, dynamic knowledge retrieval, and logical reasoning.
  • It challenges models with tasks requiring fine-grained image analysis, multi-hop web searches, and multi-step synthesis to deliver precise, programmatically verifiable answers.
  • Empirical results highlight significant performance gaps between current MLLMs and human-level integration, emphasizing the need for improved autonomous tool invocation.

RealX-Bench is a comprehensive benchmark designed to evaluate real-world agentic multimodal reasoning, targeting scenarios that require the autonomous integration of visual perception, dynamic knowledge retrieval, and multi-step logical reasoning. It was introduced to assess the limitations of modern multimodal large language models (MLLMs) on tasks characteristic of real-world problem solving, such as dynamically invoking external tools (e.g., code execution, web search) and synthesizing heterogeneous information into actionable answers (Hong et al., 7 Nov 2025).

1. Motivation and Core Objectives

RealX-Bench addresses deficits observed in existing benchmarks for MLLMs, which typically assess models in isolated settings—only perception, retrieval, or reasoning, but not their orchestration. Contemporary MLLMs can perform discrete tasks (e.g., generating answers from text, classifying images) but lack agentic autonomy: the ability to initiate sub-tasks like region cropping, numerical computation, or evidence retrieval as required by the problem context. RealX-Bench was created to impose rigorous and realistic requirements, compelling models to seamlessly coordinate:

  • Fine-grained visual perception (identifying small or occluded regions);
  • Dynamic knowledge retrieval (performing multi-hop web searches on both images and text);
  • Complex, multi-step logical reasoning (synthesizing cross-modal evidence and verifying intermediate hypotheses).

The benchmark’s design principles emphasize challenge level (each example is difficult along ≥1 dimension), real-world diversity (example sources: daily life, media, sports, factual knowledge, and games), and objective verifiability (all answers admit a unique, short, programmatically checkable response).

2. Dataset Construction and Task Taxonomy

RealX-Bench comprises 300 image-question pairs sampled from five everyday domains: Daily Life, Media, Sports, Factual Knowledge, and Games. Each instance is annotated along three orthogonal difficulty dimensions:

  1. Perception-challenging: Tasks that stress fine visual acuity, such as reading small text or isolating targets in cluttered images.
  2. Search-challenging: Queries demanding the retrieval of external, non-trivial knowledge (e.g., historical facts, rare terminology) via web search or similar interfaces.
  3. Reasoning-challenging: Problems that require multi-hop deduction, synthesis of intermediate results, or explicit logical explanation.

Approximately 24% of examples are "Integration" tasks, necessitating the coordinated use of all three abilities within a single trajectory.

Task Type Breakdown

| Subset      | Number of Examples |
|-------------|--------------------|
| Perception  | 87                 |
| Search      | 92                 |
| Reasoning   | 68                 |
| Integration | 72                 |

(Counts are overlapping: examples may fall under multiple categories.)
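
To make the overlapping taxonomy concrete, the sketch below shows one way an instance and its difficulty labels could be represented and the per-category counts tallied. The field and label names are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field
from collections import Counter

# Hypothetical instance schema; field and label names are illustrative,
# not taken from the RealX-Bench release.
@dataclass
class BenchInstance:
    image_path: str
    question: str
    answer: str                                 # unique, short-form ground truth
    domain: str                                 # e.g. "Daily Life", "Media", "Sports"
    labels: set = field(default_factory=set)    # subset of {"perception", "search", "reasoning"}

    @property
    def is_integration(self) -> bool:
        # Assumption: "Integration" examples are those requiring all three abilities.
        return {"perception", "search", "reasoning"} <= self.labels

def subset_counts(instances) -> Counter:
    """Tally examples per category; totals overlap because labels do."""
    counts = Counter()
    for inst in instances:
        counts.update(inst.labels)
        if inst.is_integration:
            counts["integration"] += 1
    return counts
```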

3. Evaluation Protocol and Metrics

RealX-Bench adopts a strictly quantitative evaluation, using accuracy as its sole metric. Each question has a unique, short-form answer enclosed in <answer>...</answer> tags to support automated verification. Let $N_{\mathrm{all}} = 300$ and let $N_{\mathrm{perc}}, N_{\mathrm{search}}, N_{\mathrm{re}}, N_{\mathrm{int}}$ be the sizes of the perception, search, reasoning, and integration subsets. For each example $i$, with predicted answer $\hat{y}_i$ and ground truth $y_i$:

  • $\text{Accuracy}_{\mathrm{all}} = \frac{1}{N_{\mathrm{all}}} \sum_{i=1}^{N_{\mathrm{all}}} \mathbf{1}(\hat{y}_i = y_i)$
  • Category-wise accuracy is defined analogously over each subset.

The primary “Average” performance in comparative studies corresponds to $\text{Accuracy}_{\mathrm{all}}$.
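
To make the scoring concrete, the following is a minimal sketch of the exact-match protocol, assuming answers are extracted from the <answer>...</answer> tags described above. The string normalization (lowercasing, whitespace stripping) is an assumption; the paper only specifies a programmatically checkable short answer.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(model_output: str) -> str:
    """Pull the short-form answer out of the <answer>...</answer> tags."""
    match = ANSWER_RE.search(model_output)
    return match.group(1).strip() if match else ""

def accuracy(predictions, ground_truths) -> float:
    """Exact-match accuracy over a (sub)set of examples.

    Normalization here is an assumption made for illustration.
    """
    correct = sum(
        extract_answer(pred).lower() == truth.strip().lower()
        for pred, truth in zip(predictions, ground_truths)
    )
    return correct / len(ground_truths)
```

Applied to all 300 examples this yields the overall accuracy; restricting the inputs to one annotated subset yields the category-wise scores.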

4. Benchmark Instances and Example Tasks

RealX-Bench is exemplified by questions that stress distinct agentic capabilities. Representative cases include:

  • Perception: “Identify the last digit on the license plate of the blue car at the far right.” (crowded street scene)
  • Search: “What year was the building, shown here, completed?” (photograph of a plaque)
  • Reasoning: “Between April and September, did the average monthly sales increase or decrease?” (chart image)
  • Integration: Multi-stage (e.g., “Crop the central bloom, identify flower genus via image search, retrieve common name, and explain why petals are translucent.”)

Each problem requires the model to formulate a minimal, precise answer, reflecting the outcome of potentially complex tool invocation and multi-modal evidence gathering.
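
The integration example above implies a perceive–retrieve–reason loop with explicit tool calls. The sketch below illustrates one such loop; the tool names (crop_image, image_search, run_code) and the decide_next_action interface are hypothetical stand-ins for whatever the evaluated agent exposes, not the DeepEyesV2 implementation.

```python
# Illustrative agent loop; tool names, the llm/tools interfaces, and the
# control flow are assumptions, not the method described in the paper.
def solve_integration_task(image, question, llm, tools, max_steps=8):
    context = [("image", image), ("question", question)]
    for _ in range(max_steps):
        action = llm.decide_next_action(context)        # hypothetical MLLM interface
        if action.name == "crop_image":                 # fine-grained perception
            context.append(("crop", tools.crop_image(image, action.box)))
        elif action.name == "image_search":             # dynamic knowledge retrieval
            context.append(("search", tools.image_search(action.query_image)))
        elif action.name == "run_code":                 # numerical / logical reasoning
            context.append(("result", tools.run_code(action.code)))
        elif action.name == "final_answer":
            return f"<answer>{action.text}</answer>"
    return "<answer></answer>"                          # step budget exhausted
```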

5. Experimental Results and Comparative Analysis

RealX-Bench is used to benchmark DeepEyesV2 and several state-of-the-art MLLMs. The following summarizes key comparative accuracies (in %):

| Model             | Avg  | Perception | Reasoning | Search | Integration |
|-------------------|------|------------|-----------|--------|-------------|
| Qwen2.5-VL-7B     | 22.3 | 17.1       | 16.3      | 19.9   | 9.7         |
| Qwen2.5-VL-32B    | 32.0 | 27.4       | 29.2      | 31.8   | 23.6        |
| Gemini 2.5 Pro    | 46.0 | 41.5       | 33.7      | 43.6   | 27.8        |
| DeepEyesV2        | 28.3 | 19.5       | 22.5      | 28.9   | 18.1        |
| Human Performance | 70.0 | 69.5       | 63.5      | 62.1   | 51.4        |

Relative to the 7B baseline, DeepEyesV2 achieves a +6.0 pp improvement overall, with more pronounced gains on search (+9.0 pp) and integration (+8.4 pp). However, a substantial gap remains between machine and human performance, particularly on integration tasks (DeepEyesV2: 18.1%; human: 51.4%).
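
The quoted deltas follow directly from the table; a short worked computation (values copied from the table, in percentage points):

```python
# Per-category deltas of DeepEyesV2 over the Qwen2.5-VL-7B baseline,
# computed from the accuracy table above (values in %).
baseline   = {"avg": 22.3, "perception": 17.1, "reasoning": 16.3, "search": 19.9, "integration": 9.7}
deepeyesv2 = {"avg": 28.3, "perception": 19.5, "reasoning": 22.5, "search": 28.9, "integration": 18.1}

deltas = {k: round(deepeyesv2[k] - baseline[k], 1) for k in baseline}
print(deltas)  # {'avg': 6.0, 'perception': 2.4, 'reasoning': 6.2, 'search': 9.0, 'integration': 8.4}
```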

Observed Model Behaviors

  • On search challenges, DeepEyesV2 leverages both image and text-based web retrieval.
  • Reasoning tasks show moderate improvement due to tool-supported code execution (e.g., analytical computation).
  • Performance on fine-grained perception is limited, especially in tasks that require sub-image manipulation.
  • Tool invocation exhibits adaptive specificity: image cropping for perception, code execution for reasoning, and multi-tool orchestration for integration queries. After reinforcement learning, the frequency of tool calls decreases while the diversity of tool combinations increases, particularly on integration tasks.

6. Limitations

Despite its comprehensiveness, RealX-Bench is subject to several constraints:

  • Scale: With 300 instances, statistical diversity is limited; there is potential for coverage gaps in real-world phenomena.
  • Static Modality: The benchmark operates exclusively on still image data and web search; temporal modalities (video, audio) are not included.
  • Tool Scope: Only elementary tools (e.g., image crop, code execute, standard web search) are admissible. Scenarios necessitating knowledge-base queries or advanced perceptual modules (e.g., OCR, face recognition) are out of scope.

This suggests that existing models and benchmarks may not fully reflect the breadth of agentic requirements encountered in complex real-world environments.

7. Future Directions

A plausible implication is that future benchmarking efforts will need to:

  • Extend Modalities: Incorporate dynamic media such as video and speech for temporal and auditory reasoning tasks.
  • Expand Tooling: Support richer environments, including access to knowledge graphs, database interaction, robotics simulations, and advanced perceptual APIs.
  • Increase Scale: Substantially broaden instance count and domain diversity for improved generalizability and resilience to overfitting.
  • Refine Task Complexity: Introduce more layers of necessary reasoning, adaptive planning, and interactive tool use to approach the sophistication of human problem-solving.

RealX-Bench is therefore positioned as an early but significant step toward holistic evaluation of agentic multimodal models. Its adoption encourages the development of architectures capable of context- and task-adaptive tool invocation, bridging the gap between passive multimodal understanding and actionable, autonomous reasoning (Hong et al., 7 Nov 2025).
