MWE-Bench: Multimodal Tool Reasoning Benchmark

Updated 4 July 2026

MWE-Bench is a multimodal benchmark designed to evaluate agentic, tool-integrated reasoning by combining image recognition with external retrieval for unique answers.
The benchmark covers six categories (Car, Animal, Plant, Person, Landmark, Sports) to stress multi-step reasoning and precise tool invocation decisions.
Empirical findings show that effective tool use substantially improves performance, with the best agent reaching an average accuracy of 75.35 through interleaved tool calls.

MindWatcher-Evaluate Bench (MWE-Bench) is a multimodal benchmark introduced within the MindWatcher system to evaluate agentic, tool-integrated reasoning (TIR) and multimodal chain-of-thought capabilities on image-text question answering tasks that require coordinated tool use in an interleaved thinking or ReAct-style paradigm (Chen et al., 29 Dec 2025). Its instances are constructed so that visual evidence and external knowledge are both necessary, with multi-step reasoning and retrieval required for a unique answer. Within the MindWatcher study, MWE-Bench functions simultaneously as an evaluation dataset, a diagnostic instrument for tool-call behavior, and the main benchmark used to assess both the 32B MindWatcher agent and distilled 2B, 3B, and 4B variants.

1. Definition, scope, and design rationale

MWE-Bench is defined as a benchmark for systematically evaluating agentic, tool-integrated reasoning under multimodal inputs. The benchmark is organized around image-text question answering tasks in which the answer cannot be obtained from the image alone or from raw parametric knowledge alone; instead, the task requires visual recognition, external retrieval, and multi-step reasoning over retrieved evidence (Chen et al., 29 Dec 2025). The six benchmark categories are Car, Animal, Plant, Person, Landmark, and Sports.

The stated motivation is that existing TIR and agent benchmarks are mostly text-only, are heavily oriented toward web search or DeepSearch-style tasks, rarely require fine-grained multimodal reasoning or image manipulation such as crop/zoom and region search, and often suffer from temporal leakage or ambiguous answers. The benchmark is therefore designed to emphasize vision-driven, tool-dependent tasks, with strict answer uniqueness and temporal consistency so that model-based judging and reinforcement-learning rewards remain reliable. The benchmark is explicitly targeted at agentic, multi-step tool invocation under an interleaved thinking paradigm rather than at direct-answer multimodal question answering alone (Chen et al., 29 Dec 2025).

This positioning distinguishes MWE-Bench from benchmarks centered primarily on static perception or text retrieval. A plausible implication is that MWE-Bench should be read not merely as a QA dataset but as a benchmarked environment for end-to-end policy quality in multimodal agents: perception, tool selection, retrieval sequencing, and final answer synthesis are jointly under evaluation.

2. Task structure, modalities, and benchmark composition

Each MWE-Bench instance is a multimodal QA problem built so that images and background knowledge are both needed, and tool use is necessary for a unique answer. The benchmark exercises image understanding, text retrieval, and planning over multi-step decisions about which tools to invoke, in what order, and how to combine the resulting evidence (Chen et al., 29 Dec 2025).

The category composition reported for the benchmark is as follows.

Category	Instances
Car	373
Animal	351
Plant	397
Person	63
Landmark	90
Sports	142

These categories jointly total 1,416 instances (approximately 1.4k). Car, Animal, and Plant emphasize fine-grained recognition coupled to external factual lookup; Person and Landmark require recognition of public figures or monuments followed by retrieval of career, historical, or geographic facts; Sports is derived from sports news and is explicitly time-anchored, with questions such as counts, rankings, and sponsorship facts valid at a specified date (Chen et al., 29 Dec 2025).

The input modality consists of at least one image together with a text question. The output is a textual answer, sometimes multi-part. For agentic systems, the reasoning process may interleave internal chain-of-thought, image operations, visual search, external text retrieval, webpage content extraction, and optional local code execution. The benchmark therefore combines a vision-to-text recognition stage with text-to-text retrieval and synthesis.

The published case studies illustrate the intended compositionality. A sports instance requires identifying an NBA player from an image, then retrieving endorsement news and current Nike signature-athlete counts. A landmark instance begins from a Mount Rushmore image and continues through historical retrieval about the presidents depicted and counter-espionage history. A device instance requires recognizing or inferring a product brand from an image and then synthesizing a multi-attribute functional description from web sources (Chen et al., 29 Dec 2025). These examples indicate that MWE-Bench systematically penalizes agents that either fail to trigger tool use or fail to chain multiple retrieval steps coherently.

3. Construction pipeline and data provenance

MWE-Bench is constructed separately from training data in order to avoid leakage, while preserving the same domain ontology as the MindWatcher training pipelines. For the five private-image-derived categories—Car, Animal, Plant, Person, and Landmark—the process begins from internal private image databases or knowledge entries that were excluded from training. The benchmark construction then expands each entity with auxiliary web-based information, uses closed-source LLMs for “uniqueness deconstruction” into core factual statements, constructs single-turn QA pairs, and synthesizes them into more complex multi-step reasoning tasks (Chen et al., 29 Dec 2025).

Verification includes both automated model-based filtering and manual expert review. The stated objectives of this stage are factual quality and temporal accuracy, so that answers remain valid at the benchmark’s reference time. The use of “uniqueness deconstruction” is central: atomic facts are intended as building blocks for unambiguous QA, reducing ambiguity in both supervision and evaluation.

The Sports subset follows a related but temporally disjoint pipeline based on open-sourced news. Articles are filtered to ensure non-empty bodies and at least one image, semantically audited so that only completed events with sufficient image-text alignment are retained, and then merged across entity- or event-consistent corpora from time points not overlapping with training. A powerful LLM extracts atomic facts, constraint-aware generation enforces temporal anchoring and visual-textual dependency, and subsequent cleaning yields the final 142 sports instances (Chen et al., 29 Dec 2025).
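The structural and temporal filters at the start of this pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Article` record, its field names, and the use of a single `training_cutoff` date to enforce temporal disjointness are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    """Hypothetical news record; field names are illustrative, not from the paper."""
    title: str
    body: str
    image_urls: list[str] = field(default_factory=list)
    published: date = date.min


def keep_article(article: Article, training_cutoff: date) -> bool:
    """Return True if the article passes the structural and temporal filters."""
    has_body = bool(article.body.strip())
    has_image = len(article.image_urls) > 0
    # Assumed form of temporal disjointness: keep only articles published
    # after the last date covered by training data.
    is_disjoint = article.published > training_cutoff
    return has_body and has_image and is_disjoint


def filter_articles(articles: list[Article], training_cutoff: date) -> list[Article]:
    """Apply keep_article to every candidate and preserve input order."""
    return [a for a in articles if keep_article(a, training_cutoff)]
```

The semantic audit, fact extraction, and constraint-aware generation stages depend on LLM calls and are not modeled in this sketch.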

An important infrastructural dependency is the MindWatcher Multimodal Retrieval Database (MWRD), a local image retrieval corpus spanning eight categories—Person, Car, Plant, Animal, Logo, Landmark, Fruit & Vegetable, and Dish—with approximately 50k entities and about 300k images in total, curated to precision greater than 99% (Chen et al., 29 Dec 2025). MWRD is not directly the source of benchmark questions; rather, it is the visual knowledge backend used by agents when solving MWE-Bench through grounding and visual search. Because many benchmark items require identifying visually subtle entities such as rare plants or specific car trims, benchmark difficulty is partly mediated by the quality and breadth of this retrieval substrate.

This construction strategy places MWE-Bench between static dataset curation and environment design. The benchmark is dataset-like in that its instances are fixed and curated, but environment-like in that successful performance depends on interaction with a tool layer whose behavior materially affects outcomes.

4. Evaluation protocol, metrics, and tool environment

MWE-Bench evaluation is performed in a ReAct/Agent mode in which the model interacts with an environment using a strict action schema. An episode begins with the user question and associated image or images. The agent produces a sequence

$Y = \{a_0, obs_0, a_1, obs_1, \dots, a_n\},$

where each action $a_t$ is either a thought wrapped in <think> ... </think> or a tool call wrapped in <tool_call> ... </tool_call> containing JSON with name and arguments; each observation $obs_t$ is inserted as <tool_response> ... </tool_response>; and termination occurs when the agent emits <answer> ... </answer> (Chen et al., 29 Dec 2025). No free text outside the required tags is permitted.
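A transcript in this schema can be split into ordered steps with a small parser. The sketch below is illustrative only: the function names are hypothetical, thought spans are skipped, and only the tool-call, tool-response, and answer tags named above are parsed.

```python
import json
import re

# Matches the three tags handled here; \1 forces the closing tag to
# match the opening one.
STEP_PATTERN = re.compile(
    r"<(tool_call|tool_response|answer)>(.*?)</\1>", re.DOTALL
)


def parse_episode(transcript: str) -> list[tuple[str, object]]:
    """Split a transcript into (tag, payload) steps in order of appearance."""
    steps = []
    for match in STEP_PATTERN.finditer(transcript):
        tag, body = match.group(1), match.group(2).strip()
        if tag == "tool_call":
            # Tool calls carry JSON with "name" and "arguments".
            payload = json.loads(body)
        else:
            payload = body
        steps.append((tag, payload))
    return steps


def final_answer(steps):
    """Return the text of the first <answer> step, or None if there is none."""
    for tag, payload in steps:
        if tag == "answer":
            return payload
    return None
```

Counting the `tool_call` steps returned by `parse_episode` gives the number of tool-call rounds in an episode, which is the quantity used in the diagnostic analyses of tool-triggering behavior.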

The evaluation metric is pass@1-style accuracy computed by an LLM-as-Judge. For benchmark item $i$, with ground-truth answer $g_i$ and model output $o_i$, correctness is

$R_{acc}^{(i)} = \begin{cases} 1.0 & \text{if Judge returns "1" (Correct)}, \\ 0.0 & \text{if Judge returns "0" (Incorrect)}. \end{cases}$

Category-level accuracy is then

$\text{Acc}_c = \frac{1}{N_c} \sum_{i=1}^{N_c} R_{acc}^{(i)},$

and the reported overall average is the mean across the six categories (Chen et al., 29 Dec 2025). All reported benchmark numbers use temperature $0.7$ with nucleus (top-$p$) sampling.
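The per-category accuracy and its unweighted mean can be computed as in the following sketch. The record format (a category name paired with the judge's "0"/"1" label) and the scaling to percentages are assumptions made for illustration.

```python
from collections import defaultdict


def category_accuracy(records):
    """records: iterable of (category, judge_label), judge_label in {"0", "1"}.

    Returns {category: accuracy in percent}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, label in records:
        totals[category] += 1
        correct[category] += 1 if label == "1" else 0
    return {c: 100.0 * correct[c] / totals[c] for c in totals}


def macro_average(per_category):
    """Unweighted mean over categories, so small categories count equally."""
    return sum(per_category.values()) / len(per_category)
```

Because the category sizes are unequal (for example, 63 Person items against 397 Plant items), this category-level mean generally differs from the accuracy pooled over all instances.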

The judge infrastructure also parses and semantically normalizes outputs, returning its verdict through a structured evaluation JSON schema. This design makes final-answer evaluation more robust to formatting variation, number or unit normalization, and synonymy.

The benchmark environment exposes the same tool APIs used in training: Region Cropping/Zooming; Object Grounding & Visual Search; External Text Retrieval; Webpage Content Retrieval (url_visit); and Local Code Interpreter (Chen et al., 29 Dec 2025). The object-grounding tool queries MWRD using a bounding box and category. External retrieval returns top search results with title, content, URL, and date. Webpage retrieval returns structured evidence and summaries. The code tool returns stdout, stderr, and a computed result. Evaluation therefore measures not only answer quality but also the agent’s ability to operate within a constrained interface for tool-mediated reasoning.
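How an environment might route one tool call to a backend and wrap the result as an observation is sketched below. The stub functions, the `text_search` name, and the returned field values are hypothetical; only the `url_visit` name and the name/arguments JSON layout come from the description above.

```python
import json
from typing import Callable


def url_visit(url: str) -> dict:
    """Stub for webpage retrieval; a real backend would fetch and summarize."""
    return {"url": url, "summary": "(stub) page summary"}


def text_search(query: str) -> list[dict]:
    """Stub for external text retrieval (hypothetical tool name)."""
    return [{"title": "(stub)", "content": query, "url": "", "date": ""}]


# Registry mapping tool names to callables.
TOOLS: dict[str, Callable[..., object]] = {
    "url_visit": url_visit,
    "text_search": text_search,
}


def dispatch(tool_call_json: str) -> str:
    """Execute one tool call and return the text of a <tool_response> block."""
    call = json.loads(tool_call_json)
    name, arguments = call["name"], call.get("arguments", {})
    if name not in TOOLS:
        # Report unknown tools back to the agent instead of raising.
        observation = {"error": f"unknown tool: {name}"}
    else:
        observation = TOOLS[name](**arguments)
    return "<tool_response>" + json.dumps(observation) + "</tool_response>"
```

Returning an error observation for an unregistered tool keeps the episode running, which is one possible design; a stricter environment could instead terminate the episode.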

Although MWE-Bench itself is evaluated by pass@1 accuracy, the MindWatcher training setup uses a hybrid reward that combines an outcome-accuracy term, a format-adherence term, and a penalty for hallucinated tool calls (Chen et al., 29 Dec 2025). This is not the benchmark metric proper, but it clarifies which behavioral regularities the authors attempted to optimize before testing on MWE-Bench.
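One way such a three-part reward could be composed is sketched below. The additive form and the default weights `w_format` and `w_halluc` are placeholders chosen for illustration, not values reported by the authors.

```python
def hybrid_reward(is_correct: bool, format_ok: bool, hallucinated_calls: int,
                  w_format: float = 0.1, w_halluc: float = 0.5) -> float:
    """Illustrative reward; the weights and the additive form are assumptions."""
    # Outcome accuracy: 1.0 when the judge marks the answer correct.
    r_acc = 1.0 if is_correct else 0.0
    # Format adherence: small bonus when the output obeys the tag schema.
    r_format = w_format if format_ok else 0.0
    # Flat penalty if the episode contains any hallucinated tool call.
    penalty = w_halluc if hallucinated_calls > 0 else 0.0
    return r_acc + r_format - penalty
```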

5. Empirical results and diagnostic findings

The main empirical pattern reported for MWE-Bench is a large gap between direct inference and tool-integrated agentic inference. In direct inference, the best average score among the listed models is Gemini 2.5 Pro at 42.09, while many strong multimodal models cluster substantially lower; GPT-4o scores 27.75, Qwen2.5-VL-32B scores 25.92, and GPT-5 mini scores 33.97 (Chen et al., 29 Dec 2025). This is presented as evidence that parametric knowledge alone is inadequate on the benchmark.

In ReAct/Agent mode, scores rise markedly. Reported averages include 66.65 for Gemini 2.5 Flash agent, 69.91 for GPT-5 mini agent, 66.95 for Qwen3-VL-32B Thinking (TIR), and 75.35 for MindWatcher-32B, which is reported as the best overall result on the benchmark (Chen et al., 29 Dec 2025). Distilled MindWatcher variants also score strongly: 64.76 for MindWatcher-2B, 64.48 for MindWatcher-3B, and 69.63 for MindWatcher-4B. Table 3 of the paper further reports large gains over the respective bases, including Qwen2.5-VL-3B Instruct improving from 24.93 to 64.48 after distillation into MindWatcher-3B.

Category-wise, MindWatcher-32B is strongest overall but not uniformly best in every subset. It reports 71.31 on Car, 86.04 on Animal, 88.92 on Plant, 77.78 on Person, 47.78 on Landmark, and 46.48 on Sports (Chen et al., 29 Dec 2025). Sports remains relatively difficult: GPT-5 mini agent reaches 80.28 on Sports, substantially above MindWatcher-32B, indicating that even within the benchmark’s intended regime, some subsets remain especially sensitive to external knowledge freshness and retrieval quality.

The paper also uses MWE-Bench as a lens for analyzing agent behavior rather than only final accuracy. One result concerns search-engine dependence: on sports subsets, average accuracy varies from 8.51 with Sogou to 16.66 with Bing and 34.81 with Quark, with especially large differences on Chinese football queries (Chen et al., 29 Dec 2025). This implies that measured agent capacity on MWE-Bench is substantially entangled with the quality of tool backends.

A second result concerns tool-triggering decisions. Using tool-call-round statistics, the study reports that GPT-5 mini chooses zero tool calls on nearly one-sixth of samples, achieving only about 51.2% accuracy in that bucket, a pattern described as “blind self-confidence,” whereas MindWatcher-32B calls tools more consistently and thereby achieves higher overall performance (Chen et al., 29 Dec 2025). This finding locates a major failure mode not in downstream reasoning after retrieval, but in the policy boundary that decides whether retrieval should be invoked at all.
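Tool-call-round statistics of this kind can be derived from per-episode logs, as in the following sketch. The input format, a pair of tool-call count and judged correctness per episode, is an assumption made for illustration.

```python
from collections import defaultdict


def accuracy_by_tool_rounds(episodes):
    """episodes: iterable of (num_tool_calls, is_correct).

    Returns {num_tool_calls: (share_of_samples, accuracy)}, with both
    values expressed as fractions in [0, 1].
    """
    counts = defaultdict(int)
    hits = defaultdict(int)
    total = 0
    for rounds, is_correct in episodes:
        counts[rounds] += 1
        hits[rounds] += int(is_correct)
        total += 1
    return {
        k: (counts[k] / total, hits[k] / counts[k])
        for k in sorted(counts)
    }
```

The entry for zero tool calls gives the share of samples on which the agent answered without retrieval and the accuracy within that bucket, which are the two quantities quoted above for GPT-5 mini.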

A third diagnostic result is the “genetic inheritance” or “Genetic Constraint” analysis. Comparing MindWatcher-32B to its base model Qwen2.5-VL-32B across tool-call rounds, the benchmark reveals that RL improves accuracy by a roughly constant offset while preserving an almost identical decay pattern as the number of required tool calls grows. The study summarizes this empirically with a relation of the form

$\text{Acc}_{\text{MindWatcher}}(k) \approx \text{Acc}_{\text{base}}(k) + \Delta,$

where $k$ is the number of tool-call rounds and $\Delta$ is roughly constant (Chen et al., 29 Dec 2025). This suggests that, within this setup, RL acts principally as a policy optimizer for tool use rather than as a mechanism that fundamentally changes long-range reasoning ceilings.
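Whether the gap between a tuned model and its base is roughly constant across tool-call rounds can be checked with a short computation. The sketch below is a hypothetical helper, assuming per-round accuracies are available as dictionaries keyed by the number of tool-call rounds.

```python
from statistics import mean, pstdev


def offset_profile(base_acc, tuned_acc):
    """Per-round accuracy gaps between a tuned model and its base.

    Both arguments map tool-call round k -> accuracy; only rounds present
    in both mappings are compared.
    """
    rounds = sorted(set(base_acc) & set(tuned_acc))
    gaps = [tuned_acc[k] - base_acc[k] for k in rounds]
    return {
        "rounds": rounds,
        "gaps": gaps,
        "mean_gap": mean(gaps),   # estimate of the constant offset
        "spread": pstdev(gaps),   # near zero if the offset really is constant
    }
```

A small `spread` relative to `mean_gap` is consistent with the constant-offset reading, while a spread that grows with the round index would indicate that the two decay curves diverge.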

6. Comparative context, limitations, and benchmark evolution

MWE-Bench is explicitly compared with MMSearch, SimpleVQA, and WebWalkerQA. The claimed distinction is that these external benchmarks, even after filtering, remain more text-dominant and less dependent on image manipulation or cross-modal coordination, whereas MWE-Bench is tightly coupled to an agentic tool suite centered on visual seeds, local image retrieval, external search, and multi-step synthesis (Chen et al., 29 Dec 2025). Its core evaluand is therefore not only factual answering but multimodal planning over tools.

Relative to LLF-Bench, which evaluates interactive learning from natural-language feedback across eight sequential decision-making task families, MWE-Bench addresses a different axis of capability: tool-integrated multimodal reasoning rather than adaptation from feedback (Cheng et al., 2023). The comparison is still methodologically informative. LLF-Bench emphasizes unified interfaces, environment randomization, and the distinction between learning signals and evaluation signals. This suggests a useful interpretive contrast: MWE-Bench is strong on realistic multimodal tool orchestration, while LLF-Bench more directly targets interactive adaptation under language feedback. A plausible implication is that the two benchmarks probe complementary layers of agent competence rather than interchangeable ones.

The benchmark also has identifiable limitations. Domain coverage is restricted to six categories. Performance is tightly coupled to the tool environment, including search-engine quality and the coverage of MWRD. Some tasks remain entangled with base-model world knowledge, as illustrated by the Manuela Sáenz case in which tools are insufficient unless the model already knows the relevant figure name (Chen et al., 29 Dec 2025). The benchmark therefore does not fully isolate TIR capability from parametric knowledge; when the provided tools cannot bridge the missing prior, the test partially regresses to a test of world knowledge.

A separate methodological issue concerns benchmark maintenance. ArenaBencher argues that static benchmarks become unreliable under data leakage, memorization, and saturation; its argument implies that, for a benchmark like MWE-Bench, automatic benchmark evolution would be relevant for maintaining difficulty, uncovering new failure modes, preserving test objectives, and improving model separability (Liu et al., 9 Oct 2025). ArenaBencher formalizes these goals with metrics for Difficulty, Separability, Fairness, and Alignment and proposes an iterative generation-verification-selection loop over an existing benchmark. MWE-Bench itself is not reported as having been evolved by this framework. However, the connection is direct: because MWE-Bench is intended to diagnose tool-integrated reasoning under a rapidly changing model and tool ecosystem, the ArenaBencher argument suggests that static benchmark validity may degrade over time unless refreshed with alignment-preserving updates (Liu et al., 9 Oct 2025).

Finally, MWE-Bench is distinct from domain-level metacognitive monitoring benchmarks that measure confidence discrimination, such as the 33-model MMLU atlas based on Type-2 AUROC (Cacioli, 21 Apr 2026). That work shows that aggregate scores can conceal strong domain-specific variation in monitoring quality. This does not change MWE-Bench’s definition, which is centered on pass@1 correctness under tool use, but it suggests a plausible extension: benchmark reports for agentic systems could supplement task accuracy with domain-profile diagnostics about tool-triggering confidence, selective retrieval, or self-monitoring of answer reliability. Such an extension would align with MWE-Bench’s existing role as a diagnostic lens for policy behavior rather than merely a leaderboard substrate.
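For reference, a Type-2 AUROC can be computed directly from per-item confidences and correctness labels. The sketch below uses the pairwise-comparison definition and is a generic illustration, not code from either cited work; how an agent's confidence would be elicited is left open.

```python
def type2_auroc(confidences, correct):
    """Probability that a correct answer receives higher confidence than an
    incorrect one, with ties counted as one half.

    Returns None when either the correct or the incorrect group is empty.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 means confidence carries no information about correctness, and 1.0 means it separates correct from incorrect answers perfectly. Computing the value per category would give the kind of domain profile described above.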

In its published form, MWE-Bench is best understood as a systematic benchmark for multimodal, interleaved, tool-dependent reasoning whose primary contribution lies in coupling curated image-text tasks to an executable tool environment and a strict agent protocol (Chen et al., 29 Dec 2025). Its empirical results indicate that tool invocation policy, retrieval backend quality, and multimodal decomposition strategy are all first-order determinants of benchmark performance, making it a benchmark not just of what a model knows, but of how effectively it can convert perception into action and retrieved evidence into correct final answers.