VTC-Bench: Visual Toolchain Benchmark

Updated 4 July 2026

VTC-Bench is a benchmark for evaluating tool-use proficiency where models must plan and chain multiple OpenCV operations to solve complex visual tasks.
It organizes 680 curated problems into a nine-category cognitive hierarchy that spans tasks from pre-processing to high-level compositional visual reasoning.
The benchmark evaluates not only final answer accuracy but also tool selection, multi-step composition efficiency, and adherence to ground-truth execution trajectories.

VTC-Bench, short for VisualToolChain-Bench, is a benchmark for evaluating tool-use proficiency in multimodal large language models (MLLMs), in a setting that the literature frames as a shift from passive visual understanding to agentic multimodal reasoning. Rather than testing only visual question answering on fixed inputs, it evaluates whether a model can decide when to use tools, choose appropriate visual operations, and chain them across multiple steps to solve complex visual problems. The benchmark is built around a broad OpenCV-style tool library, a curated dataset of 680 problems, and ground-truth execution trajectories intended to expose limitations in multi-tool composition, long-horizon plan execution, and adaptation to diverse visual operations (Zhu et al., 16 Mar 2026).

1. Conceptual scope and motivation

VTC-Bench was introduced to address a perceived gap in multimodal evaluation: standard MLLM benchmarks mostly test static perception and reasoning on fixed inputs, while earlier tool-use benchmarks often rely on a small number of tools and short, simple tool calls. In contrast, practical visual tasks may require restoring a degraded image, isolating a region, extracting features, measuring geometry, or verifying an inference before producing a final answer. VTC-Bench is designed to test that full workflow.

The benchmark therefore operationalizes visual agency as compositional visual tool chaining. In this setting, competence is not exhausted by recognizing objects or answering directly from an image. A capable model must formulate an execution strategy, select tools that match the task, configure them coherently, and maintain a sensible trajectory across multiple operations. This emphasis on trajectory quality is central to the benchmark’s design, because the benchmark treats visual tool use as a planning-and-execution problem rather than as a thin wrapper around ordinary image understanding (Zhu et al., 16 Mar 2026).

This suggests that VTC-Bench occupies a different evaluative niche from benchmarks centered on perception, OCR, or single-step function calling. A plausible implication is that it is intended not merely to score final answers, but to diagnose whether an MLLM can function as a visual agent in realistic computer vision pipelines.

2. Tool library and operational design

The benchmark’s core tool library consists of 32 diverse OpenCV-based visual operations. These are organized into four functional modules that mirror a human-like image-processing workflow: Geometry, Enhancement, Feature Extraction, and Drawing. The paper also notes that the appendix contains a slightly expanded taxonomy table listing 35 named OpenCV-based tools, while the main benchmark design emphasizes a curated set of 32 operations as the core library (Zhu et al., 16 Mar 2026).

Module	Representative operations
Geometry	resize, rotate, translate, flip, crop, zoom in, pyramid
Enhancement	convert color, in-range color, blur, denoise, threshold, morphology, histogram, adjust brightness, inpaint
Feature Extraction	canny edge detection, gradients, watershed, grabcut, floodfill, connected components, keypoint features, hough lines, hough circles, template match, DFT
Drawing	draw contours, approx poly, draw line, draw circle, contour area, arc length

The breadth of this tool set is an explicit design choice. The benchmark is intentionally built around extensive combinations of operations rather than a tiny set of helper functions, so that it can assess long-horizon, multi-step plan execution and not just isolated tool calls. This matters because models that appear competent with a narrow repertoire may still fail when they must adapt to unfamiliar operations or combine several heterogeneous functions into a coherent pipeline.
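To make the modular organization concrete, the table above can be encoded as a small lookup structure. The following is a minimal sketch, not the benchmark's actual interface: the snake_case tool identifiers, the `modules_spanned` helper, and the example chain are all hypothetical and exist only for illustration.

```python
# Representative operations per module, transcribed from the table above.
# The snake_case identifiers are illustrative, not the benchmark's API names.
TOOL_MODULES = {
    "Geometry": ["resize", "rotate", "translate", "flip", "crop", "zoom_in", "pyramid"],
    "Enhancement": ["convert_color", "in_range_color", "blur", "denoise", "threshold",
                    "morphology", "histogram", "adjust_brightness", "inpaint"],
    "Feature Extraction": ["canny_edge", "gradients", "watershed", "grabcut", "floodfill",
                           "connected_components", "keypoint_features", "hough_lines",
                           "hough_circles", "template_match", "dft"],
    "Drawing": ["draw_contours", "approx_poly", "draw_line", "draw_circle",
                "contour_area", "arc_length"],
}

# Reverse index: tool name -> module name.
MODULE_OF = {tool: module for module, tools in TOOL_MODULES.items() for tool in tools}


def modules_spanned(toolchain):
    """Return the modules a toolchain touches, in order of first appearance."""
    seen = []
    for tool in toolchain:
        module = MODULE_OF[tool]  # a KeyError flags a tool that is not in the table
        if module not in seen:
            seen.append(module)
    return seen


# Hypothetical chain for a counting task on a rotated, noisy image.
chain = ["rotate", "denoise", "threshold", "connected_components", "draw_contours"]
print(modules_spanned(chain))
# ['Geometry', 'Enhancement', 'Feature Extraction', 'Drawing']
```

A chain like this one crosses all four modules, which is the kind of heterogeneous composition the broad tool set is meant to require.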

The benchmark’s tool-centric perspective also distinguishes it from environments in which tool use is largely decorative. Here, the tools are meant to be instrumental to task completion, and controlled perturbations are introduced specifically to force their use.

3. Dataset structure and cognitive hierarchy

VTC-Bench contains 680 curated problems organized into a nine-category cognitive hierarchy. The nine task categories are Attention Focusing, Chart, Color, Counting, Math, Measurement, Perceptual Restoration, Robust OCR, and Spatial Reasoning. These are arranged into three progressive tiers that reflect increasing cognitive complexity (Zhu et al., 16 Mar 2026).

Tier	Included categories
Tier 1: Visual Perception Enhancement	Robust OCR, Perceptual Restoration, Attention Focusing
Tier 2: Quantitative Visual Estimation	Measurement, Color, Counting
Tier 3: Compositional Visual Reasoning	Chart, Math, Spatial Reasoning

The tiering is not merely organizational. Tier 1 emphasizes pre-processing and normalization under distortions such as noise, low light, haze, or orientation changes. Tier 2 requires fine-grained extraction of physical or visual quantities. Tier 3 requires multi-step logical deduction and the construction or verification of intermediate visual evidence. The hierarchy is therefore intended to span a spectrum from simple pre-processing to higher-level constructive reasoning.

The construction process is likewise part of the benchmark’s methodology. Problems are curated from web-crawled images and repurposed open-source datasets, then rewritten from a tool-centric perspective so that the model must infer a useful execution strategy rather than answer from memory. Controlled perturbations such as rotation, blur, haze, and overexposure are introduced to force tool use. The final dataset includes 538 multiple-choice questions and 142 open-ended questions, with an average toolchain length of 5.04, an average of 4.97 unique tools per problem, and 3,428 total tool calls (Zhu et al., 16 Mar 2026).

These statistics indicate that the benchmark is deliberately structured around nontrivial composition. A plausible implication is that short-horizon heuristics are unlikely to be sufficient except on a limited subset of the task distribution.
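The dataset statistics quoted above can be cross-checked with simple arithmetic. The sketch below assumes that the 3,428 total tool calls are counted over the reference trajectories of all 680 problems; the source does not state this explicitly, and the variable names are illustrative.

```python
# Figures as reported for the final dataset.
n_multiple_choice = 538
n_open_ended = 142
n_problems = 680
avg_chain_length = 5.04
total_tool_calls = 3428

# The two question formats should account for every problem.
assert n_multiple_choice + n_open_ended == n_problems

# Assumption: total calls are summed over all reference trajectories,
# so dividing by the problem count should recover the average length.
implied_avg = total_tool_calls / n_problems
print(f"implied average chain length: {implied_avg:.2f}")  # 5.04

# Share of the dataset that is multiple-choice.
print(f"multiple-choice share: {n_multiple_choice / n_problems:.1%}")  # 79.1%
```

Under that assumption the reported total and the reported average agree to two decimal places.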

4. Ground-truth trajectories and evaluation methodology

A defining feature of VTC-Bench is that every problem comes with a ground-truth execution trajectory. The reference trajectory specifies the minimal or intended sequence of tool calls needed to solve the problem. The paper defines an “Effective Toolchain” as the minimal sequence of tool calls that produces the final answer, obtained by backtracking from the final output to the original input image (Zhu et al., 16 Mar 2026).
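The backtracking idea behind the Effective Toolchain can be sketched as a reverse dependency walk. This is one possible reading under stated assumptions, not the authors' implementation: the `ToolCall` record, the artifact ids, and the convention that each call produces exactly one uniquely named output are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str       # name of the operation
    inputs: tuple   # ids of the artifacts the call consumes
    output: str     # id of the artifact the call produces (assumed unique)


def effective_toolchain(calls, final_output, source="input"):
    """Keep only the calls that the final output transitively depends on."""
    producer = {call.output: call for call in calls}
    needed, stack = set(), [final_output]
    while stack:
        artifact = stack.pop()
        if artifact == source or artifact in needed:
            continue  # reached the original image, or already visited
        needed.add(artifact)
        call = producer.get(artifact)
        if call is not None:
            stack.extend(call.inputs)  # walk back to whatever this call consumed
    # Preserve the original execution order of the surviving calls.
    return [call for call in calls if call.output in needed]


# Hypothetical trace: two calls ("histogram", "blur") never feed the answer.
trace = [
    ToolCall("rotate", ("input",), "a"),
    ToolCall("histogram", ("input",), "b"),
    ToolCall("denoise", ("a",), "c"),
    ToolCall("blur", ("c",), "d"),
    ToolCall("threshold", ("c",), "e"),
]
print([call.tool for call in effective_toolchain(trace, "e")])
# ['rotate', 'denoise', 'threshold']
```

In this toy trace, five calls were issued but only three lie on the path from the original image to the final output.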

This trajectory annotation allows the benchmark to evaluate not only answer correctness but also plan quality. Its primary accuracy metric is Average Pass Rate (APR), but it also reports Tool Call Rate (TCR), Mean Absolute Error (MAE) in chain length, and Tool Usage Efficiency, written as $Eff_{\text{tool}}$ in the paper. MAE compares the length of the predicted toolchain to the ground-truth length, while $Eff_{\text{tool}}$ measures how much of the model’s predicted sequence is actually effective.

The consequence is methodological rather than cosmetic. A model can be correct yet inefficient, and that inefficiency is itself scored. In an agentic setting, this matters because an overlong or redundant trajectory may indicate poor planning, weak tool discrimination, or lack of verification. Conversely, a trajectory that is too short may reveal premature termination or failure to perform necessary intermediate transformations.
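The two trajectory-level metrics described above can be illustrated with a short sketch. The exact formulas are not reproduced in this article, so the definitions below are assumptions: MAE is taken as the mean absolute difference between predicted and reference chain lengths, and tool usage efficiency as the fraction of a model's calls that belong to its effective toolchain. The function names are hypothetical.

```python
def chain_length_mae(pred_lengths, ref_lengths):
    """Mean absolute difference between predicted and reference chain lengths."""
    if len(pred_lengths) != len(ref_lengths):
        raise ValueError("one predicted length is required per reference length")
    total = sum(abs(p - r) for p, r in zip(pred_lengths, ref_lengths))
    return total / len(ref_lengths)


def tool_usage_efficiency(n_effective, n_predicted):
    """Assumed definition: share of issued calls that are actually effective."""
    if n_predicted == 0:
        return 0.0  # convention chosen here for a model that calls no tools
    return n_effective / n_predicted


# Hypothetical lengths: two chains are too short, one is too long.
print(chain_length_mae([2, 1, 9], [5, 4, 6]))  # 3.0

# Hypothetical model that issues 9 calls, of which 3 are effective.
print(round(tool_usage_efficiency(3, 9), 3))  # 0.333
```

Note that the absolute error penalizes the truncated chains and the overlong chain equally, while the efficiency score only exposes the redundancy of the overlong one.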

The benchmark uses two interaction paradigms: a code interface and a tool-call interface. For models that cannot natively use tools well, the authors use the Thyme framework to help them generate code. Final evaluation combines deterministic matching with GPT-4o-based judgment for open-ended outputs. The ground-truth trajectories also serve a diagnostic function, since they make it possible to compare the model’s realized plan against the intended path and thereby analyze execution behavior separately from endpoint accuracy (Zhu et al., 16 Mar 2026).
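The combination of deterministic matching and model-based judgment can be pictured as a simple routing step. The sketch below is hypothetical throughout: it assumes that multiple-choice answers are scored by exact option matching and that open-ended answers are passed to a judge, and the `problem` dictionary layout, `normalize_choice`, and the injected `judge` callable are illustrative stand-ins rather than the benchmark's code. A real judge would wrap a call to an LLM; a trivial stub is used here.

```python
def normalize_choice(text):
    """Reduce an answer such as ' (b) ' or 'B.' to the bare option letter 'B'."""
    return text.strip().strip("().").strip().upper()


def score_answer(problem, prediction, judge):
    """Route to exact matching or to a judge, depending on the question format."""
    if problem["kind"] == "multiple_choice":
        return normalize_choice(prediction) == normalize_choice(problem["answer"])
    # Open-ended: defer to the injected judge (an LLM call in a real setup).
    return bool(judge(problem["question"], problem["answer"], prediction))


def stub_judge(question, reference, prediction):
    """Stand-in judge: accepts a prediction that contains the reference string."""
    return reference.lower() in prediction.lower()


mc_problem = {"kind": "multiple_choice", "question": "Which bar is tallest?", "answer": "B"}
open_problem = {"kind": "open_ended", "question": "How wide is the part?", "answer": "12.5 cm"}

print(score_answer(mc_problem, " (b) ", stub_judge))                          # True
print(score_answer(open_problem, "The width is about 12.5 cm.", stub_judge))  # True
print(score_answer(open_problem, "Roughly 20 cm.", stub_judge))               # False
```

Passing the judge in as a parameter keeps the deterministic path testable without any model access.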

5. Empirical results and observed failure modes

The reported experiments evaluate 19 leading MLLMs, including proprietary tool-use models, proprietary general-purpose models, open-source tool-use models, and open-source general-purpose models. Overall performance is low across the benchmark, with scores ranging from 22.06% to 46.47% in the base setting. The strongest model overall is Gemini-3.0-Pro, which reaches 51.18% with tools, a figure the paper also rounds to about 51%. Gemini-3.0-Flash is also strong, and GPT-5.2 and GPT-4o gain substantially from tool use, while open-source models generally lag behind and often benefit far less from tool augmentation (Zhu et al., 16 Mar 2026).

The failure analysis identifies several recurrent patterns. First, models struggle with unseen operations and diverse tool sets. They tend to rely on a narrow subset of familiar functions, especially zoom, crop, rotate, histogram, and connected components, rather than selecting the tool that best matches the task. This is presented as evidence of weak generalization to operations not heavily represented in training.

Second, multi-tool composition remains a persistent bottleneck. Even when models can invoke individual tools, they often fail to assemble them into a coherent long-horizon plan. Third, planning efficiency is poor. Some strong models generate very long and redundant tool sequences. GPT-5.2 is reported to have a high tool-call rate but very low tool-use efficiency, meaning that it issues many calls of which few contribute to the final answer; GPT-o3 and Gemini-3.0-Pro also exhibit low efficiencies relative to the number of calls they make.

A complementary pathology is premature truncation. The analysis of distributional mismatch shows that predicted toolchains are often much shorter than needed, peaking at one or two steps where the reference trajectories often require four to six. This is especially visible in categories such as color and measurement, where models frequently terminate too early. Case studies identify two additional recurring failures: choosing the wrong tools and applying them incorrectly, or blindly trusting intermediate tool outputs without cross-checking them against the original image (Zhu et al., 16 Mar 2026).
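The kind of distributional mismatch described above can be summarized with two simple statistics: the most common chain length in each distribution and the share of problems on which the predicted chain is shorter than the reference. This is a minimal sketch with invented lengths; `modal_length` and `truncation_rate` are hypothetical helper names, not metrics defined in the paper.

```python
from collections import Counter


def modal_length(lengths):
    """Most frequent chain length (first-seen value wins a tie)."""
    return Counter(lengths).most_common(1)[0][0]


def truncation_rate(pred_lengths, ref_lengths):
    """Share of problems whose predicted chain is shorter than the reference."""
    pairs = list(zip(pred_lengths, ref_lengths))
    return sum(1 for pred, ref in pairs if pred < ref) / len(pairs)


# Invented per-problem chain lengths, for illustration only.
predicted = [1, 2, 1, 2, 1, 3]
reference = [5, 4, 6, 5, 4, 5]

print(modal_length(predicted), modal_length(reference))  # 1 5
print(truncation_rate(predicted, reference))             # 1.0
```

In this invented sample the predicted lengths peak at one step while the references peak at five, mirroring the pattern of early termination that the analysis reports.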

Taken together, these findings frame VTC-Bench as a diagnostic benchmark rather than merely a harder accuracy test. The benchmark is intended to reveal where visual agency breaks down: tool selection, chain composition, efficiency, and verification.

6. Name ambiguity and significance

The name “VTC-Bench” is not unique in the recent literature, which can cause confusion. In the VisualToolChain-Bench sense, it refers to the multimodal tool-chaining benchmark described above (Zhu et al., 16 Mar 2026). However, the same or a closely similar name has also been used in other subfields.

One such instance is a task-specific evaluation framework for visual token compression methods, also called VTC-Bench, which uses downsampling as a data filter to denoise existing multimodal benchmarks and evaluate compression-relevant samples more fairly (Liao et al., 8 Oct 2025). A separate benchmark, written as VTCBench, studies whether vision-language models can understand long context when text is compressed into images through vision-text compression; it evaluates retrieval, reasoning, and memory under compressed visual text and introduces VTCBench-Wild for varied rendering conditions (Zhao et al., 17 Dec 2025).

This naming overlap can obscure substantive differences. VisualToolChain-Bench is concerned with compositional visual tool use in MLLMs; the visual token compression framework is concerned with evaluation methodology for compressed visual tokens; VTCBench studies long-context understanding under vision-text compression. The similarity in names does not imply similarity in task design or scientific objective.

Within its own domain, VisualToolChain-Bench is positioned as a rigorous baseline and testbed for visual agentic capabilities, not just multimodal recognition. By requiring models to plan, compose, and verify multi-step visual toolchains, it exposes a gap between apparent tool-use competence and real operational skill. The benchmark’s broader significance lies in its claim that strong visual agents must integrate perception, planning, and tool execution in a coordinated way, and that current models remain limited in robust generalization to new tools, correct multi-step composition, and efficient execution aligned with a ground-truth reasoning trajectory (Zhu et al., 16 Mar 2026).