
Octopus-Bench: Evaluating Multimodal Reasoning

Updated 26 November 2025
  • Octopus-Bench is a capability-centric evaluation suite that decomposes multimodal reasoning into six atomic, human-analogous capabilities.
  • It aggregates and re-annotates leading multimodal datasets to assess visual, symbolic, and interactive reasoning tasks with fine-grained precision.
  • Empirical results demonstrate notable accuracy gains, highlighting the benefits of orchestrating distinct reasoning modules over direct tool usage.

Octopus-Bench is a capability-centric, fine-grained evaluation suite designed for systematic benchmarking of multimodal agentic reasoning systems. Developed in direct support of the "Octopus" paradigm, which emphasizes the autonomous orchestration of six core human-analogous reasoning capabilities, Octopus-Bench enables both aggregate and per-capability assessment over a broad spectrum of visual, symbolic, and interactive reasoning tasks. Its construction involves comprehensive resampling and re-annotation from leading multimodal datasets, with each example explicitly tagged for its relevant reasoning competencies, thus providing detailed insight into the strengths and limitations of agentic, tool-using multimodal systems (Guo et al., 19 Nov 2025).

1. Motivation and Conceptual Foundations

Octopus-Bench addresses a critical limitation in prior multimodal reasoning evaluation suites: the lack of alignment with the diverse, complementary skills that underpin robust human cognition. While earlier frameworks typically include only a subset of visual or linguistic tasks, Octopus-Bench systematically decomposes multimodal reasoning into six "atomic" capabilities, each reflecting an essential aspect of human problem-solving. This decomposition is informed by the observation that effective reasoning with visual data often requires flexible integration of perception, annotation, geometric logic, symbolic programming, image manipulation, and constructive imagination.

2. The Six Core Capabilities

Each instance in Octopus-Bench is tagged with one or more of the following six capability dimensions:

  1. Fine-grained Visual Perception: Extraction of structured cues such as pixel-level features, OCR-extracted text, bounding boxes, and attributes. Human analogue: direct, attentive seeing. Example tools: OCR, grounding_dino, region captioners.
  2. Visual Augmentation and Marking: Overlay of interpretable marks (highlights, arrows, bounding boxes) to externalize intermediate steps. Human analogue: annotation or diagram markup. Tools: highlight, arrow, bounding-box annotator.
  3. Spatial and Geometric Understanding: Reasoning about distances, angles, intersections, areas, and topological relations. Human analogue: geometric inference, visual measurement. Tools: geometry_calculator, geom_perp_intersect.
  4. Logical Programming Reasoning: Synthesis and execution of symbolic programs or algorithmic steps for precise solutions. Human analogue: writing equations or programmatic problem-solving. Tool: code_agent (Claude 4.5 wrapper).
  5. Visual Transformation and Editing: Operations like cropping or segmentation to isolate subproblems. Human analogue: focusing attention by extracting relevant detail. Tools: crop, SAM-based segmentation.
  6. Visual Creation and Generation: Generation of new sketches or abstracted diagrams for internal visualization or planning. Human analogue: drawing simplified diagrams or mental imagery. Tools: generate_image, simplify_image.

This taxonomy enables Octopus-Bench to test not only aggregate accuracy but also the stability and composition of networked cognitive capabilities within an agent.
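
To make the tagging scheme concrete, the sketch below encodes the six capabilities, their example tools, and a capability-tagged benchmark instance as simple Python structures. The enum members, the `EXAMPLE_TOOLS` mapping, and the `BenchInstance` fields are illustrative assumptions rather than the paper's actual schema; only the capability names and tool names come from the taxonomy above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Capability(Enum):
    """The six atomic, human-analogous capabilities defined by Octopus-Bench."""
    FINE_GRAINED_PERCEPTION = "fine_grained_visual_perception"
    VISUAL_AUGMENTATION = "visual_augmentation_and_marking"
    SPATIAL_GEOMETRIC = "spatial_and_geometric_understanding"
    LOGICAL_PROGRAMMING = "logical_programming_reasoning"
    VISUAL_TRANSFORMATION = "visual_transformation_and_editing"
    VISUAL_CREATION = "visual_creation_and_generation"


# Example tools listed for each capability in the taxonomy above.
EXAMPLE_TOOLS = {
    Capability.FINE_GRAINED_PERCEPTION: ["ocr", "grounding_dino", "region_captioner"],
    Capability.VISUAL_AUGMENTATION: ["highlight", "arrow", "bounding_box_annotator"],
    Capability.SPATIAL_GEOMETRIC: ["geometry_calculator", "geom_perp_intersect"],
    Capability.LOGICAL_PROGRAMMING: ["code_agent"],
    Capability.VISUAL_TRANSFORMATION: ["crop", "sam_segmentation"],
    Capability.VISUAL_CREATION: ["generate_image", "simplify_image"],
}


@dataclass
class BenchInstance:
    """A single benchmark example, tagged with one or more capabilities (hypothetical schema)."""
    instance_id: str
    source_dataset: str                  # e.g. "BLINK", "Geometry3K"
    question: str
    answer: str
    capabilities: list[Capability] = field(default_factory=list)
```

Tagging at the instance level, rather than at the dataset level, is what allows a single example to exercise several capabilities at once.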

3. Dataset Composition and Task Taxonomy

Octopus-Bench draws from a diverse array of openly available datasets and benchmarks. Each test instance is mapped to one or more capabilities as shown in the following overview (Editor's term: "capability mapping"):

| Capability | Major Sources | Example Count (rounded) |
|---|---|---|
| Fine-grained Visual Perception | BLINK, TIR-Bench, IsoBench | ~2,000 |
| Visual Augmentation and Marking | BLINK, TIR-Bench | ~1,200 |
| Spatial and Geometric Understanding | BLINK, Geometry3K | ~1,500 |
| Logical Programming Reasoning | Geometry3K, MathVerse, WeMath, MathVista, MATH-Vision | ~1,300 |
| Visual Transformation and Editing | IsoBench, MMVP, nav splits | ~900 |
| Visual Creation and Generation | COMT, V*Bench, FrozenLake benchmark | ~800 |

Notably, Octopus-Bench examples are resampled and re-annotated from eight primary sources—BLINK, TIR-Bench, IsoBench, Geometry3K, MathVerse, WeMath, MATH-Vision, and MathVista—and supplemented with interactive or long-horizon instances from COMT, V*Bench, MMVP, and a FrozenLake-style navigation suite.

Tasks spanning the benchmark include OCR, object attribute classification, visual correspondence, arrow and bounding-box annotation, depth estimation, geometric proof, programmatic solvers, algebraic equation solving, segmentation, sketch generation, and map creation for navigation.
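
Under the assumption that each resampled example carries capability tags (as in the hypothetical `BenchInstance` above), the per-capability subsets in the table could be built with a simple grouping pass like the sketch below; this is not the paper's construction pipeline, only an illustration of how overlapping tags let one instance contribute to several rows.

```python
from collections import defaultdict


def build_capability_splits(instances):
    """Group capability-tagged instances into per-capability evaluation pools.

    An instance tagged with several capabilities is placed in every pool it
    belongs to, so the per-capability counts in the table above can overlap.
    """
    splits = defaultdict(list)
    for inst in instances:
        for cap in inst.capabilities:
            splits[cap].append(inst)
    return dict(splits)


def split_sizes(splits):
    """Report pool sizes, mirroring the 'Example Count' column."""
    return {cap.value: len(pool) for cap, pool in splits.items()}
```

With this grouping in place, the per-capability scores described in Section 4 can be computed over each pool independently.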

4. Evaluation Protocol and Metrics

All evaluated systems, including closed-source MLLMs (GPT-4o, Gemini 2.5, Claude 3.5), open-source vision-LLMs (Qwen2.5-VL, LLaVA), pretrained models (DeepEyes, DeepSketcher, VTS-V), and agentic frameworks (Sketchpad, SoM, PyVision, MMFactory), are benchmarked under standardized inference conditions:

  • Context length is capped at 60% of each backbone’s maximum.
  • Up to 10 reasoning turns per instance.
  • Near-deterministic decoding: temperature τ = 0.3, top-p = 1.0.
  • No extra fine-tuning is performed on any backbone or tool model.
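
A minimal sketch of how these standardized conditions might be expressed as a single configuration object follows; the field names and the 60% context-cap helper are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Standardized inference settings applied to every benchmarked system (sketch)."""
    context_fraction: float = 0.60   # cap context at 60% of the backbone's maximum
    max_turns: int = 10              # reasoning turns allowed per instance
    temperature: float = 0.3         # near-deterministic decoding
    top_p: float = 1.0
    finetune_backbone: bool = False  # no extra fine-tuning of backbones or tool models

    def context_cap(self, backbone_max_tokens: int) -> int:
        """Token budget for a backbone with the given maximum context length."""
        return int(self.context_fraction * backbone_max_tokens)


# Example: a backbone with a 128k-token context window would be capped at 76,800 tokens.
print(InferenceConfig().context_cap(128_000))  # -> 76800
```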

The primary evaluation metric is Accuracy:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)$$

Per-capability scores are computed by pooling all instances tagged with a given capability, while the overall score is obtained by averaging across subtasks.
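
The metric and the two aggregation modes can be made concrete with the short sketch below: per-capability scores pool every instance tagged with a capability into one accuracy computation, while the overall score macro-averages across subtasks. The data layout (instances carrying `capabilities`, `answer`, and a `subtask` label) is an assumption for illustration.

```python
def accuracy(preds, golds):
    """Accuracy = (1/N) * sum_i 1(pred_i == gold_i)."""
    assert len(preds) == len(golds) and len(golds) > 0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def per_capability_accuracy(results, capability):
    """Pool every (instance, prediction) pair tagged with `capability` into one score."""
    pooled = [(pred, inst.answer) for inst, pred in results
              if capability in inst.capabilities]
    preds, golds = zip(*pooled)
    return accuracy(preds, golds)


def overall_accuracy(results):
    """Macro-average accuracy over subtasks (each subtask weighted equally)."""
    by_subtask = {}
    for inst, pred in results:
        by_subtask.setdefault(inst.subtask, []).append((pred, inst.answer))
    subtask_accs = [accuracy(*zip(*pairs)) for pairs in by_subtask.values()]
    return sum(subtask_accs) / len(subtask_accs)
```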

5. Empirical Results and Comparative Performance

Octopus-Bench provides both overall and fine-grained capability breakdowns. Key summary results for Octopus (GPT-4o backbone) and baseline systems are as follows:

Octopus-Blink (BLINK 14 categories, mean accuracy %):

| System | Avg (BLINK, 14 categories) | Δ vs. best competitor |
|---|---|---|
| GPT-4o | 65.4 | |
| GPT-4o+Sketchpad | 84.2 | |
| GPT-4o+MMFactory | 75.3 | |
| GPT-4o+Octopus | 90.21 | +2.94 vs. MMFactory |

Octopus-TIR (13 TIR-Bench tasks, mean accuracy %):

| System | All (mean accuracy %) | Δ vs. best competitor |
|---|---|---|
| GPT-4o | 17.2 | |
| GPT-4o+Sketchpad | 29.8 | |
| GPT-4o+PyVision | 29.2 | |
| GPT-4o+Octopus | 33.4 | +3.6 vs. Sketchpad |

Octopus-Math (six visual-math datasets, accuracy %):

| System | Avg (six visual-math datasets) | Δ vs. best competitor |
|---|---|---|
| GPT-4o | 47.1 | |
| Gemini 2.5-Pro | 49.8 | |
| GPT-4o+Octopus | 60.4 | +10.6 vs. Gemini 2.5-Pro |

These results demonstrate that Octopus achieves superior mean accuracies across the majority of sub-benchmarks, particularly excelling in agentic reasoning settings requiring dynamic capability orchestration (Guo et al., 19 Nov 2025).

6. Analytical Findings and Open Questions

Several key observations arise from the Octopus-Bench paper:

  • Uniformity of Skill Profile: Octopus exhibits notably uniform performance across all six core dimensions, in contrast to baseline models, with the most pronounced gains (+5–10%) in spatial/geometric and logical programming tasks, attributed to explicit orchestration.
  • Ablation Results: Removing the Logical Programming module causes the largest drop (≈10%) in overall performance, underscoring the pivotal role of algorithmic and symbolic reasoning in complex visual tasks.
  • Capability-First vs. Tool-First: Eliminating the intermediate “select capability” step and directly invoking tools degrades accuracy by 4–7%, indicating that human-like capability grouping stabilizes agentic decision-making (see the routing sketch after this list).
  • Current Strengths: The strongest areas for Octopus are multi-step visual reasoning involving cross-capability coordination (e.g., maze path planning, geometric proof, diagram correspondence).
  • Remaining Limitations: Areas such as super-resolution and fine-grained color classification remain challenging, suggesting that base-level perception modules and perceptual feedback mechanisms could benefit from further refinement.
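
To illustrate the capability-first versus tool-first contrast from the ablation above, the sketch below shows the two routing strategies side by side. The selector callables stand in for the agent's (LLM-driven) decisions and are purely illustrative; this is not the Octopus implementation.

```python
def capability_first_route(task, select_capability, select_tool, tools_by_capability):
    """Two-stage routing: first pick a capability, then a tool within that capability.

    Restricting the second choice to the selected capability's tool menu is the
    grouping that the ablation credits with stabilizing agentic decision-making.
    """
    capability = select_capability(task)                        # e.g. spatial/geometric
    tool = select_tool(task, tools_by_capability[capability])   # reduced decision space
    return capability, tool


def tool_first_route(task, select_tool, all_tools):
    """Flat routing: pick a tool directly from the full inventory.

    The ablation reports a 4-7% accuracy drop for this variant, consistent with
    the larger, unstructured decision space at each step.
    """
    return select_tool(task, all_tools)
```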

A plausible implication is that structuring multimodal reasoning systems around explicit, human-analogous capability dimensions, rather than around tool selection alone, may not only yield measurable improvements in accuracy but also produce agents whose stepwise reasoning and error profiles are more interpretable.

7. Significance and Future Directions

By enabling both aggregate and capability-specific evaluation, Octopus-Bench sets a new standard for systematic, interpretable benchmarking of agentic multimodal systems. The findings emphasize that orchestrating diverse cognitive modules—as opposed to monolithic or one-shot tool use—can drive substantial improvements in task flexibility, accuracy, and robustness. Ongoing challenges include refining perceptual modules for finer visual discrimination and expanding capability definitions to accommodate emerging task types in interactive and embodied AI scenarios (Guo et al., 19 Nov 2025).

References

  1. Guo et al., 19 Nov 2025.