Octopus: Agentic Multimodal Reasoning
- The paper introduces a framework that decomposes multimodal reasoning into six atomic capabilities, enabling dynamic integration of perception, logic, spatial, and generative skills.
- It employs a two-stage agentic loop where a multimodal LLM selects and orchestrates specialized tools, leading to state-of-the-art performance on open-world, compositional tasks.
- Empirical evaluations show that removing any single capability, particularly logic, significantly reduces accuracy, underscoring the importance of coordinated capability orchestration.
Octopus is a research paradigm and system for agentic multimodal reasoning, formulated to address the limitations of prior vision-language and multimodal models in autonomous problem-solving. It provides a unifying framework that decomposes human-like reasoning into atomic capabilities, oversees their orchestration with an agentic loop, and demonstrates state-of-the-art performance on compositional, open-world tasks that require the dynamic integration of perception, logic, spatial geometry, augmentation, transformation, and generative abilities. Octopus integrates neural and symbolic components and has distinct instantiations both as embodied code-generating agents and as benchmark-oriented reasoning orchestrators, with rigorous evaluations showing the benefit of capability coordination across the multimodal reasoning spectrum (Guo et al., 19 Nov 2025, Yang et al., 2023).
1. Six-Capability Decomposition of Multimodal Reasoning
The Octopus framework defines agentic reasoning over multimodal data via six atomic capabilities, each associated with a distinct family of tools and functions. At each reasoning step $t$, the agent selects a capability $c_t$ from the core set $\mathcal{C} = \{\text{percept}, \text{aug}, \text{spatial}, \text{logic}, \text{transform}, \text{gen}\}$, described as follows:
- Percept (Fine-grained Visual Perception): structured extraction from pixels or regions, e.g., OCR and object grounding.
- Aug (Visual Augmentation & Marking): overlaying interpretable visual annotations or highlights on the image.
- Spatial (Spatial/Geometric Understanding): calculating geometric relations such as distances, angles, and intersections.
- Logic (Logical Programming Reasoning): symbolic code writing and execution for deduction, arithmetic, and algorithmic logic.
- Transform (Visual Transformation & Editing): segmenting, cropping, or otherwise manipulating image regions for focused analysis.
- Gen (Visual Creation & Generation): producing new or simplified images, e.g., generating diagrams from textual prompts.
The orchestration of these capabilities allows the agent to dynamically adapt to the requirements of complex tasks, bridging low-level extraction, structural reasoning, and imaginative generation (Guo et al., 19 Nov 2025).
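A minimal sketch of how this capability set could be represented as a registry is shown below; the enum values mirror the capability names used in Octopus-Bench, while `TOOL_REGISTRY` and `register_tool` are illustrative assumptions rather than the paper's API.

```python
from enum import Enum
from typing import Callable, Dict

class Capability(Enum):
    """The six atomic capabilities orchestrated by Octopus."""
    PERCEPT = "percept"      # fine-grained visual perception (OCR, object grounding)
    AUG = "aug"              # visual augmentation & marking (overlays, highlights)
    SPATIAL = "spatial"      # spatial / geometric understanding (distances, angles)
    LOGIC = "logic"          # logical programming reasoning (code synthesis + execution)
    TRANSFORM = "transform"  # visual transformation & editing (crop, segment)
    GEN = "gen"              # visual creation & generation (diagrams from prompts)

# Hypothetical registry binding each capability to a tool-invoking callable.
TOOL_REGISTRY: Dict[Capability, Callable[..., dict]] = {}

def register_tool(cap: Capability):
    """Decorator that registers a tool implementation under a capability."""
    def wrapper(fn: Callable[..., dict]) -> Callable[..., dict]:
        TOOL_REGISTRY[cap] = fn
        return fn
    return wrapper
```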
2. Architecture and Agentic Orchestration
Octopus deploys a two-stage agentic loop, consisting of high-level planning via a backbone multimodal LLM and tool invocation for capability realization:
- Planning and Capability Selection: The multimodal LLM (e.g., GPT-4o) receives the accumulated observations $\mathcal{O}_{t-1}$, the previously selected capabilities $c_{1:t-1}$, and the task $T$, and generates a reasoning sequence $r_t = \mathrm{LLM}(T, \mathcal{O}_{t-1}, c_{1:t-1})$ that explicitly selects a capability; the chosen $c_t$ is parsed from capability tags in $r_t$.
- Tool Invocation and State Update: Each capability is realized via a dedicated tool API. Upon tool execution, the resulting observation $o_t$ is appended to the evidence state, $\mathcal{O}_t = \mathcal{O}_{t-1} \cup \{o_t\}$, for the next turn.
- Termination: The agent terminates when the reasoning output $r_t$ ends with a designated termination token.
Pseudocode for the orchestration process is provided as Algorithm 1 in (Guo et al., 19 Nov 2025). This capability-first routing, rather than direct tool selection, yields greater compositionality and robustness.
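The following is a condensed sketch of such a two-stage loop in Python. The planner interface (`planner.generate`), the `<capability>` tag format, and the `[TERMINATE]` marker are assumptions standing in for the paper's Algorithm 1, not its exact specification.

```python
import re

def octopus_loop(planner, tools, task, images, max_steps=10):
    """Two-stage agentic loop: plan -> select capability -> invoke tool -> update state."""
    observations, capabilities = [], []
    reasoning = ""
    for _ in range(max_steps):
        # Stage 1: the multimodal LLM plans over the accumulated evidence
        # and emits a reasoning string containing a capability tag.
        reasoning = planner.generate(task=task, images=images,
                                     observations=observations,
                                     history=capabilities)
        if "[TERMINATE]" in reasoning:      # assumed termination marker
            return reasoning                 # final answer
        match = re.search(r"<capability>(\w+)</capability>", reasoning)
        if match is None:
            return reasoning                 # planner answered directly
        cap = match.group(1)
        # Stage 2: realize the capability with its dedicated tool API.
        obs = tools[cap](task=task, images=images, reasoning=reasoning)
        observations.append(obs)
        capabilities.append(cap)
    return reasoning
```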
3. Neural and Symbolic Integration
Octopus's agentic reasoning paradigm blends neural modules (VLMs, perception models) with symbolic and programmatic reasoning. For logical reasoning steps ($c_t = \text{logic}$), separate code-execution models (e.g., Claude 4.5 Sonnet) generate and run Python code fragments. Perceptual and spatial capabilities interface with OCR and geometry calculators (e.g., Gemini 2.5 Flash).
Observations and intermediate outputs are accumulated in an evidence set $\mathcal{O}_t$, allowing context-aware cross-capability reasoning and closed-loop planning. This neural–symbolic schema enables both “System-II”-style decomposition and “System-I”-style direct execution, reflecting cognitive models of human reasoning (Guo et al., 19 Nov 2025, Liu et al., 7 May 2025).
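To illustrate the symbolic side, a logic step might hand a generated snippet to a simple executor and fold the result back into the evidence set. This is a minimal sketch: `code_llm.write_code` and the `answer` convention are assumed interfaces, and a real deployment would sandbox execution.

```python
def run_logic_step(code_llm, problem: str, evidence: list) -> dict:
    """Logic capability: have a code-specialized LLM write a Python snippet,
    execute it, and fold the result back into the evidence set."""
    snippet = code_llm.write_code(problem=problem, evidence=evidence)  # assumed interface
    namespace: dict = {}
    try:
        # NOTE: plain exec for illustration only; a real system would isolate execution.
        exec(snippet, namespace)
        result = namespace.get("answer")   # convention: snippet stores its result in `answer`
    except Exception as err:
        result = f"execution error: {err}"
    observation = {"capability": "logic", "code": snippet, "result": result}
    evidence.append(observation)           # closed-loop: later planning steps see this
    return observation
```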
4. Evaluation Benchmarks and Empirical Performance
Octopus performance is measured using Octopus-Bench, built to assess each of the six capability dimensions. Benchmarks include BLINK, TIR-Bench, IsoBench, Geometry3K, MathVerse, WeMath, MathVista, Math-Vision, COMT, V*Bench, MMVP, and grid-based navigation.
A fragment of the benchmark-to-capability mapping is shown below:
| Benchmark | percept | aug | spatial | logic | transform | gen |
|---|---|---|---|---|---|---|
| BLINK–Count | ✓ | | | | | |
| TIR–Color | ✓ | | | | | |
| MazeSolver | ✓ | ✓ | | | | |
| IsoBench | ✓ | ✓ | | | | |
On Octopus-BLINK (14 subtasks), GPT-4o+Octopus achieves 71.8% average accuracy, outperforming GPT-4o+MMFactory by 2.9%. In Octopus-TIR, overall accuracy reaches 33.4%, versus 29.8% (Sketchpad) and 29.2% (PyVision). For visual mathematics, accuracies include IsoBench (79.2%), Geometry3K (48.2%), MathVerse (49.2%), WeMath (43.1%), MathVista (75.3%), Math-Vision (65.4%) (Guo et al., 19 Nov 2025).
Ablation studies reveal that removing any single capability results in a 5–10% accuracy decline; logic removal is most severe (up to 10%). Disabling the two-stage routing also reduces accuracy by ~5%.
Performance is robust across backbones such as GPT-4o, Gemini 2.5, Claude 3.5, Qwen2.5, and LLaVA-Next, indicating that the gains stem primarily from capability orchestration rather than from any particular backbone (Guo et al., 19 Nov 2025).
5. Embodied Octopus Systems: Vision–Language Programmers
An alternative instantiation appears in "Octopus: Embodied Vision-Language Programmer from Environmental Feedback" (Yang et al., 2023). Here, agentic multimodal reasoning is realized as code generation for acting in simulated or virtual environments. This framework operates as follows:
- Vision → Plan → Code Pipeline: The agent receives a natural-language instruction and multi-view images, emits natural-language subtask plans, and auto-generates executable Python code against simulator/robot APIs (a schematic rollout sketch follows this list).
- Architecture:
- Visual backbone: CLIP ViT-L/14.
- Language decoder: MPT-7B with Flamingo-style cross-modal fusion.
- Reward model: CodeLLaMA-7B scalar head.
- Training:
- Supervised fine-tuning with rollout data from GPT-4.
- Reinforcement Learning with Environmental Feedback (RLEF): Policy optimization by PPO, reward is environmental task success.
- Tasks:
- OctoGibson: 476 indoor tasks, simulated egocentric multi-view, routine and reasoning types.
- OctoGTA: 20 video game tasks in GTA-V.
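A compact sketch of the vision → plan → code → feedback cycle is given below; `vlm.plan_and_code`, `env.render_multiview`, `env.execute`, and the binary reward are assumed interfaces, not the released Octopus codebase.

```python
def embodied_rollout(vlm, env, instruction: str, max_subtasks: int = 8):
    """Vision -> plan -> code rollout with environmental feedback.
    Successful/failed executions yield the reward signal used by RLEF (PPO)."""
    trajectory = []
    for _ in range(max_subtasks):
        images = env.render_multiview()                      # egocentric multi-view observation
        plan, code = vlm.plan_and_code(instruction, images)  # subtask plan + executable code
        success, feedback = env.execute(code)                # run generated code on simulator APIs
        trajectory.append({
            "plan": plan,
            "code": code,
            "reward": 1.0 if success else 0.0,               # environmental task success
            "feedback": feedback,                            # informs the next planning step
        })
        if success and env.task_complete():
            break
    return trajectory  # reward-labeled rollouts drive PPO updates in the RLEF stage
```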
Octopus (SFT+RLEF) attains a gain of about +3 percentage points over SFT-only on unseen reasoning tasks and surpasses prior approaches such as TAPA, EmbodiedGPT, and blind (text-only) code-generation LLM baselines (Yang et al., 2023).
A closely related theme is the system-level agentic and neuroscience-aligned approach described in (Liu et al., 7 May 2025), which advocates hybrid, recursive, and multistep reasoning integrating perception, logic, spatial/temporal manipulation, and interaction.
6. Illustrative Reasoning Traces
A canonical example is maze navigation in the Octopus-Bench suite (Guo et al., 19 Nov 2025); the stepwise invocation proceeds as follows:
- Gen: produce a simplified maze diagram via generative visual tools.
- Percept: extract agent, goal, and barrier positions from the diagram.
- Logic: synthesize and execute code (e.g., “A* search”) to compute the shortest path in the grid.
- Final answer: the computed path is emitted as an action sequence of grid moves.
This demonstrates the dynamic, mixed-paradigm nature of Octopus’s agentic reasoning and the integration of neural, symbolic, and code-based subskills.
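For concreteness, the following is a minimal sketch of the kind of path-finding code the logic step might synthesize and execute for this maze. It uses BFS rather than A* (equivalent shortest paths on a small unweighted grid), and the grid encoding and move vocabulary are assumptions.

```python
from collections import deque

def shortest_path(grid, start, goal):
    """BFS over a 4-connected grid; '#' cells are walls.
    Returns a move sequence such as ['down', 'right', ...]."""
    moves = {(-1, 0): "up", (1, 0): "down", (0, -1): "left", (0, 1): "right"}
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for (dr, dc), name in moves.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [name]))
    return None  # goal unreachable

# Example: positions as the perception step might extract them.
maze = ["S..#",
        ".#..",
        "...G"]
print(shortest_path(maze, (0, 0), (2, 3)))  # e.g. ['down', 'down', 'right', 'right', 'right']
```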
7. Contributions, Significance, and Future Directions
Octopus establishes a capability-first, agentic paradigm for multimodal reasoning. By formulating atomic cognitive skills and orchestrating them via explicit reasoning over a stateful context, it achieves substantial improvements in adaptability and accuracy for compositional, real-world tasks. The empirical ablations substantiate that code-based (“logic”) and geometric skills are indispensable for open-domain multimodal problems.
Known limitations include difficulty with long or deeply nested generated code, evaluation that remains largely simulation-bound, and reliance on current tool APIs. Directions for future research involve scaling to hierarchical task graphs, incorporating video and temporal context, tighter integration with code-centric LLMs, and real-world deployment with robust perception and safety validation (Yang et al., 2023, Guo et al., 19 Nov 2025).
A plausible implication is the growing evidence that modular, orchestrated agentic frameworks—rooted in both biological and engineering design—are essential for achieving generalizable, cognitively-aligned multimodal AI (Liu et al., 7 May 2025).