BEAR-Agent: Modular Multimodal Framework

Updated 24 May 2026

BEAR-Agent is a modular multimodal conversable agent that integrates pretrained vision tools, semantic 3D scene graphs, and affordance cues to enhance embodied reasoning.
It operates at inference time by orchestrating a frozen LLM backbone with explicit Python code execution and dialogue management for perception and planning.
On the BEAR benchmark it raises GPT-5 accuracy by 9.12 percentage points, and in simulated robotic manipulation it raises task success by roughly 20 percentage points, indicating that the gains transfer to downstream embodied tasks.

BEAR-Agent is a multimodal conversable agent introduced as a wrapper around large pretrained multimodal LLMs (MLLMs), designed to systematically enhance and evaluate the atomic embodied capabilities of such models. It orchestrates vision tools, semantic 3D scene graph construction, and affordance knowledge to improve performance on the BEAR benchmark, which targets the perception, spatial, and procedural abilities central to embodied intelligence. Unlike end-to-end finetuning approaches, BEAR-Agent operates entirely at inference time, integrating pretrained components through explicit tool use, code execution, and modular prompting. The architecture and workflow are presented in detail in Qi et al. (9 Oct 2025).

1. System Architecture

BEAR-Agent is conceptualized as a tool-using dialogue wrapper around a frozen MLLM backbone (such as GPT-5 or InternVL3-14B). The architecture consists of the following tightly coordinated modules:

Dialogue Manager: Maintains category-specific prompt templates and manages interaction flow, including the injection of Python helper signatures for each domain (e.g., detection(image, objects), extend_arrow_color(img, color="red") for trajectory tasks). Prompts conclude with explicit completion tokens (ANSWER: ... and TERMINATE).
Vision Tools: Employs externally pretrained vision models: GroundingDINO and Set-of-Mask (SoM) for object class, bounding-box, and mask extraction, and DepthAnything for pseudo-depth cues. The output is formally represented as $\{\,(c_i, b_i, m_i)\,\}_{i=1}^k = \mathrm{GroundingDINO}(I),\, m_i = \mathrm{SetOfMask}(I,\, b_i)$.
Semantic 3D Scene Graph: Constructs a JSON-based directed graph $G = (\mathcal{V}, \mathcal{E})$, with nodes $v \in \mathcal{V}$ representing detected objects and edges $(v_j \to v_k) \in \mathcal{E}$ encoding geometric or spatial relations.
Knowledge Base: Supplies procedural and affordance knowledge (e.g., “bottle caps unlock by CCW rotation”) in a hard-coded table, enabling semantic reasoning unavailable in pretrained perceptual models.
Tool-Execution Engine: Runs Python snippets emitted by the MLLM, appending output into the dialogue context. Execution is tightly sandboxed for safety.
Frozen Multimodal LLM: The core LLM accepts interleaved text and vision inputs, with new information surfaced as images or JSON graphs, integrated via cross-attention. No parameters in the LLM or adapters are finetuned during BEAR-Agent operation.
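
As an illustration of the scene-graph module, the sketch below builds a small JSON-serializable directed graph from a list of detections. The field names (`label`, `bbox`, `depth`), the relation labels `left_of` and `in_front_of`, the convention that a smaller depth value means closer to the camera, and the helpers `build_scene_graph` and `relations` are all assumptions made for this example; the source does not specify the schema used by BEAR-Agent.

```python
import json

def build_scene_graph(detections):
    """Build a directed graph G = (V, E) from detections.

    Each detection is a dict with a label, a box (x0, y0, x1, y1) and a
    pseudo-depth value. All field names here are hypothetical.
    """
    nodes = [
        {"id": i, "label": d["label"], "bbox": d["bbox"], "depth": d["depth"]}
        for i, d in enumerate(detections)
    ]
    edges = []
    for a in nodes:
        for b in nodes:
            if a["id"] == b["id"]:
                continue
            # Compare horizontal box centers for a left/right relation.
            a_cx = (a["bbox"][0] + a["bbox"][2]) / 2
            b_cx = (b["bbox"][0] + b["bbox"][2]) / 2
            if a_cx < b_cx:
                edges.append({"src": a["id"], "dst": b["id"], "rel": "left_of"})
            # Assumed convention: smaller depth means closer to the camera.
            if a["depth"] < b["depth"]:
                edges.append({"src": a["id"], "dst": b["id"], "rel": "in_front_of"})
    return {"nodes": nodes, "edges": edges}

def relations(graph, src_label, dst_label):
    """List the relation names on edges from src_label to dst_label."""
    label_of = {n["id"]: n["label"] for n in graph["nodes"]}
    return sorted(
        e["rel"] for e in graph["edges"]
        if label_of[e["src"]] == src_label and label_of[e["dst"]] == dst_label
    )

detections = [
    {"label": "mug", "bbox": [10, 40, 60, 90], "depth": 1.2},
    {"label": "laptop", "bbox": [100, 20, 220, 110], "depth": 2.0},
]
graph = build_scene_graph(detections)
graph_json = json.dumps(graph)  # text form that could be placed in the dialogue context
```

A spatial query such as "where is the mug relative to the laptop?" then reduces to `relations(graph, "mug", "laptop")`, which is the kind of edge lookup the spatial-reasoning heuristic relies on.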

The architecture is orchestrated as a dialogue loop, with tool-use, code generation, and multi-turn reasoning recurring until explicit TERMINATE tokens are encountered.

2. Multimodal Fusion Methodology

Input fusion in BEAR-Agent leverages the pretrained cross-modal adapter pipeline of the underlying MLLM. Each inference round forms an augmented input token sequence comprising:

A category- and context-dependent text prompt of length $T$;
$N_v$ vision embeddings $\mathbf{E}_v \in \mathbb{R}^{N_v \times d_v}$ derived from the vision tool outputs.

The vision embeddings are linearly projected into the LLM’s token space and concatenated with the text tokens: $[\,w_1, \dots, w_T,\, p_v(\mathbf{E}_v^1), \dots, p_v(\mathbf{E}_v^{N_v})\,] \in \mathbb{R}^{(T+N_v)\times d}$ with

$p_v(\mathbf{f}) = W_v\,\mathbf{f} + b_v,\quad W_v \in \mathbb{R}^{d \times d_v},\, b_v \in \mathbb{R}^d$

where $W_v$ and $b_v$ are fixed from pretraining. No gradients are computed or propagated; BEAR-Agent performs no model-parameter updates after pretraining.
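
To make the shapes concrete, the following minimal sketch applies the affine map $p_v(\mathbf{f}) = W_v\,\mathbf{f} + b_v$ to each vision embedding and appends the results to the text tokens. It uses plain Python lists and random values; the dimensions and the helper names `project` and `fuse` are illustrative assumptions, and a real MLLM performs this step inside its pretrained adapter.

```python
import random

def project(f, W, b):
    """Affine map p_v(f) = W f + b for one vision embedding f of size d_v."""
    return [
        sum(w_ij * f_j for w_ij, f_j in zip(row, f)) + b_i
        for row, b_i in zip(W, b)
    ]

def fuse(text_tokens, vision_embeddings, W, b):
    """Append N_v projected vision embeddings to T text embeddings."""
    return list(text_tokens) + [project(f, W, b) for f in vision_embeddings]

random.seed(0)
T, N_v, d, d_v = 3, 2, 4, 6  # illustrative sizes only

# Frozen parameters: created once and never updated.
W_v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(d)]
b_v = [0.0] * d

text = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
vision = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(N_v)]

sequence = fuse(text, vision, W_v, b_v)  # (T + N_v) rows, each of length d
```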

3. Training and Execution Paradigm

BEAR-Agent does not involve any task-specific finetuning, supervised learning, or reinforcement learning. All learning occurs during offline pretraining of the MLLM and vision tool components. During operation, BEAR-Agent:

Selects vision models (GroundingDINO, DepthAnything) and MLLMs (GPT-5, InternVL3-14B) as fixed inference-time backbones;
Employs hand-coded Python tool functions and procedural cues for each task domain;
Iteratively manages dialogue and tool execution according to static heuristics and prompt templates.

There are no loss functions, backpropagation steps, or hyperparameter schedules associated with BEAR-Agent itself.

4. Algorithmic Loop and Task Heuristics

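BEAR-Agent’s operational logic is an iterative dialogue loop (Algorithm 1): the frozen MLLM is prompted with the category-specific template, any Python snippet it emits is run by the tool-execution engine, the output is appended to the dialogue context, and the cycle repeats until the model emits ANSWER: ... followed by TERMINATE.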
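
The dialogue loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the `<code>...</code>` delimiter, the `TOOL OUTPUT:` prefix, the `max_turns` cap, and the scripted stand-in for the frozen MLLM are all hypothetical. Only the `detection(image, objects)` signature and the `ANSWER:` / `TERMINATE` tokens come from the description above, and `fake_detection` is a dummy implementation of that signature.

```python
import contextlib
import io
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)  # hypothetical delimiter
ANSWER_RE = re.compile(r"ANSWER:\s*(.+)")

def run_snippet(source, tools):
    """Run model-emitted Python and return what it printed.

    Plain exec() is NOT a sandbox; a real deployment would isolate this step.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(source, dict(tools))
    return buffer.getvalue().strip()

def dialogue_loop(model, prompt, tools, max_turns=5):
    """Alternate model turns and tool execution until TERMINATE appears."""
    context = [prompt]
    answer = None
    for _ in range(max_turns):
        reply = model(context)
        context.append(reply)
        found = ANSWER_RE.search(reply)
        if found:
            answer = found.group(1).strip()
        if "TERMINATE" in reply:
            break
        # Execute each emitted snippet and feed its output back to the model.
        for source in CODE_RE.findall(reply):
            context.append("TOOL OUTPUT: " + run_snippet(source, tools))
    return answer, context

def scripted_model(context):
    """Stand-in for the frozen MLLM: request a tool call once, then answer."""
    if not any(turn.startswith("TOOL OUTPUT:") for turn in context):
        return "<code>print(detection('img.png', ['mug']))</code>"
    return "ANSWER: B\nTERMINATE"

def fake_detection(image, objects):
    """Dummy detector returning one fixed box per requested object name."""
    return [{"label": name, "bbox": [0, 0, 10, 10]} for name in objects]

answer, transcript = dialogue_loop(
    scripted_model, "Which option is the mug?", {"detection": fake_detection}
)
```

The turn cap is a defensive choice for the sketch so that a model that never emits TERMINATE cannot loop forever; the source does not say whether BEAR-Agent uses one.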

Critical heuristics per atomic capability include:

Pointing: If no explicit tensors are returned, invoke detection(...) to extract object masks, then select the centroid of the relevant mask as the answer.
Trajectory: Cycle through candidate colors with extend_arrow_color and select the color whose extension intersects the relevant mask.
Spatial Reasoning: Consult edge relations in the 3D scene graph to infer directions or proximities, then answer with the appropriate multiple-choice label.
Planning: Use explicit temporal chain-of-thought prompts; if model outputs are uncertain, inject steering text such as “watch the last keyframe” or “scan for missed steps”.

These routines ensure category-appropriate chaining of perceptual, symbolic, and procedural cues within the dialogue loop.
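
The pointing and trajectory heuristics above can be sketched on a toy binary mask. The helpers `mask_centroid`, `ray_hits_mask`, and `pick_arrow_color` are hypothetical names introduced for this example; in particular, `ray_hits_mask` is a simple stand-in for arrow extension and is not the `extend_arrow_color` helper itself, whose internals the source does not describe.

```python
def mask_centroid(mask):
    """Return the (row, col) centroid of the nonzero pixels of a binary mask."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        return None
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)

def ray_hits_mask(mask, start, step, max_steps=100):
    """Walk from start in integer steps and report whether the mask is hit."""
    r, c = start
    dr, dc = step
    for _ in range(max_steps):
        r, c = r + dr, c + dc
        if not (0 <= r < len(mask) and 0 <= c < len(mask[0])):
            return False  # left the image without touching the mask
        if mask[r][c]:
            return True
    return False

def pick_arrow_color(mask, arrows):
    """Return the first color whose extended arrow intersects the mask."""
    for color, (start, step) in arrows.items():
        if ray_hits_mask(mask, start, step):
            return color
    return None

mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
arrows = {
    "red": ((0, 0), (1, 0)),   # extends down the first column
    "blue": ((1, 0), (0, 1)),  # extends right along row 1
}

point = mask_centroid(mask)             # pointing answer
color = pick_arrow_color(mask, arrows)  # trajectory answer
```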

5. Quantitative Evaluation and Gains

Empirical evaluation on the BEAR benchmark shows that BEAR-Agent yields statistically significant improvements over baseline MLLMs. The absolute gain in accuracy for GPT-5 on BEAR is +9.12 percentage points (pp), with a relative gain of +17.50%. For InternVL3-14B, the absolute gain is +2.31 pp, or +6.81% relative. Improvements are significant at $p < 0.01$ for GPT-5 and $p = 0.03$ for InternVL3-14B (paired t-test over 7 benchmark splits):

Model	$\mathrm{Acc}_{\mathrm{base}}$ (%)	$\mathrm{Acc}_{\mathrm{agent}}$ (%)	$\Delta_{\mathrm{abs}}$ (pp)	$\Delta_{\mathrm{rel}}$ (%)	$p$-value
GPT-5	52.17	61.29	+9.12	+17.50	<0.01
InternVL3-14B	33.93	36.24	+2.31	+6.81	0.03

All accuracy metrics, gain formulas, and significance values strictly follow the paper (Qi et al., 9 Oct 2025):

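$\Delta_{\mathrm{abs}} = \mathrm{Acc}_{\mathrm{agent}} - \mathrm{Acc}_{\mathrm{base}}$

$\Delta_{\mathrm{rel}} = \frac{\mathrm{Acc}_{\mathrm{agent}} - \mathrm{Acc}_{\mathrm{base}}}{\mathrm{Acc}_{\mathrm{base}}} \times 100\%$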
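
As a worked check, the snippet below recomputes the two gains for the InternVL3-14B row of the table. It assumes the conventional definitions: absolute gain is the difference between agent and baseline accuracy, and relative gain is that difference divided by the baseline accuracy. The function name `gains` is introduced only for this example.

```python
def gains(acc_base, acc_agent):
    """Return (absolute gain in pp, relative gain in %) for accuracies in %."""
    absolute = acc_agent - acc_base
    relative = 100.0 * absolute / acc_base
    return round(absolute, 2), round(relative, 2)

# InternVL3-14B row: baseline 33.93%, with BEAR-Agent 36.24%.
abs_gain, rel_gain = gains(33.93, 36.24)
print(abs_gain, rel_gain)  # 2.31 6.81
```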

6. Downstream Embodied Task Performance

When integrated, without retraining, into MOKA’s keypoint selection loop for simulated robot tasks (ManiSkill 2 tabletop manipulation), BEAR-Agent enables substantial improvements in success rates:

General grasping: baseline 65.3% → BEAR-Agent 87.4% ($\Delta$ = +21.1 pp)
Spatial pick-&-place: baseline 40.5% → BEAR-Agent 60.7% ($\Delta$ = +20.2 pp)
Part grasping: baseline 52.0% → BEAR-Agent 72.5% ($\Delta$ = +20.5 pp)

Averaged across three tasks, BEAR-Agent improves success by approximately +20.17 pp in simulated manipulation. This demonstrates a practical transfer of the improved embodied atomic capabilities (pointing, 3D spatial reasoning, trajectory analysis, and planning) to downstream robotic settings.

7. Modularity and Practical Implications

BEAR-Agent exemplifies a modular, inference-time orchestration paradigm wherein a frozen MLLM is augmented, not by retraining, but by explicit integration of perception, scene-graph, and procedural affordances. Its tool-use interface and prompt engineering methodology provide a practical framework for practitioners seeking to enhance embodied agent performance across diverse tasks without incurring the cost or complexity of end-to-end system retraining. The agent’s statistically significant gains, both on BEAR and in simulated environments, underline the efficacy of modular augmentation strategies for advancing embodied intelligence in LLM-centric architectures (Qi et al., 9 Oct 2025).


References

Qi et al. (2025). BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities.
