OWMM-Agent: Multi-Agent Systems in Robotics & Ontologies

Updated 10 April 2026

OWMM-Agent is a modular multi-agent system that integrates approaches for both open-world mobile manipulation and ontology matching.
In mobile manipulation, it employs a vision-language model with autoregressive policy and synthetic data generation to achieve high precision in hierarchical planning and control.
For ontology matching, it orchestrates Siamese agents and prompt-based tools to fuse multimodal evidence, ensuring robust semantic alignment across complex domains.

OWMM-Agent refers to two distinct high-performance multi-agent systems developed for (1) open-world mobile robot manipulation (Chen et al., 4 Jun 2025), and (2) ontology working and matching (Qiang et al., 2023). Both interpretations derive from the acronym and share a theme of modular, agentic orchestration leveraging recent advances in large-scale neural models. The following article surveys both usages, with technical specifics presented for each system as documented in the respective primary sources.

1. OWMM-Agent for Open World Mobile Manipulation

OWMM-Agent, as described in "OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis" (Chen et al., 4 Jun 2025), is a unified, multi-modal agentic system designed to enable robust, generalizable execution of mobile manipulation tasks by robots in open-world environments. The system is architected around a dedicated foundation model, OWMM-VLM, which fuses vision-language reasoning with classical robotic planning for hierarchical decision-making under diverse instructions and scenarios.

1.1 System Components and Architecture

OWMM-Agent consists of the following components:

OWMM-VLM: A fine-tuned vision-language model (VLM) based on InternVL-2.5, featuring a frozen ViT, a 2-layer trainable projection MLP, a large LLM (processing serialized instructions and robot history), multimodal fusion layers with cross-attention, and an autoregressive policy head emitting high-level action JSON sequences.
Memory Structure:
- Long-term scene memory: A pose-graph $G=(V,E)$ built via a pre-mapping phase, associating nodes (poses) with static RGB scene frames for global context.
- Transient agent state memory: A natural-language action-summary history $\mathcal{H}_t$, appended after each high-level action, encoding task-relevant state features.
Policy Interface: At every timestep $t$, the agent computes $(A_t, \mathcal{H}_t) = F_{vlm}(\mathcal{L}, G, I, I^c_t, \mathcal{H}_{t-1})$, where $A_t$ is a structured action (e.g., search-scene, navigate-to-point, pick, place) and $\mathcal{H}_t$ updates the state history.
Low-Level Control: Structured actions are dispatched through a function-calling API to modular planners/controllers (e.g., MoveTo, Pick, Place), responsible for navigation, manipulation, and scene-querying at the actuator level.
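The interplay of policy interface, memory, and low-level control listed above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `AgentMemory`, `stub_policy`, `CONTROLLERS`, and `run_episode` are hypothetical names, the policy is a scripted stand-in for $F_{vlm}$, the controllers only return strings, and the action names use underscores instead of the hyphenated forms in the text.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class AgentMemory:
    # Long-term memory: pose-graph node id -> label of its static scene frame.
    scene_frames: Dict[int, str]
    # Transient memory: natural-language summaries of past high-level actions.
    history: List[str] = field(default_factory=list)


def stub_policy(instruction: str, memory: AgentMemory, step: int) -> Tuple[str, str]:
    """Scripted stand-in for F_vlm: returns (action JSON, action summary)."""
    script = [
        ({"action": "search_scene", "frame_id": 1}, "Chose frame 1 as the likely object location."),
        ({"action": "navigate_to_point", "point": [120, 200]}, "Moved next to the table."),
        ({"action": "pick", "point": [64, 80]}, "Picked up the target object."),
    ]
    action, summary = script[step]
    return json.dumps(action), summary


# Hypothetical controllers; real ones would drive navigation and manipulation.
CONTROLLERS: Dict[str, Callable[..., str]] = {
    "search_scene": lambda frame_id: f"search_scene(frame={frame_id})",
    "navigate_to_point": lambda point: f"navigate_to_point({point})",
    "pick": lambda point: f"pick({point})",
    "place": lambda point: f"place({point})",
}


def run_episode(
    instruction: str,
    memory: AgentMemory,
    policy: Callable[[str, AgentMemory, int], Tuple[str, str]],
    controllers: Dict[str, Callable[..., str]],
    max_steps: int,
) -> List[str]:
    """Query the policy, dispatch each JSON action, and append to the history."""
    executed = []
    for t in range(max_steps):
        action_json, summary = policy(instruction, memory, t)
        action = json.loads(action_json)
        name = action.pop("action")
        if name not in controllers:
            raise ValueError(f"unknown action: {name}")
        executed.append(controllers[name](**action))
        memory.history.append(summary)  # H_t = H_{t-1} plus the new summary
    return executed
```

Keeping the history as plain text means the same serialized summaries can be fed back to the model at the next step without any extra state encoder.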

1.2 Agentic Data Synthesis

OWMM-Agent is enabled by a large-scale synthetic data generation pipeline in Habitat. This pipeline comprises:

Episode Script Generation: PDDL templates define scene graphs, object/receptacle assignments, and pick-and-place objectives.
Simulation Rollout: A Fetch robot executes navigation and manipulation, with RGB-D and state trajectories logged.
Data Serialization: Textual instructions, per-step robot histories, and ground-truth action JSON are generated.
Automatic Filtering/Augmentation: Invalid or unreachable steps are omitted; language variety is introduced through GPT-4o-driven paraphrasing.

This pipeline produces $\sim$21k episodes totaling $\sim$235k (image, history, action) samples, supporting large-scale instruction fine-tuning of the VLM.
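As a rough illustration of the serialization and filtering stages, the sketch below turns one logged rollout into (image, history, action) samples. The record layout and the field names (`valid`, `image`, `summary`, `action`) are assumptions made for this example; the source does not specify the on-disk format.

```python
import json
from typing import Dict, List


def serialize_episode(instruction: str, steps: List[Dict]) -> List[Dict]:
    """Build (image, history, action) samples from a rollout, skipping invalid steps."""
    samples: List[Dict] = []
    history: List[str] = []
    for step in steps:
        if not step.get("valid", True):
            continue  # filtering stage: drop invalid or unreachable steps
        samples.append({
            "instruction": instruction,
            "image": step["image"],
            "history": list(history),  # copy: history as seen *before* this action
            "action": json.dumps(step["action"], sort_keys=True),
        })
        history.append(step["summary"])
    return samples


rollout = [
    {"image": "f0.png", "action": {"action": "search_scene", "frame_id": 2},
     "summary": "Found the cup."},
    {"image": "f1.png", "action": {"action": "navigate_to_point", "point": [5, 9]},
     "summary": "Unreachable.", "valid": False},
    {"image": "f2.png", "action": {"action": "pick", "point": [40, 60]},
     "summary": "Picked the cup."},
]
samples = serialize_episode("Move the cup to the sink.", rollout)
```

Each sample pairs the history accumulated so far with the ground-truth action for the current frame, which is the shape a next-action fine-tuning objective would need.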

1.3 Function Calling and Control

OWMM-VLM emits high-level actions as JSON, which are parsed and executed via function calls covering navigation, manipulation, and scene search. The mapping from visual predictions to world or image-frame coordinates for actions leverages explicit camera and depth geometry.
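The mapping from an image-frame prediction to a world coordinate can be illustrated with standard pinhole back-projection. This is a sketch under stated assumptions: the source only says that camera and depth geometry are used, so the pinhole model, the intrinsics `fx, fy, cx, cy`, and the function names below are illustrative choices rather than the system's documented API.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def pixel_to_camera(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply the camera-to-world rigid transform p_w = R @ p_c + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Example: a predicted pick point 100 px right of the principal point, 2 m away.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = pixel_to_camera(420, 240, 2.0, fx=100.0, fy=100.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, identity, [1.0, 0.0, 0.0])
```

With these example numbers the point lies 2 m to the right of the optical axis in the camera frame, and the translation then shifts it by 1 m along the world x-axis.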

1.4 Experimental Results

Empirical performance is benchmarked on single-step and episodic manipulation tasks, with principal metrics:

Decision accuracy: Up to 97.85% (action classification), 87.54% (scene-frame selection).
Affordance grounding: Normalized mean distances for object (0.97), receptacle (0.94), and navigation points (0.88).
End-to-end episodic success: 21.9% (strict) and 51.5% (lenient) on standard OWMM benchmarks, substantially outperforming GPT-4o and modular control agents.
Real-world zero-shot: 90% total single-step accuracy on a Fetch robot; task-specific sub-metrics consistently favor OWMM-VLM-38B over baseline LLM-centric approaches.

Ablation studies reveal that bounding box prediction, inclusion of chain-of-thought reasoning, and scaled data synthesis drive observed performance gains.

1.5 Strengths, Limitations, and Future Extension

OWMM-Agent achieves unified, multi-view multi-modal planning without heavy 3D mapping, supports high-level reasoning via text-based action summarization, and realizes strong zero-shot sim-to-real transfer. Known limits include required pre-mapping, embodiment dependence, and dexterity constraints (suction/simple grasp only). Future directions include tactile feedback integration, lifelong fine-tuning, dynamic scene re-mapping, and cross-embodiment adaptation.

2. OWMM-Agent for Ontology Working and Matching

OWMM-Agent, in the context of ontology matching, refers to an Ontology Working and Matching Multi-Agent architecture that adopts the Agent-OM framework described by Qiang et al. (2023). The system orchestrates ontology alignment and entity matching through Siamese agent modules and prompt-orchestrated tools driven by a single LLM backbone.

2.1 System Architecture

OWMM-Agent consists of:

Retrieval Agent: Extracts, enriches, and stores semantic entity descriptions with multimodal enrichment (metadata, lexical, graphical), embedding entity information into a hybrid memory structure (PostgreSQL + vector database).
Matching Agent: Performs candidate search, ranking (via similarity scores and multi-evidence fusion), LLM-backed validation, and bidirectional merging to ensure matching consistency.
Shared Hybrid Memory: Separates relational metadata from vector-embedded content, serving as a central communication substrate.
Prompt-based Toolset: Implements extractors, retrievers, validators, summarizers, and merging utilities, callable via structured LLM prompts or API hooks.
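The bidirectional merging step of the Matching Agent can be sketched as a set intersection. This assumes that merging keeps only correspondences found in both search directions, which is one plausible reading of "bidirectional merging to ensure matching consistency"; the function name and the pair representation are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def merge_bidirectional(src_to_tgt: Set[Pair], tgt_to_src: Set[Pair]) -> Set[Pair]:
    """Keep (source, target) pairs confirmed by both search directions.

    src_to_tgt holds (source, target) pairs; tgt_to_src holds (target, source) pairs.
    """
    flipped = {(s, t) for (t, s) in tgt_to_src}
    return src_to_tgt & flipped


forward = {("cmt:Author", "conf:Writer"), ("cmt:Paper", "conf:Review")}
backward = {("conf:Writer", "cmt:Author"), ("conf:Article", "cmt:Paper")}
merged = merge_bidirectional(forward, backward)
```

Here only the `cmt:Author` / `conf:Writer` pair survives, since it is the only correspondence proposed from both sides. Intersection favors precision; a union-based merge would favor recall instead.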

2.2 Workflow and Matching Methodology

Entity Indexing: Entities from ontologies $O_s, O_t$ are indexed; embeddings $\phi(\cdot)$ derived and stored for similarity-based retrieval.
Candidate Selection: For each $e_s$ in $O_s$, retrieve the top-$k$ candidates from $O_t$ with the highest cosine similarity.
Matching Pipeline: Applies hybrid database search, multi-channel matchers (initial, lexical, graphical), Reciprocal Rank Fusion (RRF), LLM-based validation, and result merging across $O_s \to O_t$ and $O_t \to O_s$ passes.
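Probability Model: Optionally, a logistic regression model predicts matching likelihood for a candidate pair $(e_s, e_t)$, with $e_s \in O_s$ and $e_t \in O_t$, given embedding features and evidence deltas:

$P(\text{match} \mid e_s, e_t) = \sigma\left(w^\top [\phi(e_s); \phi(e_t); \Delta] + b\right)$

where $\sigma$ is the logistic function, $w$ and $b$ are learned parameters, and $\Delta$ encodes evidence disparities.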
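Candidate selection and rank fusion from the workflow above can be sketched in a few lines of pure Python. The sketch assumes toy two-dimensional embeddings and uses the conventional RRF form, score $= \sum 1/(c + \text{rank})$ with $c = 60$; the constant, the function names, and the example entities are assumptions for illustration, not values reported in the source.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Sequence[float], candidates: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k target entities most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
    return ranked[:k]


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Fuse several ranked lists; each item scores sum(1 / (c + rank)), rank from 1."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (c + rank)
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


target_embeddings = {"Author": [1.0, 0.0], "Paper": [0.0, 1.0], "Writer": [0.9, 0.1]}
candidates = top_k([1.0, 0.0], target_embeddings, k=2)

# One ranking per matcher channel (e.g., initial, lexical, graphical).
fused = reciprocal_rank_fusion([
    ["Author", "Writer", "Paper"],
    ["Writer", "Author", "Paper"],
    ["Writer", "Paper", "Author"],
])
```

RRF uses only rank positions, so channels whose raw scores live on different scales can be combined without calibration.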

2.3 Toolset and Integration

Tool coverage includes:

Ontology Parser (OWL API, rdflib)
SPARQL interface
Various prompt-wrapped retrievers and matchers
Hybrid memory storage and CRUD wrappers (PostgreSQL + pgvector)
RRF-based match summarization
LLM-based matching validation

The entire system executes via prompt templates or direct API invocation, compatible with frameworks such as LangChain.

2.4 Evaluation and Benchmarks

OWMM-Agent is evaluated on Ontology Alignment Evaluation Initiative (OAEI) benchmarks:

Track	Year	F1 Score (Rank)	Notable Performance Patterns
Conference	2022	≈0.83 (3/13)	Few-shot, Type 1 naming, high precision
Conference	2023	≈0.82 (5/12)	–
Anatomy (Trivial)	2022	≈0.95 (3/11)	Comparable to deep learning systems
Anatomy (Trivial)	2023	≈0.95 (4/10)	–
Anatomy (Non-trivial)	–	≈0.77	Outperforms 9/11 traditional OM systems
MSE Track	Varies	0.80–0.88 (top-tier)	Robust to subsumption noise, abbreviations, hierarchy

Precision and recall are defined as:

$\text{Precision} = \frac{|A \cap R|}{|A|}, \qquad \text{Recall} = \frac{|A \cap R|}{|R|}$

where $A$ is the set of agent alignments and $R$ is the reference alignment.
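These metrics, together with the F1 score reported in the table (the harmonic mean of precision and recall), can be computed directly from sets of correspondences. The sketch below is illustrative; the function name and the example pairs are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def evaluate(alignments: Set[Pair], reference: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted alignments against a reference."""
    correct = len(alignments & reference)
    precision = correct / len(alignments) if alignments else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4"), ("a5", "b5")}
precision, recall, f1 = evaluate(predicted, gold)
```

In this example three of four predicted pairs are correct and three of five reference pairs are found, giving precision 0.75, recall 0.6, and F1 of about 0.667.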

2.5 Analysis of Few-Shot and Complex OM

OWMM-Agent for OM leverages retrieval-augmented generation (RAG), chain-of-thought (CoT) planning, and in-context exemplars, without requiring model fine-tuning. It demonstrates robustness to challenging settings such as few labeled alignments, deep hierarchies, noisy subsumption, and abbreviation-heavy domains by exploiting multi-evidence fusion and explicit bidirectional search/merging. Prompt engineering and coordinated agent memory ensure correct tool activation and alignment consistency across search directions.

3. Comparative Summary

Both instantiations of OWMM-Agent share characteristics of modular agent orchestration, hybrid memory architectures, structured toolsets for task decomposition, and reliance on foundation models for high-level reasoning. In open-world mobile manipulation, emphasis lies on cross-modal (vision-language) reasoning, hierarchical memory, policy-to-control abstraction, and large-scale synthetic data generation for fine-tuning (Chen et al., 4 Jun 2025). In ontology matching, the focus is on multi-agent prompt frameworks, evidence fusion, and LLM-backed semantic validation under minimal supervision (Qiang et al., 2023). Both achieve state-of-the-art or near state-of-the-art performance in their respective evaluation regimes, underscoring the generality and flexibility of agentic architectures built atop modern foundation models.

4. Limitations and Future Directions

OWMM-Agent systems face limitations particular to their domains: the manipulation system depends on pre-mapped scene graphs and embodiment-specific fine-tuning, while the ontology matching system depends on prompt design and retrieval thresholds. Both approaches are amenable to enhancements via online adaptation, lifelong and cross-embodiment learning, incorporation of auxiliary sensory feedback, and further scaling of synthetic or real-world data. A plausible implication is that these agentic paradigms may generalize across additional domains as prompt-tool and memory interfaces mature.

5. References

"OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis" (Chen et al., 4 Jun 2025)
"Agent-OM: Leveraging LLM Agents for Ontology Matching" (Qiang et al., 2023)
