OWMM-Agent: Multi-Agent Systems in Robotics & Ontologies
- OWMM-Agent names two distinct modular agentic systems: one for open-world mobile manipulation and one for ontology matching.
- The mobile manipulation system fine-tunes a vision-language model with an autoregressive policy head on synthetically generated data to drive hierarchical planning and control.
- The ontology matching system orchestrates Siamese agents and prompt-based tools to fuse multimodal evidence for semantic alignment across complex domains.
OWMM-Agent refers to two distinct high-performance multi-agent systems developed for (1) open-world mobile robot manipulation (Chen et al., 4 Jun 2025), and (2) ontology working and matching (Qiang et al., 2023). Both interpretations derive from the acronym and share a theme of modular, agentic orchestration leveraging recent advances in large-scale neural models. The following article surveys both usages, with technical specifics presented for each system as documented in the respective primary sources.
1. OWMM-Agent for Open World Mobile Manipulation
OWMM-Agent, as described in "OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis" (Chen et al., 4 Jun 2025), is a unified, multi-modal agentic system designed to enable robust, generalizable execution of mobile manipulation tasks by robots in open-world environments. The system is architected around a dedicated foundation model, OWMM-VLM, which fuses vision-language reasoning with classical robotic planning for hierarchical decision-making under diverse instructions and scenarios.
1.1 System Components and Architecture
OWMM-Agent consists of the following components:
- OWMM-VLM: A fine-tuned vision-language model based on InternVL-2.5, featuring a frozen ViT, a 2-layer trainable projection MLP, a large language model (LLM) that processes serialized instructions and robot history, multimodal fusion layers with cross-attention, and an autoregressive policy head emitting high-level action JSON sequences.
- Memory Structure:
- Long-term scene memory: A pose-graph built via a pre-mapping phase, associating nodes (poses) with static RGB scene frames for global context.
- Transient agent state memory: A natural-language action-summary history $h_t$, appended after each high-level action, encoding task-relevant state features.
- Policy Interface: At every timestep $t$, the policy maps the instruction, the visual inputs, and the history $h_{t-1}$ to an action $a_t$, where $a_t$ is a structured action (e.g., search-scene, navigate-to-point, pick, place), and then updates the state history to $h_t$.
- Low-Level Control: Structured actions are dispatched through a function-calling API to modular planners/controllers (e.g., MoveTo, Pick, Place), responsible for navigation, manipulation, and scene-querying at the actuator level.
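The policy-to-control hand-off described above can be illustrated with a minimal sketch. The JSON schema (an `action` field plus one argument field), the skill functions, and the `SKILLS` registry are all hypothetical names chosen for illustration; the source does not specify the actual schema or API.

```python
import json

# Hypothetical low-level skills; real planners/controllers would go here.
def navigate_to_point(target):
    return f"navigated to {target}"

def pick(target):
    return f"picked object at {target}"

def place(target):
    return f"placed object at {target}"

def search_scene(frame_index):
    return f"selected scene frame {frame_index}"

# Registry mapping an action name to (handler, name of its argument field).
SKILLS = {
    "navigate_to_point": (navigate_to_point, "target"),
    "pick": (pick, "target"),
    "place": (place, "target"),
    "search_scene": (search_scene, "frame_index"),
}

def dispatch(action_json, history):
    """Parse one high-level action, run the matching skill, log a summary."""
    action = json.loads(action_json)
    name = action["action"]
    if name not in SKILLS:
        raise ValueError(f"unknown action: {name}")
    handler, arg_key = SKILLS[name]
    summary = handler(action[arg_key])
    history.append(summary)  # transient state memory grows by one entry
    return summary

history = []
dispatch('{"action": "search_scene", "frame_index": 2}', history)
dispatch('{"action": "navigate_to_point", "target": [312, 240]}', history)
dispatch('{"action": "pick", "target": [150, 200]}', history)
```

Keeping the registry separate from the parser means a new skill can be added without touching the dispatch logic, which is one plausible reason to route actions through a function-calling layer.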
1.2 Agentic Data Synthesis
OWMM-Agent is enabled by a large-scale synthetic data generation pipeline in Habitat. This pipeline comprises:
- Episode Script Generation: PDDL templates define scene graphs, object/receptacle assignments, and pick-and-place objectives.
- Simulation Rollout: A Fetch robot executes navigation and manipulation, with RGB-D and state trajectories logged.
- Data Serialization: Textual instructions, per-step robot histories, and ground-truth action JSON are generated.
- Automatic Filtering/Augmentation: Invalid or unreachable steps are omitted; language variety is introduced through GPT-4o-driven paraphrasing.
This pipeline produces 21k episodes totaling 235k (image, history, action) samples, supporting high-scale instruction fine-tuning of the VLM.
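As a rough illustration of the serialization and filtering steps, the sketch below turns one simulated episode into (image, history, action) samples. The field names (`valid`, `image`, `summary`, `action`) and the choice to exclude dropped steps from the history are assumptions made for this example, not details taken from the source.

```python
import json

def build_samples(instruction, steps):
    """Serialize valid steps of one episode into training samples."""
    samples = []
    history = []
    for step in steps:
        if not step["valid"]:
            continue  # omit invalid or unreachable steps
        samples.append({
            "instruction": instruction,
            "image": step["image"],
            "history": list(history),  # history *before* this action
            "action": json.dumps(step["action"], sort_keys=True),
        })
        history.append(step["summary"])
    return samples

episode = [
    {"valid": True, "image": "f0.png", "summary": "found the mug",
     "action": {"action": "search_scene", "frame_index": 1}},
    {"valid": False, "image": "f1.png", "summary": "blocked path",
     "action": {"action": "navigate_to_point", "target": [5, 5]}},
    {"valid": True, "image": "f2.png", "summary": "reached the mug",
     "action": {"action": "navigate_to_point", "target": [310, 220]}},
]
samples = build_samples("move the mug to the sink", episode)
```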
1.3 Function Calling and Control
OWMM-VLM emits high-level actions as JSON, which are parsed and executed via function calls covering navigation, manipulation, and scene search. The mapping from visual predictions to world or image-frame coordinates for actions leverages explicit camera and depth geometry.
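One standard way to map a predicted pixel to a world coordinate is pinhole back-projection followed by a rigid transform. The sketch below assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a known 4x4 camera-to-world matrix; the numbers are made up, and the source does not state which exact geometry pipeline the system uses.

```python
def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def camera_to_world(point, T):
    """Apply a 4x4 homogeneous transform T (row-major nested lists)."""
    x, y, z = point
    return tuple(
        T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3)
    )

# Illustrative values: 640x480 image, principal point at the image center.
cam_point = pixel_to_camera(u=420, v=240, depth=2.0,
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# Camera pose with identity rotation and translation (1.0, 0.0, 0.5).
T_world_cam = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],
]
world_point = camera_to_world(cam_point, T_world_cam)
# world_point is approximately (1.4, 0.0, 2.5)
```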
1.4 Experimental Results
Empirical performance is benchmarked on single-step and episodic manipulation tasks, with principal metrics:
- Decision accuracy: Up to 97.85% (action classification), 87.54% (scene-frame selection).
- Affordance grounding: Normalized mean distances for object (0.97), receptacle (0.94), and navigation points (0.88).
- End-to-end episodic success: 21.9% (strict) and 51.5% (lenient) on standard OWMM benchmarks; substantial outperformance of GPT-4o and modular control agents.
- Real-world zero-shot: 90% total single-step accuracy on a Fetch robot; task-specific sub-metrics consistently favor OWMM-VLM-38B over baseline LLM-centric approaches.
Ablation studies reveal that bounding box prediction, inclusion of chain-of-thought reasoning, and scaled data synthesis drive observed performance gains.
1.5 Strengths, Limitations, and Future Extension
OWMM-Agent achieves unified, multi-view multi-modal planning without heavy 3D mapping, supports high-level reasoning via text-based action summarization, and realizes strong zero-shot sim-to-real transfer. Known limits include required pre-mapping, embodiment dependence, and dexterity constraints (suction/simple grasp only). Future directions include tactile feedback integration, lifelong fine-tuning, dynamic scene re-mapping, and cross-embodiment adaptation.
2. OWMM-Agent for Ontology Working and Matching
OWMM-Agent, in the context of ontology matching, refers to an Ontology Working and Matching Multi-Agent architecture adopting the Agent-OM framework as detailed in (Qiang et al., 2023). The system orchestrates ontology alignment and entity matching through Siamese agent modules and prompt-orchestrated tools driven by a single LLM backbone.
2.1 System Architecture
OWMM-Agent consists of:
- Retrieval Agent: Extracts, enriches, and stores semantic entity descriptions with multimodal enrichment (metadata, lexical, graphical), embedding entity information into a hybrid memory structure (PostgreSQL + vector database).
- Matching Agent: Performs candidate search, ranking (via similarity scores and multi-evidence fusion), LLM-backed validation, and bidirectional merging to ensure matching consistency.
- Shared Hybrid Memory: Separates relational metadata from vector-embedded content, serving as a central communication substrate.
- Prompt-based Toolset: Implements extractors, retrievers, validators, summarizers, and merging utilities, callable via structured LLM prompts or API hooks.
2.2 Workflow and Matching Methodology
- Entity Indexing: Entities from ontologies are indexed; embeddings derived and stored for similarity-based retrieval.
- Candidate Selection: For each entity $e_s$ in the source ontology $O_s$, retrieve the top-$k$ candidates from the target ontology $O_t$ with highest cosine similarity.
- Matching Pipeline: Applies hybrid database search, multi-channel matchers (initial, lexical, graphical), Reciprocal Rank Fusion (RRF), LLM-based validation, and result merging across the $O_s \to O_t$ and $O_t \to O_s$ passes.
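- Probability Model: Optionally, a logistic regression model predicts matching likelihood given embedding features and evidence deltas:
$$P(\text{match} \mid \mathbf{x}, \Delta) = \sigma\left(\mathbf{w}^\top \mathbf{x} + \mathbf{v}^\top \Delta + b\right)$$
where $\mathbf{x}$ denotes the embedding features, $\Delta$ encodes evidence disparities, $\sigma$ is the logistic sigmoid, and $\mathbf{w}$, $\mathbf{v}$, $b$ are learned parameters.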
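The candidate selection, rank fusion, and bidirectional merging steps can be sketched in a few lines. This is a toy illustration under stated assumptions: the embeddings are invented two-dimensional vectors, the RRF constant `k=60` is the value conventionally used in the literature rather than one confirmed by the source, and merging is modeled as keeping only pairs found in both search directions, which is one plausible reading of "bidirectional merging".

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, target_vecs, k):
    """Names of the k target entities most similar to the query."""
    ranked = sorted(target_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(item) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def merge_bidirectional(s2t, t2s):
    """Keep (source, target) pairs confirmed by both search directions."""
    return {(s, t) for s, t in s2t.items() if t2s.get(t) == s}

# Toy target ontology embeddings (hypothetical values).
targets = {"Author": [1.0, 0.0], "Paper": [0.0, 1.0], "Writer": [0.9, 0.1]}
candidates = top_k([1.0, 0.05], targets, k=2)

# One ranking per matcher channel: initial, lexical, graphical.
fused = rrf([
    ["Author", "Writer", "Paper"],
    ["Writer", "Author", "Paper"],
    ["Author", "Paper", "Writer"],
])

merged = merge_bidirectional(
    {"Writer": "Author", "Article": "Paper"},   # source -> target pass
    {"Author": "Writer", "Paper": "Document"},  # target -> source pass
)
```

RRF uses only rank positions, so channels whose raw scores live on different scales can be fused without calibration.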
2.3 Toolset and Integration
Tool coverage includes:
- Ontology Parser (OWL API, rdflib)
- SPARQL interface
- Various prompt-wrapped retrievers and matchers
- Hybrid memory storage and CRUD wrappers (PostgreSQL + pgvector)
- RRF-based match summarization
- LLM-based matching validation
The entire system executes via prompt templates or direct API invocation, compatible with frameworks such as LangChain.
2.4 Evaluation and Benchmarks
OWMM-Agent is evaluated on Ontology Alignment Evaluation Initiative (OAEI) benchmarks:
| Track | Year | F1 Score (Rank) | Notable Performance Patterns |
|---|---|---|---|
| Conference | 2022 | ≈0.83 (3/13) | Few-shot, Type 1 naming, high precision |
| Conference | 2023 | ≈0.82 (5/12) | – |
| Anatomy (Trivial) | 2022 | ≈0.95 (3/11) | Comparable to deep learning systems |
| Anatomy (Trivial) | 2023 | ≈0.95 (4/10) | – |
| Anatomy (Non-trivial) | – | ≈0.77 | Outperforms 9/11 traditional OM systems |
| MSE Track | Varies | 0.80–0.88 (top-tier) | Robust to subsumption noise, abbreviations, hierarchy |
Precision and recall are defined as:
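$$\text{Precision} = \frac{|A \cap R|}{|A|}, \qquad \text{Recall} = \frac{|A \cap R|}{|R|}$$
where $A$ is the set of alignments produced by the agent and $R$ is the reference alignment.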
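As a worked example of these metrics, the sketch below scores a set of agent alignments against a reference alignment and also computes F1 as the harmonic mean of precision and recall, which is the quantity reported in the table. The entity pairs are invented for illustration.

```python
def evaluate(alignment, reference):
    """Return (precision, recall, f1) for two sets of (source, target) pairs."""
    true_positives = len(alignment & reference)
    precision = true_positives / len(alignment) if alignment else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical alignments: 4 predicted pairs, 3 of them correct, 5 in reference.
agent = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"),
             ("a4", "b4"), ("a5", "b5")}

precision, recall, f1 = evaluate(agent, reference)
# precision = 3/4, recall = 3/5, f1 = 2/3
```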
2.5 Analysis of Few-Shot and Complex OM
OWMM-Agent for OM leverages retrieval-augmented generation (RAG), chain-of-thought (CoT) planning, and in-context exemplars, without requiring model fine-tuning. It demonstrates robustness to challenging settings such as few labeled alignments, deep hierarchies, noisy subsumption, and abbreviation-heavy domains by exploiting multi-evidence fusion and explicit bidirectional search/merging. Prompt engineering and coordinated agent memory ensure correct tool activation and alignment consistency across search directions.
3. Comparative Summary
Both instantiations of OWMM-Agent share characteristics of modular agent orchestration, hybrid memory architectures, structured toolsets for task decomposition, and reliance on foundation models for high-level reasoning. In open-world mobile manipulation, emphasis lies on cross-modal (vision-language) reasoning, hierarchical memory, policy-to-control abstraction, and large-scale synthetic data generation for fine-tuning (Chen et al., 4 Jun 2025). In ontology matching, the focus is on multi-agent prompt frameworks, evidence fusion, and LLM-backed semantic validation under minimal supervision (Qiang et al., 2023). Both achieve state-of-the-art or near state-of-the-art performance in their respective evaluation regimes, underscoring the generality and flexibility of agentic architectures built atop modern foundation models.
4. Limitations and Future Directions
OWMM-Agent systems face limitations particular to their domains: the manipulation system depends on pre-mapped scene graphs and embodiment-specific fine-tuning, while the ontology matching system depends on prompt design and retrieval thresholds. Both approaches are amenable to enhancements via online adaptation, lifelong and cross-embodiment learning, incorporation of auxiliary sensory feedback, and further scaling of synthetic or real-world data. A plausible implication is that these agentic paradigms may generalize across additional domains as prompt-tool and memory interfaces mature.
5. References
- "OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis" (Chen et al., 4 Jun 2025)
- "Agent-OM: Leveraging LLM Agents for Ontology Matching" (Qiang et al., 2023)