OWMM-Agent: Multi-Agent Systems in Robotics & Ontologies

Updated 10 April 2026
  • OWMM-Agent names two distinct modular multi-agent systems: one for open-world mobile manipulation and one for ontology matching.
  • In mobile manipulation, it pairs a vision-language model that has an autoregressive policy head with synthetic data generation for hierarchical planning and control.
  • For ontology matching, it orchestrates Siamese agents and prompt-based tools that fuse multimodal evidence for semantic alignment across complex domains.

OWMM-Agent refers to two distinct multi-agent systems, developed for (1) open-world mobile robot manipulation (Chen et al., 4 Jun 2025) and (2) ontology working and matching (Qiang et al., 2023). Both interpretations derive from the acronym and share a theme of modular, agentic orchestration built on large neural models. This article surveys both usages, presenting the technical specifics of each system as documented in the respective primary sources.

1. OWMM-Agent for Open World Mobile Manipulation

OWMM-Agent, as described in "OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis" (Chen et al., 4 Jun 2025), is a unified, multi-modal agentic system designed to enable robust, generalizable execution of mobile manipulation tasks by robots in open-world environments. The system is architected around a dedicated foundation model, OWMM-VLM, which fuses vision-language reasoning with classical robotic planning for hierarchical decision-making under diverse instructions and scenarios.

1.1 System Components and Architecture

OWMM-Agent consists of the following components:

  • OWMM-VLM: A vision-language model fine-tuned from InternVL-2.5, featuring a frozen ViT encoder, a 2-layer trainable projection MLP, an LLM that processes serialized instructions and robot history, multimodal fusion layers with cross-attention, and an autoregressive policy head that emits high-level actions as JSON sequences.
  • Memory Structure:
    • Long-term scene memory: A pose-graph $G=(V,E)$ built via a pre-mapping phase, associating nodes (poses) with static RGB scene frames for global context.
    • Transient agent state memory: A natural-language action-summary history $\mathcal{H}_t$, appended after each high-level action, encoding task-relevant state features.
  • Policy Interface: At every timestep $t$, the agent computes $(A_t, \mathcal{H}_t) = F_{vlm}(\mathcal{L}, G, I, I^c_t, \mathcal{H}_{t-1})$, where $A_t$ is a structured action (e.g., search-scene, navigate-to-point, pick, place) and $\mathcal{H}_t$ updates the state history.
  • Low-Level Control: Structured actions are dispatched through a function-calling API to modular planners/controllers (e.g., MoveTo, Pick, Place), responsible for navigation, manipulation, and scene-querying at the actuator level.
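
The interplay of policy, state history, and function-calling dispatch can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `stub_policy` is a hypothetical stand-in for the VLM that replays a fixed script and ignores all image inputs, the JSON schema (`"action"` and `"args"` keys) is an assumption, and the three skill functions are placeholders for real controllers.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """Transient state memory: one natural-language summary per executed action."""
    history: List[str] = field(default_factory=list)


def stub_policy(instruction: str, history: List[str]) -> str:
    """Hypothetical stand-in for the VLM policy; replays a fixed script."""
    script = [
        {"action": "search-scene", "args": {"query": "mug"}},
        {"action": "navigate-to-point", "args": {"x": 1.0, "y": 2.5}},
        {"action": "pick", "args": {"target": "mug"}},
    ]
    return json.dumps(script[len(history)])


# Hypothetical low-level skills; real controllers would drive the robot.
def search_scene(query: str) -> str:
    return f"searched scene for {query}"


def navigate_to_point(x: float, y: float) -> str:
    return f"navigated to ({x:.1f}, {y:.1f})"


def pick(target: str) -> str:
    return f"picked {target}"


SKILLS: Dict[str, Callable[..., str]] = {
    "search-scene": search_scene,
    "navigate-to-point": navigate_to_point,
    "pick": pick,
}


def run_episode(instruction: str, steps: int) -> List[str]:
    """Query the policy, dispatch the action, and append its summary each step."""
    state = AgentState()
    for _ in range(steps):
        action = json.loads(stub_policy(instruction, state.history))
        skill = SKILLS.get(action["action"])
        if skill is None:
            raise ValueError(f"unknown action: {action['action']}")
        summary = skill(**action["args"])
        state.history.append(summary)  # the new history extends the previous one
    return state.history


history = run_episode("bring the mug to the table", steps=3)
print(history)
```

Keeping the history as plain text means the same record can be serialized straight back into the next prompt, which is the role the action-summary memory plays in the description above.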

1.2 Agentic Data Synthesis

OWMM-Agent is enabled by a large-scale synthetic data generation pipeline in Habitat. This pipeline comprises:

  1. Episode Script Generation: PDDL templates define scene graphs, object/receptacle assignments, and pick-and-place objectives.
  2. Simulation Rollout: A Fetch robot executes navigation and manipulation, with RGB-D and state trajectories logged.
  3. Data Serialization: Textual instructions, per-step robot histories, and ground-truth action JSON are generated.
  4. Automatic Filtering/Augmentation: Invalid or unreachable steps are omitted; language variety is introduced through GPT-4o-driven paraphrasing.

This pipeline produces roughly 21k episodes totaling roughly 235k (image, history, action) samples, supporting large-scale instruction fine-tuning of the VLM.

1.3 Function Calling and Control

OWMM-VLM emits high-level actions as JSON, which are parsed and executed via function calls covering navigation, manipulation, and scene search. The mapping from visual predictions to world or image-frame coordinates for actions leverages explicit camera and depth geometry.
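
As an illustration of the camera and depth geometry involved, the sketch below back-projects a predicted pixel with a measured depth into a 3D point using the standard pinhole camera model, then moves it into the world frame. The source does not spell out its exact formulation, so the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the example pose are all assumptions chosen for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Pinhole back-projection of pixel (u, v) at the given depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p: Point, rotation: List[Sequence[float]],
                    translation: Sequence[float]) -> Point:
    """Apply a rigid transform: world = R @ p + t, with R given as a 3x3 nested list."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics for a 640x480 image and an identity camera orientation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cam_point = backproject(u=400, v=300, depth=2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
world_point = camera_to_world(cam_point, identity, translation=(1.0, 0.0, 0.0))
print(cam_point, world_point)
```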

1.4 Experimental Results

Empirical performance is benchmarked on single-step and episodic manipulation tasks, with principal metrics:

  • Decision accuracy: Up to 97.85% (action classification), 87.54% (scene-frame selection).
  • Affordance grounding: Normalized mean distances for object (0.97), receptacle (0.94), and navigation points (0.88).
  • End-to-end episodic success: 21.9% (strict) and 51.5% (lenient) on standard OWMM benchmarks, substantially outperforming GPT-4o and modular control agents.
  • Real-world zero-shot: 90% total single-step accuracy on a Fetch robot; task-specific sub-metrics consistently favor OWMM-VLM-38B over baseline LLM-centric approaches.

Ablation studies reveal that bounding box prediction, inclusion of chain-of-thought reasoning, and scaled data synthesis drive observed performance gains.

1.5 Strengths, Limitations, and Future Extension

OWMM-Agent achieves unified, multi-view, multi-modal planning without heavy 3D mapping, supports high-level reasoning via text-based action summarization, and shows strong zero-shot sim-to-real transfer. Known limitations include the need for pre-mapping, embodiment dependence, and dexterity constraints (suction or simple grasps only). Future directions include tactile feedback integration, lifelong fine-tuning, dynamic scene re-mapping, and cross-embodiment adaptation.

2. OWMM-Agent for Ontology Working and Matching

OWMM-Agent, in the context of ontology matching, refers to an Ontology Working and Matching Multi-Agent architecture that adopts the Agent-OM framework detailed by Qiang et al. (2023). The system orchestrates ontology alignment and entity matching through Siamese agent modules and prompt-orchestrated tools driven by a single LLM backbone.

2.1 System Architecture

OWMM-Agent consists of:

  • Retrieval Agent: Extracts, enriches, and stores semantic entity descriptions with multimodal enrichment (metadata, lexical, graphical), embedding entity information into a hybrid memory structure (PostgreSQL + vector database).
  • Matching Agent: Performs candidate search, ranking (via similarity scores and multi-evidence fusion), LLM-backed validation, and bidirectional merging to ensure matching consistency.
  • Shared Hybrid Memory: Separates relational metadata from vector-embedded content, serving as a central communication substrate.
  • Prompt-based Toolset: Implements extractors, retrievers, validators, summarizers, and merging utilities, callable via structured LLM prompts or API hooks.

2.2 Workflow and Matching Methodology

  • Entity Indexing: Entities from the source and target ontologies $O_s, O_t$ are indexed; embeddings $\phi(\cdot)$ are derived and stored for similarity-based retrieval.
  • Candidate Selection: For each $e_s$ in $O_s$, retrieve the top-$k$ candidates from $O_t$ with the highest cosine similarity.
  • Matching Pipeline: Applies hybrid database search, multi-channel matchers (initial, lexical, graphical), Reciprocal Rank Fusion (RRF), LLM-based validation, and result merging across the $O_s \to O_t$ and $O_t \to O_s$ passes.
  • Probability Model: Optionally, a logistic regression model predicts matching likelihood given embedding features and evidence deltas:

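$P(\text{match} \mid e_s, e_t) = \sigma\left(\mathbf{w}^\top [\mathbf{f}(e_s, e_t); \Delta(e_s, e_t)] + b\right)$

where $e_s \in O_s$ and $e_t \in O_t$ are the candidate entities, $\mathbf{f}(e_s, e_t)$ collects the embedding features, $\sigma$ is the logistic function, and $\Delta(e_s, e_t)$ encodes evidence disparities.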
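
The candidate selection and fusion steps above can be sketched with the standard library alone. This is an illustrative sketch rather than the Agent-OM code: the toy embeddings and entity names are made up, the RRF constant `k=60` is the conventional default rather than a value taken from the source, and `merge_bidirectional` assumes that merging keeps only pairs confirmed in both search directions, which the source does not state explicitly.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query: Sequence[float], targets: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Names of the k target entities most similar to the query embedding."""
    ranked = sorted(targets, key=lambda name: cosine(query, targets[name]), reverse=True)
    return ranked[:k]


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over all lists."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


def merge_bidirectional(s2t: Dict[str, str], t2s: Dict[str, str]) -> List[Tuple[str, str]]:
    """Assumed merge rule: keep (source, target) pairs found in both directions."""
    return sorted((s, t) for s, t in s2t.items() if t2s.get(t) == s)


# Toy target-ontology embeddings (hypothetical values).
targets = {"Author": [1.0, 0.0], "Paper": [0.0, 1.0], "Writer": [0.9, 0.1]}
candidates = top_k([1.0, 0.0], targets, k=2)

# One ranking per matcher channel: initial, lexical, graphical.
fused = reciprocal_rank_fusion([
    ["Author", "Writer", "Paper"],
    ["Writer", "Author", "Paper"],
    ["Author", "Paper", "Writer"],
])

merged = merge_bidirectional(
    {"Creator": "Author", "Article": "Paper"},
    {"Author": "Creator", "Paper": "Review"},
)
print(candidates, fused, merged)
```

RRF uses only rank positions, so channels whose raw scores live on different scales can be combined without any calibration.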

2.3 Toolset and Integration

Tool coverage includes:

  • Ontology Parser (OWL API, rdflib)
  • SPARQL interface
  • Various prompt-wrapped retrievers and matchers
  • Hybrid memory storage and CRUD wrappers (PostgreSQL + pgvector)
  • RRF-based match summarization
  • LLM-based matching validation

The entire system executes via prompt templates or direct API invocation, compatible with frameworks such as LangChain.

2.4 Evaluation and Benchmarks

OWMM-Agent is evaluated on Ontology Alignment Evaluation Initiative (OAEI) benchmarks:

| Track | Year | F1 Score (Rank) | Notable Performance Patterns |
|---|---|---|---|
| Conference | 2022 | ≈0.83 (3/13) | Few-shot, Type 1 naming, high precision |
| Conference | 2023 | ≈0.82 (5/12) | – |
| Anatomy (Trivial) | 2022 | ≈0.95 (3/11) | Comparable to deep learning systems |
| Anatomy (Trivial) | 2023 | ≈0.95 (4/10) | – |
| Anatomy (Non-trivial) | – | ≈0.77 | Outperforms 9/11 traditional OM systems |
| MSE Track | Varies | 0.80–0.88 (top-tier) | Robust to subsumption noise, abbreviations, hierarchy |

Precision and recall are defined as:

$\text{Precision} = \frac{|A \cap R|}{|A|}, \qquad \text{Recall} = \frac{|A \cap R|}{|R|}$

where $A$ is the set of alignments produced by the agent and $R$ is the reference alignment.
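
A small worked example makes these metrics concrete, together with the F1 score reported in the table (the harmonic mean of precision and recall). The entity pairs below are invented for illustration and do not come from any OAEI track.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def evaluate(alignment: Set[Pair], reference: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of a predicted alignment against a reference."""
    true_positives = len(alignment & reference)
    precision = true_positives / len(alignment) if alignment else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical predicted and reference alignments.
predicted = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference = {("a", "x"), ("b", "y"), ("e", "v")}

precision, recall, f1 = evaluate(predicted, reference)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```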

2.5 Analysis of Few-Shot and Complex OM

OWMM-Agent for OM leverages retrieval-augmented generation (RAG), chain-of-thought (CoT) planning, and in-context exemplars, without requiring model fine-tuning. It demonstrates robustness to challenging settings such as few labeled alignments, deep hierarchies, noisy subsumption, and abbreviation-heavy domains by exploiting multi-evidence fusion and explicit bidirectional search/merging. Prompt engineering and coordinated agent memory ensure correct tool activation and alignment consistency across search directions.

3. Comparative Summary

Both instantiations of OWMM-Agent share characteristics of modular agent orchestration, hybrid memory architectures, structured toolsets for task decomposition, and reliance on foundation models for high-level reasoning. In open-world mobile manipulation, emphasis lies on cross-modal (vision-language) reasoning, hierarchical memory, policy-to-control abstraction, and large-scale synthetic data generation for fine-tuning (Chen et al., 4 Jun 2025). In ontology matching, the focus is on multi-agent prompt frameworks, evidence fusion, and LLM-backed semantic validation under minimal supervision (Qiang et al., 2023). Both achieve state-of-the-art or near state-of-the-art performance in their respective evaluation regimes, underscoring the generality and flexibility of agentic architectures built atop modern foundation models.

4. Limitations and Future Directions

OWMM-Agent systems face limitations particular to their domains: dependence on pre-mapped scene graphs and embodiment-specific fine-tuning for manipulation, and on prompt design and retrieval thresholds for ontology matching. Both approaches are amenable to enhancements via online adaptation, lifelong and cross-embodiment learning, incorporation of auxiliary sensory feedback, and further scaling of synthetic or real-world data. A plausible implication is that these agentic paradigms may generalize to additional domains as prompt-tool and memory interfaces mature.
