Modular AR Agent System
- Modular AR Agent System is an AR architecture that decouples perception, reasoning, memory, and action into independent, upgradable modules.
- It employs inter-module communication through persistent memory, event queues, and defined API contracts to integrate diverse sensor data.
- The system is designed for scalability and extensibility, supporting rapid model updates, multi-modal interfacing, and real-time operation.
A Modular AR Agent System is an augmented reality (AR) software architecture that decomposes perception, reasoning, behavior, and user interaction into loosely coupled, independently upgradable modules. This design paradigm enables rapid adaptation to new sensors, models, and user tasks, supporting integration of multiple modalities (vision, language, audio, affect) and scalable real-time operation under diverse interaction contexts. Prominent modular AR agent systems in recent research include frameworks for embodied robotic agents (Pratik et al., 2021), emotion-aware companions (Xi et al., 12 Aug 2025), task-adaptive spatial retrieval (Guo et al., 29 Nov 2025), and agentic reasoning for multi-agent orchestration (Yao et al., 7 Oct 2025). Core to these systems is the abstraction and isolation of agent skills as plug-and-play perceptual, cognitive, memory, and action modules unified through persistent memory stores, event queues, and defined API contracts.
1. Core Architectural Principles
All leading modular AR agent systems are characterized by the following principles:
- Functional Decomposition: Cognitive operations are split among specialized modules (e.g., perception, dialogue, memory, behavior orchestration), each with well-specified I/O signatures. Modules can be independently replaced or scaled, minimizing mutual dependencies (Pratik et al., 2021, Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025).
- Inter-Module Communication: Persistent memory (relational tables or blackboard systems), event/task queues, and publish/subscribe event buses are used for decoupled data and command exchange across modules (Pratik et al., 2021, Xi et al., 12 Aug 2025).
- Plug-and-Play Tooling: New detectors, semantic parsers, actuation routines, or reasoning subroutines can be registered by implementing or inheriting the requisite interface—enabling seamless integration of emerging models (e.g., MLLMs, vision transformers); a minimal interface sketch follows this list (Guo et al., 29 Nov 2025).
- Cross-Modal Interfacing: Modules ingest and output data from diverse sources: audio, text, RGB/depth scenes, gestures, ego pose, calendar/contextual data (Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025).
- Task-Driven Orchestration: A controller agent (or meta-policy) routes, prioritizes, and executes behaviors according to live world state and task context, supporting both reactive and proactive agent behaviors (Pratik et al., 2021, Xi et al., 12 Aug 2025, Yao et al., 7 Oct 2025, Guo et al., 29 Nov 2025).
- Continuous Learning & Adaptation: Architectures support self-supervision (cross-modal co-occurrence logging), user corrections, and reward-driven module evolution for long-term adaptability (Pratik et al., 2021, Yao et al., 7 Oct 2025).
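The plug-and-play and API-contract principles above can be illustrated with a minimal sketch of a module interface plus a publish/subscribe event bus. All names here (ARModule, EventBus, the topic strings) are illustrative assumptions, not APIs of the cited systems.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Any, Callable, Dict, List


class EventBus:
    """Minimal publish/subscribe bus for decoupled inter-module messaging."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: Any) -> None:
        for handler in self._subscribers[topic]:
            handler(payload)


class ARModule(ABC):
    """API contract every plug-and-play module implements."""

    # Published I/O specification: topics this module consumes and produces.
    input_topics: List[str] = []
    output_topics: List[str] = []

    def register(self, bus: EventBus) -> None:
        for topic in self.input_topics:
            bus.subscribe(topic, self.on_event)
        self._bus = bus

    @abstractmethod
    def on_event(self, payload: Any) -> None:
        """Handle an incoming event and optionally publish results."""


class ObjectDetectorModule(ARModule):
    """Example perception module: consumes frames, publishes detections."""

    input_topics = ["camera.frame"]
    output_topics = ["perception.detections"]

    def on_event(self, frame: Any) -> None:
        detections = [{"label": "toolbox", "score": 0.9}]  # stand-in for a real model
        self._bus.publish("perception.detections", detections)


bus = EventBus()
detector = ObjectDetectorModule()
detector.register(bus)
bus.publish("camera.frame", object())  # detector reacts and republishes detections
```

Because each module declares its topics up front, a controller can wire, replace, or relocate modules without touching their internals, which is the essence of the decoupling described above.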
2. Module Taxonomy and Data Interfaces
Exemplary Module Types
| Module Type | Typical I/O Signature | Example Systems |
|---|---|---|
| Perception (Vision, Audio) | Image/Audio/Frame → Features/Detections/Emotion labels | (Pratik et al., 2021, Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025) |
| Memory System | Query/Write (filter φ) → MemoryNode(s) | (Pratik et al., 2021, Xi et al., 12 Aug 2025) |
| Dialogue / Intent | Utterance/Text → Parsed Structure/Action Plan | (Pratik et al., 2021, Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025) |
| Task Executors | TaskSpec × Observations → Updated TaskState/Output | (Pratik et al., 2021, Guo et al., 29 Nov 2025) |
| Behavior Orchestrator | Events, Memory, Policies → Command Routing/Request | (Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025) |
| Compression/Pruning | Memory Buffer → Compressed/Filtered Memory | (Xi et al., 12 Aug 2025) |
| Meta-Policy/Reasoning ARM | Problem/History → Next Step/Solution | (Yao et al., 7 Oct 2025) |
All modules must publish their input/output data specification, allowing the rest of the system to remain agnostic to internal implementation.
Example: Perception modules in droidlet write their outputs as structured memory nodes, with deduplication and time-based persistence handled by the shared memory (Pratik et al., 2021).
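This write-deduplicate-expire pattern can be sketched as follows. The MemoryNode fields, deduplication radius, and time-to-live are illustrative assumptions, not droidlet's actual memory schema.

```python
import time
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemoryNode:
    """Illustrative detection record written by a perception module."""
    label: str
    position: Tuple[float, float, float]
    last_seen: float = field(default_factory=time.time)


class SharedMemory:
    """Shared store with naive dedup (same label near same position) and expiry."""

    def __init__(self, dedup_radius: float = 0.25, ttl_s: float = 300.0) -> None:
        self.nodes: List[MemoryNode] = []
        self.dedup_radius = dedup_radius
        self.ttl_s = ttl_s

    def write(self, node: MemoryNode) -> None:
        for existing in self.nodes:
            if (existing.label == node.label
                    and self._dist(existing.position, node.position) < self.dedup_radius):
                existing.last_seen = node.last_seen  # refresh instead of duplicating
                return
        self.nodes.append(node)

    def expire(self) -> None:
        """Drop nodes that have not been re-observed within the time-to-live window."""
        now = time.time()
        self.nodes = [n for n in self.nodes if now - n.last_seen < self.ttl_s]

    @staticmethod
    def _dist(a: Tuple[float, float, float], b: Tuple[float, float, float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```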
3. Memory, Persistence, and Compression
Shared memory serves as the foundational data bus across modules, storing facts, event logs, detections, and task descriptors. Two notable patterns exist:
- Relational/Graph Memory: Structured as nodes and edges; supports fast logical queries over node attributes, tags, and relations (Pratik et al., 2021, Guo et al., 29 Nov 2025).
- Progressive Compression: Long-term storage is managed via algorithms such as Temporal Binary Compression (TBC), which groups conversation turns or events into epochs and recursively summarizes within pairs; and Dynamic Importance Memory Filter (DIMF), which assigns context-dependent importance scores and prunes entries below a percentile threshold, maximizing retention of salient context (Xi et al., 12 Aug 2025); a minimal pruning sketch appears at the end of this section.
Memory is continually updated by perception and interaction modules, supports semantic/keyword retrieval, and can be queried by downstream behavior modules for context-aware operation.
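As a concrete illustration of the importance-based pruning described above, the sketch below scores entries with a toy recency-plus-keyword heuristic and drops everything under a percentile cutoff. It is an assumption-laden stand-in, not the DIMF scoring function from (Xi et al., 12 Aug 2025).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    text: str
    recency: float      # 0..1, higher = more recent
    keyword_hits: int   # matches against current task/user context


def importance(entry: MemoryEntry, w_recency: float = 0.6, w_keywords: float = 0.4) -> float:
    """Toy context-dependent importance score (not the DIMF formulation)."""
    return w_recency * entry.recency + w_keywords * min(entry.keyword_hits, 5) / 5.0


def prune_below_percentile(buffer: List[MemoryEntry], percentile: float = 30.0) -> List[MemoryEntry]:
    """Keep only entries whose importance is at or above the given percentile."""
    if not buffer:
        return buffer
    scores = sorted(importance(e) for e in buffer)
    idx = min(int(len(scores) * percentile / 100.0), len(scores) - 1)
    cutoff = scores[idx]
    return [e for e in buffer if importance(e) >= cutoff]
```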
4. Orchestration, Reasoning, and Meta-Policies
High-level control of agent behavior is mediated by prioritized queues or meta-policy agents:
- Task and Dialogue Queues: All incoming commands and utterances are queued as task and dialogue objects, with explicit stepping and state transitions. Task objects encapsulate action type, parameters, and priority; dialogue objects parse, interpret, and respond, often invoking neural semantic parsers or DSL interpreters (Pratik et al., 2021, Xi et al., 12 Aug 2025). A prioritized-queue sketch appears at the end of this section.
- Meta-Policy & Reasoning Modules ("ARMs"): Agentic Reasoning Modules generalize Chain-of-Thought reasoning by encapsulating each reasoning step as a modular Python program, potentially itself a micro-multi-agent system. Automatic evolution and selection of ARMs are conducted via reflection-guided evolutionary search over program space, yielding modules that outperform static, hand-coded baselines and retain cross-domain robustness (Yao et al., 7 Oct 2025).
- Plug-and-Play Reasoning Tools: Scene graph builders, relation inference algorithms, or measurement tools are invoked dynamically based on parsed user intent and relevance within the world model (Guo et al., 29 Nov 2025).
The orchestration layer is responsible for selecting which modules are invoked for a specific task, ensuring both modular reuse and adaptive behavior.
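The queue-based orchestration described in this section reduces to a prioritized task queue stepped once per control-loop iteration, as in the sketch below. The Task and Orchestrator interfaces are illustrative assumptions rather than the droidlet or Livia APIs.

```python
import heapq
import itertools
from abc import ABC, abstractmethod
from typing import List, Tuple


class Task(ABC):
    """A unit of agent behavior with explicit stepping and completion state."""

    def __init__(self, priority: int = 0) -> None:
        self.priority = priority
        self.finished = False

    @abstractmethod
    def step(self, world_state: dict) -> None:
        """Advance the task by one increment given the current world state."""


class Orchestrator:
    """Controller that routes execution to the highest-priority pending task."""

    def __init__(self) -> None:
        self._queue: List[Tuple[int, int, Task]] = []
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def submit(self, task: Task) -> None:
        heapq.heappush(self._queue, (-task.priority, next(self._counter), task))

    def step(self, world_state: dict) -> None:
        """Run one control-loop iteration: step the most urgent unfinished task."""
        while self._queue:
            _, _, task = self._queue[0]
            if task.finished:
                heapq.heappop(self._queue)
                continue
            task.step(world_state)
            if task.finished:
                heapq.heappop(self._queue)
            return
```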
5. Multi-Modal Extension and AR Integration
Augmented reality imposes requirements for real-time sensor fusion, responsive feedback, and embodied interaction. Leading modular AR agent systems deploy the following strategies:
- Perception-Action Integration: Multi-modal inputs (RGB, depth, IMU, speech, gesture) are processed by dedicated modules and fused via the world model. For spatial tasks, 2D detections are lifted to 3D anchors using camera calibration (intrinsic) matrices and device pose transforms, as sketched after this list (Guo et al., 29 Nov 2025).
- Dynamic AR Scene Graph: Scene graphs encode spatial, structural, functional, and causal relations among physical and virtual objects, enabling queries like "Which box is behind the toolbox but closer to me than the printer?" with relational and geometric disambiguation (Guo et al., 29 Nov 2025).
- Real-Time Synchronization: Frame loops synchronize backend decisions with front-end AR rendering at rates up to 30 FPS and response generation latencies below 100 ms for dialogue (Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025).
- Overlay & Feedback: Region-of-interest highlighting, haptic cues, and interactive clarification loops bridge human and agentic understanding and support in-the-loop correction (Guo et al., 29 Nov 2025).
- Extensibility: The modular framework supports plug-in addition of gesture recognizers, new reasoning agents, or alternative memory backends with no global refactoring (Xi et al., 12 Aug 2025, Guo et al., 29 Nov 2025).
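The 2D-to-3D lifting step referenced above amounts to standard pinhole back-projection followed by a camera-to-world pose transform. The intrinsic matrix values and identity pose in this sketch are placeholders, not parameters from the cited system.

```python
import numpy as np


def lift_to_world(u: float, v: float, depth_m: float,
                  K: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) with metric depth into world coordinates.

    K           : 3x3 camera intrinsic (calibration) matrix
    T_world_cam : 4x4 camera-to-world pose transform
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # viewing ray in camera frame
    p_cam = ray * depth_m                            # scale by measured depth
    p_hom = np.append(p_cam, 1.0)                    # homogeneous coordinates
    return (T_world_cam @ p_hom)[:3]                 # world-frame 3D anchor


# Illustrative usage with placeholder calibration and an identity pose.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
T_world_cam = np.eye(4)
anchor = lift_to_world(350.0, 260.0, depth_m=1.8, K=K, T_world_cam=T_world_cam)
```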
6. Evaluation, Scalability, and Empirical Results
Robustness and scalability of modular AR agent systems are demonstrated via dedicated benchmarks, ablation studies, and reproducible metrics:
- Performance Metrics: 3D localization error (e.g., 5.4 ± 2.1 cm, Success@10 cm = 88.7%) (Guo et al., 29 Nov 2025), emotion recognition accuracy (88% multimodal) (Xi et al., 12 Aug 2025), dialogue/task success rates, memory compression ratios (e.g., 70% reduction with >90% recall for salient memories) (Xi et al., 12 Aug 2025), and real-time throughput measurements; a short metric-computation sketch follows this list.
- Benchmarks: GroundedAR-Bench covers localization, relational grounding, and end-to-end spatial retrieval under variable scene complexity and clutter (Guo et al., 29 Nov 2025). Reasoning agents are evaluated on math, science, and live-updating QA benchmarks (Yao et al., 7 Oct 2025).
- Comparative Results: Modular systems consistently outperform monolithic or static agent baselines, particularly when integrating plug-and-use MLLMs or evolving high-performing reasoning blocks via ARM/meta-policy discovery (Yao et al., 7 Oct 2025, Guo et al., 29 Nov 2025).
- Scalability via Decoupling: Modules can be moved across hardware (GPU, CPU), resized for performance, or run at different frequencies without impacting correctness (Pratik et al., 2021).
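The localization metrics cited above (mean error and Success@10 cm) can be computed with a short helper such as the one below; the array shapes and example values are assumptions for illustration only.

```python
import numpy as np


def localization_metrics(pred: np.ndarray, gt: np.ndarray, thresh_m: float = 0.10):
    """Mean/std 3D localization error and Success@threshold.

    pred, gt : (N, 3) arrays of predicted and ground-truth anchor positions in metres.
    """
    errors = np.linalg.norm(pred - gt, axis=1)
    success_at = float(np.mean(errors <= thresh_m))
    return errors.mean(), errors.std(), success_at


# Example: two predictions, one within 10 cm of its ground-truth anchor.
pred = np.array([[1.00, 0.50, 0.20], [2.00, 1.00, 0.00]])
gt = np.array([[1.05, 0.50, 0.20], [2.30, 1.00, 0.00]])
mean_err, std_err, succ = localization_metrics(pred, gt)  # succ == 0.5
```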
7. Extensibility, Personalization, and Future Directions
The modular paradigm enables:
- Personalization: Agents adapt interaction style and context by swapping personality archetypes in prompt templates, integrating user feedback into dynamic memory scoring, and supporting user-selected behavior policies (Xi et al., 12 Aug 2025).
- Multi-Agent and Multi-User Extensions: New agent modules (e.g., gesture understanding, group memory) or additional user-specific orchestrators are instantiated via event bus registration (Xi et al., 12 Aug 2025). Task coordination among collaborative AR agents is routinely managed via shared memory/state and relation-augmented scene graphs (Guo et al., 29 Nov 2025).
- Hardware Adaptation: Subsets of modules can be deployed on-device (e.g., for haptic AR companions), with intensive models offloaded to cloud-based execution (Xi et al., 12 Aug 2025).
- Continuous Learning and Annotation: Live correction and interactive annotation are natively supported for real-world data collection, boosting sample efficiency and rapid deployment in evolving environments (Pratik et al., 2021).
- Plug-in Model Upgrades: Any model that implements the core API contract (e.g., detector, MLLM, relation-proposer) can be swapped in-place, facilitating rapid leveraging of state-of-the-art models without retraining or refactoring (Guo et al., 29 Nov 2025).
A plausible implication is that such modular frameworks will underpin scalable, adaptable AR applications across domains ranging from embodied robotics to emotionally intelligent companions.
References:
- (Pratik et al., 2021) droidlet: modular, heterogenous, multi-modal agents
- (Xi et al., 12 Aug 2025) Livia: An Emotion-Aware AR Companion Powered by Modular AI Agents and Progressive Memory Compression
- (Yao et al., 7 Oct 2025) ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
- (Guo et al., 29 Nov 2025) Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR