Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones
- The paper introduces AeroAgent, a dual-layer system that separates high-level strategic planning (cerebrum) from precise low-level execution (cerebellum) on drones.
- The paper leverages a multimodal memory database and the ROSchain framework to integrate sensory data with few-shot learning and standardized ROS APIs.
- The paper demonstrates enhanced performance in simulations—such as wildfire rescue, vision-based landing, and infrastructure inspection—outperforming traditional DRL-based methods.
Overview
The paper proposes a framework named AeroAgent that integrates an agent based on a large multimodal model (LMM) with an autonomous control system for industrial drones. The concept combines the principles of "agent as cerebrum" and "controller as cerebellum" to enhance drone efficiency and performance in complex, real-world tasks: the agent handles high-level decision-making, while the controller executes precise, low-level actions.
Core Contributions
The paper makes several significant contributions:
- Novel Architecture: AeroAgent's unique agent-controller paradigm is operationalized on industrial drones, featuring separate roles for high-level planning (cerebrum) and low-level execution (cerebellum). This separation enhances task-specific performance and overall system stability.
- Embodied Agent Based on LMMs: Unlike traditional LLM-based agents, the proposed architecture exhibits superior integration with robotic systems, leveraging the multimodal capabilities of LMMs for better situational understanding and decision-making.
- ROSchain Framework: ROSchain enables seamless integration of LMM-based agents with the Robot Operating System (ROS), facilitating communication and execution across sensory and actuator modules through a set of standardized APIs and modules.
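To make the bridging idea concrete, here is a minimal sketch of the pattern ROSchain implies: the agent emits named action commands, and a registry maps each command to a handler that would, in a real deployment, publish on the corresponding ROS topic or call a ROS service. The class and action names below are illustrative assumptions, not the actual ROSchain API, and real ROS publishing (via rospy/rclpy) is stubbed out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ActionCommand:
    name: str    # e.g. "goto", "takeoff" (hypothetical action names)
    params: dict # parameters validated before dispatch


class ActionRegistry:
    """Maps agent-issued action names to handlers (stand-ins for ROS publishers)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], str]] = {}

    def register(self, name: str, handler: Callable[[dict], str]) -> None:
        self._handlers[name] = handler

    def dispatch(self, cmd: ActionCommand) -> str:
        # Reject actions the embodied action library does not expose.
        if cmd.name not in self._handlers:
            raise KeyError(f"unknown action: {cmd.name}")
        return self._handlers[cmd.name](cmd.params)


registry = ActionRegistry()
# In a real system this handler would publish a position setpoint via ROS;
# here it just formats the command so the flow is visible.
registry.register("goto", lambda p: f"goto({p['x']}, {p['y']}, {p['z']})")

result = registry.dispatch(ActionCommand("goto", {"x": 1.0, "y": 2.0, "z": 10.0}))
```

The value of this indirection is that the LMM never touches ROS directly: it only selects from a vetted action vocabulary, which keeps malformed model output from reaching the flight controller.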
Technical Approach
The AeroAgent architecture comprises three integral components:
- Automatic Plan Generator: Utilizes LMMs to process multimodal inputs and generate high-level, open-ended plans. The inputs include natural language descriptions of tasks, sensor data (images, audio, location), and memory database retrievals.
- Multimodal Memory Database: Stores multimodal task-related memories and contextual data, supporting few-shot learning and dynamic task adaptation. It enhances the agent’s ability to retrieve pertinent information effectively.
- Embodied Action Library: Curates specific actions tailored to the drone's payload and operational requirements, ensuring the compatibility of action commands with actuator capabilities.
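The interaction among the three components can be sketched as a single decision step: retrieve the most similar stored memories for the current observation, hand them to the plan generator as few-shot context, and filter the resulting plan against the action library. All function and field names here are illustrative assumptions (the paper does not publish this code), and the LMM call is stubbed.

```python
def retrieve_memories(db, query_vec, k=2):
    """Multimodal memory database: rank entries by cosine similarity
    to the current observation embedding (few-shot retrieval)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    return sorted(db, key=lambda m: cos(m["vec"], query_vec), reverse=True)[:k]


def generate_plan(lmm, task, observation, memories, action_library):
    """Automatic plan generator: the LMM sees the task, sensor data, and
    retrieved memories, and must answer with actions from the library."""
    prompt = {
        "task": task,
        "observation": observation,
        "few_shot": [m["text"] for m in memories],
        "allowed_actions": sorted(action_library),
    }
    plan = lmm(prompt)  # stubbed LMM call in this sketch
    # Embodied action library acts as a gate: drop steps the drone cannot execute.
    return [step for step in plan if step["action"] in action_library]


db = [
    {"vec": [1.0, 0.0], "text": "smoke plume -> ascend and widen search"},
    {"vec": [0.0, 1.0], "text": "low battery -> return to base"},
]
actions = {"ascend", "goto", "land"}
fake_lmm = lambda prompt: [{"action": "ascend", "alt": 30},
                           {"action": "teleport"}]  # invalid step gets filtered
plan = generate_plan(fake_lmm, "wildfire search", {"smoke": True},
                     retrieve_memories(db, [0.9, 0.1], k=1), actions)
```

The gating step at the end mirrors the stated role of the action library: whatever the model proposes, only commands compatible with the drone's actuators survive.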
Capabilities and Performance
The capabilities of AeroAgent were evaluated using real-world-inspired scenarios within a high-fidelity simulation environment (AirGen), which offers a realistic assessment due to its comprehensive representation of physical and environmental conditions. The scenarios included wildfire search and rescue, vision-based landing, infrastructure inspection, and safe navigation.
Wildfire Search and Rescue: AeroAgent demonstrated superior performance in generating coordinated search plans, real-time updates, and emergency response strategies compared to baseline DRL agents and single-call LMM approaches.
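One building block of a coordinated search plan is the sweep each drone flies over its assigned sector. The paper does not specify the pattern, so as a hedged illustration, here is a boustrophedon ("lawnmower") waypoint generator over a rectangular area, the kind of primitive a high-level planner might emit per drone:

```python
def lawnmower(x0, y0, width, height, spacing):
    """Generate back-and-forth sweep waypoints covering a rectangle.
    `spacing` is the distance between parallel passes (e.g. camera footprint)."""
    waypoints = []
    y = y0
    left_to_right = True
    while y <= y0 + height:
        row = [(x0, y), (x0 + width, y)]
        # Alternate direction each pass so the drone never backtracks.
        waypoints.extend(row if left_to_right else row[::-1])
        left_to_right = not left_to_right
        y += spacing
    return waypoints
```

Partitioning the fire zone into rectangles and assigning one sweep per drone is one plausible way a planner could realize "coordinated search"; the geometry here is an assumption, not the paper's method.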
Vision-Based Landing: It achieved precise landings by hierarchically processing visual inputs and executing fine-grained maneuvers, outperforming both DRL-based agents and single-call LMM methods.
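The fine-grained maneuver stage of vision-based landing typically reduces to centering the detected pad in the image before descending. As a minimal sketch (gains, deadband, and sign conventions are illustrative assumptions, not values from the paper), a proportional controller maps the pad's pixel offset to lateral velocity commands:

```python
def landing_velocity(pad_px, image_size, kp=0.005, deadband_px=5):
    """Map the landing pad's pixel position to lateral velocity commands.
    Returns (vx, vy, descend): velocities that push the pad toward the
    image center, and a flag allowing descent once it is centered."""
    cx, cy = image_size[0] / 2, image_size[1] / 2
    ex, ey = pad_px[0] - cx, pad_px[1] - cy  # pixel error from image center
    vx = 0.0 if abs(ex) <= deadband_px else -kp * ex
    vy = 0.0 if abs(ey) <= deadband_px else -kp * ey
    # Only descend when the pad sits within the deadband on both axes.
    descend = abs(ex) <= deadband_px and abs(ey) <= deadband_px
    return vx, vy, descend
```

In the paper's split, the agent (cerebrum) would decide *that* a landing should happen and where, while a loop like this belongs to the controller (cerebellum) running at sensor rate.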
Infrastructure Inspection: The agent effectively identified faults and reported detailed assessments, leveraging multimodal memory for environmental context recognition.
Safe Navigation: AeroAgent showed enhanced exploration capabilities, dynamically adapting to new environments and avoiding obstacles efficiently.
Practical Implications and Future Developments
The practical implications of this research extend to various industrial applications where autonomous drones require enhanced decision-making and operational efficiency. This includes scenarios such as emergency response, industrial inspections, and autonomous transport systems.
Future work should focus on extending the complexity and diversity of tasks to fully validate AeroAgent's real-world applicability. This may involve longer-duration missions, the inclusion of additional sensor and actuator types, and enhanced robustness of the system architecture to ensure reliability in diverse operational settings.
Conclusion
By presenting a novel, LMM-based approach for autonomous drone operations, the authors highlight significant advancements in the field of embodied intelligence. The AeroAgent framework, complemented by the ROSchain integration, shows substantial promise for improving the autonomy and effectiveness of industrial drones across various application domains. Future developments are expected to refine and expand the system’s capabilities, providing robust solutions for real-world challenges.