Agent as Cerebrum, Controller as Cerebellum: Implementing an Embodied LMM-based Agent on Drones
- The paper introduces AeroAgent, a dual-layer system that separates high-level strategic planning (cerebrum) from precise low-level execution (cerebellum) on drones.
- The paper leverages a multimodal memory database and the ROSchain framework to integrate sensory data with few-shot learning and standardized ROS APIs.
- The paper demonstrates enhanced performance in simulations—such as wildfire rescue, vision-based landing, and infrastructure inspection—outperforming traditional DRL-based methods.
Overview
The paper proposes a framework named AeroAgent that integrates an agent based on a large multimodal model (LMM) with an autonomous control system for industrial drones. The concept combines the principles of "agent as cerebrum" and "controller as cerebellum" to enhance drone efficiency and performance in complex, real-world tasks: the agent handles high-level decision-making, while the controller executes precise, low-level actions.
Core Contributions
The paper makes several significant contributions:
- Novel Architecture: AeroAgent's unique agent-controller paradigm is operationalized on industrial drones, featuring separate roles for high-level planning (cerebrum) and low-level execution (cerebellum). This separation enhances task-specific performance and overall system stability.
- Embodied Agent Based on LMMs: Unlike traditional LLM-based agents, the proposed architecture exhibits superior integration with robotic systems, leveraging the multimodal capabilities of LMMs for better situational understanding and decision-making.
- ROSchain Framework: ROSchain enables seamless integration of LMM-based agents with the Robot Operating System (ROS), facilitating communication and execution across sensory and actuator modules through a set of standardized APIs and modules.
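To make the bridging idea concrete, here is a minimal sketch of the pattern ROSchain implies: the agent emits named action commands, and a registry maps each command to a handler that would, in a real deployment, publish on the corresponding ROS topic or call a ROS service. The class and action names below are illustrative assumptions, not the actual ROSchain API, and real ROS publishing (via rospy/rclpy) is stubbed out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ActionCommand:
    name: str    # e.g. "goto", "takeoff" (hypothetical action names)
    params: dict # parameters validated before dispatch


class ActionRegistry:
    """Maps agent-issued action names to handlers (stand-ins for ROS publishers)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], str]] = {}

    def register(self, name: str, handler: Callable[[dict], str]) -> None:
        self._handlers[name] = handler

    def dispatch(self, cmd: ActionCommand) -> str:
        # Reject actions the embodied action library does not expose.
        if cmd.name not in self._handlers:
            raise KeyError(f"unknown action: {cmd.name}")
        return self._handlers[cmd.name](cmd.params)


registry = ActionRegistry()
# In a real system this handler would publish a position setpoint via ROS;
# here it just formats the command so the flow is visible.
registry.register("goto", lambda p: f"goto({p['x']}, {p['y']}, {p['z']})")

result = registry.dispatch(ActionCommand("goto", {"x": 1.0, "y": 2.0, "z": 10.0}))
```

The value of this indirection is that the LMM never touches ROS directly: it only selects from a vetted action vocabulary, which keeps malformed model output from reaching the flight controller.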
Technical Approach
The AeroAgent architecture comprises three integral components:
- Automatic Plan Generator: Utilizes LMMs to process multimodal inputs and generate high-level, open-ended plans. The inputs include natural language descriptions of tasks, sensor data (images, audio, location), and memory database retrievals.
- Multimodal Memory Database: Stores multimodal task-related memories and contextual data, supporting few-shot learning and dynamic task adaptation. It enhances the agent’s ability to retrieve pertinent information effectively.
- Embodied Action Library: Curates specific actions tailored to the drone's payload and operational requirements, ensuring the compatibility of action commands with actuator capabilities.
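The interaction among the three components can be sketched as a single decision step: retrieve the most similar stored memories for the current observation, hand them to the plan generator as few-shot context, and filter the resulting plan against the action library. All function and field names here are illustrative assumptions (the paper does not publish this code), and the LMM call is stubbed.

```python
def retrieve_memories(db, query_vec, k=2):
    """Multimodal memory database: rank entries by cosine similarity
    to the current observation embedding (few-shot retrieval)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    return sorted(db, key=lambda m: cos(m["vec"], query_vec), reverse=True)[:k]


def generate_plan(lmm, task, observation, memories, action_library):
    """Automatic plan generator: the LMM sees the task, sensor data, and
    retrieved memories, and must answer with actions from the library."""
    prompt = {
        "task": task,
        "observation": observation,
        "few_shot": [m["text"] for m in memories],
        "allowed_actions": sorted(action_library),
    }
    plan = lmm(prompt)  # stubbed LMM call in this sketch
    # Embodied action library acts as a gate: drop steps the drone cannot execute.
    return [step for step in plan if step["action"] in action_library]


db = [
    {"vec": [1.0, 0.0], "text": "smoke plume -> ascend and widen search"},
    {"vec": [0.0, 1.0], "text": "low battery -> return to base"},
]
actions = {"ascend", "goto", "land"}
fake_lmm = lambda prompt: [{"action": "ascend", "alt": 30},
                           {"action": "teleport"}]  # invalid step gets filtered
plan = generate_plan(fake_lmm, "wildfire search", {"smoke": True},
                     retrieve_memories(db, [0.9, 0.1], k=1), actions)
```

The gating step at the end mirrors the stated role of the action library: whatever the model proposes, only commands compatible with the drone's actuators survive.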
Capabilities and Performance
The capabilities of AeroAgent were evaluated using real-world-inspired scenarios within a high-fidelity simulation environment (AirGen), which offers a realistic assessment due to its comprehensive representation of physical and environmental conditions. The scenarios included wildfire search and rescue, vision-based landing, infrastructure inspection, and safe navigation.
Wildfire Search and Rescue: AeroAgent demonstrated superior performance in generating coordinated search plans, real-time updates, and emergency response strategies compared to baseline DRL agents and single-call LMM approaches.
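One building block of a coordinated search plan is the sweep each drone flies over its assigned sector. The paper does not specify the pattern, so as a hedged illustration, here is a boustrophedon ("lawnmower") waypoint generator over a rectangular area, the kind of primitive a high-level planner might emit per drone:

```python
def lawnmower(x0, y0, width, height, spacing):
    """Generate back-and-forth sweep waypoints covering a rectangle.
    `spacing` is the distance between parallel passes (e.g. camera footprint)."""
    waypoints = []
    y = y0
    left_to_right = True
    while y <= y0 + height:
        row = [(x0, y), (x0 + width, y)]
        # Alternate direction each pass so the drone never backtracks.
        waypoints.extend(row if left_to_right else row[::-1])
        left_to_right = not left_to_right
        y += spacing
    return waypoints
```

Partitioning the fire zone into rectangles and assigning one sweep per drone is one plausible way a planner could realize "coordinated search"; the geometry here is an assumption, not the paper's method.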
Vision-Based Landing: It achieved precise landings by hierarchically processing visual inputs and executing fine-grained maneuvers, outperforming both DRL-based agents and single-call LMM methods.
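The fine-grained maneuver stage of vision-based landing typically reduces to centering the detected pad in the image before descending. As a minimal sketch (gains, deadband, and sign conventions are illustrative assumptions, not values from the paper), a proportional controller maps the pad's pixel offset to lateral velocity commands:

```python
def landing_velocity(pad_px, image_size, kp=0.005, deadband_px=5):
    """Map the landing pad's pixel position to lateral velocity commands.
    Returns (vx, vy, descend): velocities that push the pad toward the
    image center, and a flag allowing descent once it is centered."""
    cx, cy = image_size[0] / 2, image_size[1] / 2
    ex, ey = pad_px[0] - cx, pad_px[1] - cy  # pixel error from image center
    vx = 0.0 if abs(ex) <= deadband_px else -kp * ex
    vy = 0.0 if abs(ey) <= deadband_px else -kp * ey
    # Only descend when the pad sits within the deadband on both axes.
    descend = abs(ex) <= deadband_px and abs(ey) <= deadband_px
    return vx, vy, descend
```

In the paper's split, the agent (cerebrum) would decide *that* a landing should happen and where, while a loop like this belongs to the controller (cerebellum) running at sensor rate.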
Infrastructure Inspection: The agent effectively identified faults and reported detailed assessments, leveraging multimodal memory for environmental context recognition.
Safe Navigation: AeroAgent showed enhanced exploration capabilities, dynamically adapting to new environments and avoiding obstacles efficiently.
Practical Implications and Future Developments
The practical implications of this research extend to various industrial applications where autonomous drones require enhanced decision-making and operational efficiency. This includes scenarios such as emergency response, industrial inspections, and autonomous transport systems.
Future work should focus on extending the complexity and diversity of tasks to fully validate AeroAgent's real-world applicability. This may involve longer-duration missions, the inclusion of additional sensor and actuator types, and enhanced robustness of the system architecture to ensure reliability in diverse operational settings.
Conclusion
By presenting a novel, LMM-based approach for autonomous drone operations, the authors highlight significant advancements in the field of embodied intelligence. The AeroAgent framework, complemented by the ROSchain integration, shows substantial promise for improving the autonomy and effectiveness of industrial drones across various application domains. Future developments are expected to refine and expand the system’s capabilities, providing robust solutions for real-world challenges.