Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System (2506.05020v2)

Published 5 Jun 2025 in cs.RO and cs.AI

Abstract: Heterogeneous multi-robot systems show great potential in complex tasks requiring hybrid cooperation. However, traditional approaches relying on static models often struggle with task diversity and dynamic environments. This highlights the need for generalizable intelligence that can bridge high-level reasoning with low-level execution across heterogeneous agents. To address this, we propose a hierarchical framework integrating a prompted LLM and a GridMask-enhanced fine-tuned Vision-Language Model (VLM). The LLM decomposes tasks and constructs a global semantic map, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning. Within this framework, the aerial robot follows an optimized global semantic path and continuously provides bird-view images, guiding the ground robot's local semantic navigation and manipulation, including target-absent scenarios where implicit alignment is maintained. Experiments on real-world cube or object arrangement tasks demonstrate the framework's adaptability and robustness in dynamic environments. To the best of our knowledge, this is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.


Hierarchical LLMs for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System

The paper entitled "Hierarchical LLMs for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System" introduces a hierarchical framework that integrates LLMs and fine-tuned vision-language models (VLMs) to address the challenges heterogeneous multi-robot systems face in dynamic environments. The approach targets task decomposition, semantic navigation, and manipulation, with a specific focus on aerial-ground cooperation.

Heterogeneous Multi-Robot Systems (HMRS) comprise diverse agents such as aerial, ground, and underwater robots, each contributing unique capabilities to handle complex tasks. Traditional methods rely on static models or predefined behaviors, and they often fall short when confronted with dynamic or unforeseen circumstances. The paper proposes a three-layer hierarchical framework in which LLMs handle high-level reasoning and task decomposition, while VLMs supply the semantic labels and spatial information needed for local execution.

Framework Overview

Reasoning Layer: This layer uses LLMs to break down user-provided high-level instructions into sub-tasks tailored to each robotic agent's capabilities. It performs task decomposition and motion function mapping, and it constructs a global semantic map from aggregated aerial observations. The hierarchical decomposition allows the system to assign tasks adaptively across robots and maintain coordination in dynamically changing environments.
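To make the idea of motion function mapping concrete, here is a minimal sketch of how decomposed sub-tasks might be checked against each robot's pre-programmed functions before dispatch. The `SubTask` type, the function names (`fly_to`, `navigate_to`, `grasp`, `place`), and the hard-coded plan standing in for LLM output are all hypothetical; the summary does not specify the paper's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SubTask:
    robot: str     # "aerial" or "ground" (hypothetical agent names)
    function: str  # name of a pre-programmed motion function
    target: str    # semantic label the function acts on


# Hypothetical motion functions; real ones would command the robots.
def fly_to(target: str) -> str:
    return f"aerial:fly_to({target})"


def navigate_to(target: str) -> str:
    return f"ground:navigate_to({target})"


def grasp(target: str) -> str:
    return f"ground:grasp({target})"


def place(target: str) -> str:
    return f"ground:place({target})"


# Each robot only exposes the functions it is physically capable of.
MOTION_FUNCTIONS: Dict[str, Dict[str, Callable[[str], str]]] = {
    "aerial": {"fly_to": fly_to},
    "ground": {"navigate_to": navigate_to, "grasp": grasp, "place": place},
}


def validate_plan(plan: List[SubTask]) -> List[SubTask]:
    """Reject any step that names a function the assigned robot lacks."""
    for step in plan:
        if step.function not in MOTION_FUNCTIONS.get(step.robot, {}):
            raise ValueError(f"{step.robot} cannot run {step.function}")
    return plan


def execute_plan(plan: List[SubTask]) -> List[str]:
    """Dispatch each validated step to its motion function, in order."""
    return [MOTION_FUNCTIONS[s.robot][s.function](s.target)
            for s in validate_plan(plan)]


# Stand-in for an LLM-produced decomposition of "pick up the red cube".
example_plan = [
    SubTask("aerial", "fly_to", "red cube"),
    SubTask("ground", "navigate_to", "red cube"),
    SubTask("ground", "grasp", "red cube"),
]
```

Validating against a per-robot registry is one simple way to keep a language model's output within each agent's capabilities; whether the paper does this explicitly is not stated in the summary.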

Perception Layer: A GridMask-enhanced VLM is responsible for extracting semantic labels and 2D spatial representations from aerial images. It provides the perceptual grounding necessary for semantic-aware manipulation and navigation tasks. This involves classifying objects based on their relevance to the task, maintaining semantic coherence, and supporting the local planning process through real-time observations.
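The summary does not describe how the GridMask is constructed. Purely as an illustration of how a grid can turn positions in an aerial image into coarse 2D spatial information, the sketch below assumes a uniform grid laid over the image; the function names, grid size, and image size are assumptions, not details from the paper.

```python
from typing import Tuple


def pixel_to_cell(x: float, y: float, width: int, height: int,
                  rows: int, cols: int) -> Tuple[int, int]:
    """Map a pixel position to the (row, col) of the grid cell containing it."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel lies outside the image")
    col = int(x * cols / width)
    row = int(y * rows / height)
    return row, col


def cell_center(row: int, col: int, width: int, height: int,
                rows: int, cols: int) -> Tuple[float, float]:
    """Return the pixel coordinates (x, y) of a grid cell's center."""
    return ((col + 0.5) * width / cols, (row + 0.5) * height / rows)
```

For example, with a hypothetical 640x480 image and a 6x8 grid, `pixel_to_cell(320, 240, 640, 480, 6, 8)` returns `(3, 4)`. A discrete cell index of this kind is easier for a language-based model to state than raw pixel coordinates, which is one plausible motivation for grid-based prompting.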

Execution Layer: Situated at the lowest level, this layer executes pre-programmed motion functions derived from the higher-level instructions and semantic information provided by the preceding layers. The aerial robot functions as a global path planner, using a leader-follower mechanism to guide the ground robot, which performs local navigation and manipulation while staying aligned with the aerial robot's optimized semantic path.
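A leader-follower mechanism can take many forms, and the summary does not say which controller the authors use. As a minimal sketch, assume the ground robot tracks the aerial robot's ground-projected position with a saturated proportional controller; the gain, speed limit, and stop radius below are illustrative values only.

```python
import math
from typing import Tuple


def follower_velocity(leader_xy: Tuple[float, float],
                      follower_xy: Tuple[float, float],
                      gain: float = 0.8,
                      max_speed: float = 0.5,
                      stop_radius: float = 0.1) -> Tuple[float, float]:
    """Planar velocity command steering the follower toward the leader.

    Speed grows with distance (proportional control), is capped at
    max_speed, and drops to zero once the follower is within stop_radius.
    """
    dx = leader_xy[0] - follower_xy[0]
    dy = leader_xy[1] - follower_xy[1]
    dist = math.hypot(dx, dy)
    if dist <= stop_radius:
        return (0.0, 0.0)
    speed = min(gain * dist, max_speed)
    # Scale the unit direction vector by the commanded speed.
    return (speed * dx / dist, speed * dy / dist)
```

Calling this once per control cycle with the latest leader position yields a follower that slows as it closes the gap and holds still when it is aligned under the leader.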

Experimental Evaluation

Experiments conducted with a Unitree Go1 quadruped robot and a custom quadrotor demonstrate the framework's effectiveness in real-world settings. These experiments validated the system's ability to accurately execute task directives such as object relocation and assembly operations. Results showed a high success rate in task decomposition and completion, with task decomposition accuracy reaching up to 100% in certain conditions, along with robust semantic navigation.

The paper provides empirical evidence highlighting the advantages of the GridMask-based perceptual strategy, which significantly enhances spatial accuracy, achieving more precise semantic understanding when compared to baseline models. This approach enables reliable object detection and consistent spatial reasoning necessary for orchestrating multi-agent robotic systems in practical scenarios.

Implications and Future Directions

The framework integrates reasoning, perception, and execution within HMRS to support semantic navigation and manipulation. Its hierarchical structure and modular separation of task decomposition, planning, and execution provide a foundation for generalizable systems that bridge high-level reasoning with low-level execution.

While the framework demonstrates robust performance in controlled task settings, future enhancements should address 3D motion planning for aerial robots in cluttered and unstructured environments. Additionally, refinement of multi-agent coordination strategies and broader reasoning capabilities would extend its applicability to complex tasks such as search-and-rescue missions and large-scale industrial automation.

This paper contributes significantly to the ongoing exploration of multi-agent systems, setting a trajectory for developing more adaptive, responsive, and intelligent robotic networks suitable for dynamic, real-world applications.


Authors (7)

Haokun Liu
Zhaoqi Ma
Yunong Li
Junichiro Sugihara
Yicheng Chen
Jinjie Li
Moju Zhao
