SARO: Space-Aware Robot System for Terrain Crossing via Vision-Language Model (2407.16412v3)

Published 23 Jul 2024 in cs.RO

Abstract: The application of vision-language models (VLMs) has achieved impressive success in various robotics tasks. However, there are few explorations for these foundation models used in quadruped robot navigation through terrains in 3D environments. In this work, we introduce SARO (Space Aware Robot System for Terrain Crossing), an innovative system composed of a high-level reasoning module, a closed-loop sub-task execution module, and a low-level control policy. It enables the robot to navigate across 3D terrains and reach the goal position. For high-level reasoning and execution, we propose a novel algorithmic system taking advantage of a VLM, with a design of task decomposition and a closed-loop sub-task execution mechanism. For low-level locomotion control, we utilize the Probability Annealing Selection (PAS) method to effectively train a control policy by reinforcement learning. Numerous experiments show that our whole system can accurately and robustly navigate across several 3D terrains, and its generalization ability ensures the applications in diverse indoor and outdoor scenarios and terrains. Project page: https://saro-vlm.github.io/

Summary

The paper introduces SARO, a system that integrates vision-language informed task decomposition for autonomous terrain crossing.
It combines VLM-based high-level reasoning with a low-level locomotion policy trained by reinforcement learning using Probability Annealing Selection (PAS).
Experimental results show higher navigation success than existing baselines, and in real-world locomotion tests the control policy reaches 100% success on stairs and grassland.

The paper presents SARO, referred to below as the "Cross Anything System" (CAS), a system designed for autonomous navigation of quadruped robots in complex 3D terrains. It integrates a high-level reasoning module built on vision-language models (VLMs) with a low-level control policy trained using Probability Annealing Selection (PAS). The authors are affiliated with Shanghai Qi Zhi Institute, Zhejiang University, Shanghai Jiao Tong University, and Tsinghua University. The overarching aim is to enhance the robot's ability to navigate autonomously in both indoor and outdoor environments across a variety of challenging terrains.

Main Contributions

Cross Anything System (CAS)
- High-level Reasoning and Motion Planning: CAS leverages a zero-shot VLM for task decomposition and motion planning. This component uses ego-view images to break down complex navigation tasks into manageable sub-tasks. The subtasks are then executed in a closed-loop manner for robust performance.
- Auxiliary Modules: These complement the VLM, including localization and trajectory refinement modules, enhancing the overall situational awareness and precision in motion execution.
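
The closed-loop sub-task execution described above can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `SubTask`, `run_closed_loop`, and the 1-D `ToyRobot` are stand-ins, the sub-task list is hard-coded rather than produced by a VLM, and `observe`/`act` stand in for the localization module and the low-level policy. The point is only the control pattern: act, re-observe, and advance to the next sub-task once the current one is verified as complete.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubTask:
    name: str       # e.g. "approach stairs" (illustrative label only)
    target: float   # hypothetical 1-D goal position for this sub-task


def run_closed_loop(
    subtasks: List[SubTask],
    observe: Callable[[], float],   # stand-in for the localization module
    act: Callable[[float], None],   # stand-in for the low-level policy
    tol: float = 0.05,
    max_steps: int = 200,
) -> bool:
    """Execute sub-tasks in order, re-checking the observation after every action."""
    for task in subtasks:
        for _ in range(max_steps):
            error = task.target - observe()
            if abs(error) <= tol:
                break  # sub-task verified as done; move on to the next one
            # Clip the command so each step is a small, correctable motion.
            act(max(-0.2, min(0.2, error)))
        else:
            return False  # step budget exhausted without reaching the target
    return True


class ToyRobot:
    """1-D toy world used only to make the sketch runnable."""

    def __init__(self) -> None:
        self.x = 0.0

    def observe(self) -> float:
        return self.x

    def act(self, dx: float) -> None:
        self.x += 0.9 * dx  # imperfect actuation, which is why feedback is needed


robot = ToyRobot()
plan = [SubTask("approach stairs", 1.0), SubTask("climb stairs", 2.5)]
ok = run_closed_loop(plan, robot.observe, robot.act)
```

Because the robot in this toy undershoots every command by 10%, an open-loop plan would drift; re-observing after each action lets the loop correct the error before moving to the next sub-task.
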
Probability Annealing Selection (PAS)
- Reinforcement Learning-Based Locomotion Control: The PAS method trains the control policy using reinforcement learning. It tackles the sim-to-real transfer problem by gradually annealing the use of privileged information during training, thus ensuring robustness in real-world deployments.
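
The idea of gradually annealing the use of privileged information can be sketched as a probability schedule plus a per-step selection. This is an illustrative sketch under stated assumptions, not the paper's implementation: the linear decay, the function names `selection_probability` and `select_input`, and the two-element feature vectors are all hypothetical, and the summary does not specify the actual schedule or what exactly is selected.

```python
import random
from typing import List, Sequence


def selection_probability(step: int, total_steps: int) -> float:
    """Probability of using privileged input; the linear decay is an assumption."""
    return max(0.0, 1.0 - step / total_steps)


def select_input(
    privileged: Sequence[float],
    estimated: Sequence[float],
    p_privileged: float,
    rng: random.Random,
) -> List[float]:
    """Pick privileged or onboard-estimated features for this training step."""
    source = privileged if rng.random() < p_privileged else estimated
    return list(source)


rng = random.Random(0)
total = 1000
used_privileged = []
for step in range(total + 1):
    p = selection_probability(step, total)
    # Placeholder features: 1.0 marks privileged input, 0.0 marks estimated input.
    chosen = select_input([1.0, 1.0], [0.0, 0.0], p, rng)
    used_privileged.append(chosen[0] == 1.0)

# Fraction of steps that used privileged input early vs. late in training.
early = sum(used_privileged[:100]) / 100
late = sum(used_privileged[-100:]) / 100
```

Early in training the policy sees privileged input almost always, and by the end it relies only on what it can estimate onboard, which is the condition it faces at deployment.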

Experimental Results

Experiments were conducted on a Unitree A1 quadruped robot equipped with an NVIDIA Jetson Xavier NX. The trials covered varied routes across stairs, ramps, gaps, and doors. CAS demonstrated better performance and robustness than other methods such as NoMaD, ViNT, and LSTM-based baselines. Some noteworthy results include:

Stairs: CAS achieved an overall success rate of 60%, whereas NoMaD and LSTM scored 0%.
Gaps: CAS attained a 45% overall success rate, outperforming the nearest competitor by a significant margin.

Low-Level Locomotion Control

The PAS control policy was tested in both simulation and real-world settings. The metrics were success rate and velocity tracking ratio. Key findings:

Simulation Results: The PAS policy achieved an average success rate of 85.31%, surpassing previous methods such as RMA and IL.
Real-World Results: In real-world tests involving stairs, ramps, rubble, grassland, and unseen obstacles, the PAS policy consistently achieved high success rates, including 100% on stairs and 100% on grassland.
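
For readers who want to reproduce this kind of evaluation, the two metrics can be computed as in the sketch below. The summary does not define the velocity tracking ratio, so the definition used here (mean achieved forward speed divided by commanded speed) is an assumption, and the trial data are made-up illustrative values, not results from the paper.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of trials that ended in success."""
    return sum(outcomes) / len(outcomes)


def velocity_tracking_ratio(achieved: Sequence[float], commanded: float) -> float:
    """Assumed definition: mean achieved forward speed / commanded speed."""
    return (sum(achieved) / len(achieved)) / commanded


# Made-up illustrative data, not taken from the paper.
trials = [True, True, False, True]
rate = success_rate(trials)
ratio = velocity_tracking_ratio([0.25, 0.5, 0.75], commanded=1.0)
```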

Implications and Future Directions

The implications of this research are multifaceted. CAS demonstrates that integrating high-level vision-language models with a robust low-level control policy can significantly enhance the navigational capabilities of quadruped robots. These findings may have practical applications in industries where autonomous navigation in complex environments is crucial, such as search and rescue operations, inspection tasks, and agricultural robotics.

Theoretically, the successful implementation of a VLM-based task decomposition and motion planning system signals a substantial step forward. It highlights the potential of VLMs to contribute beyond traditional vision tasks, extending into dynamic and adaptable robotic navigation.

Future developments could see the integration of advanced perception and localization methods to address the current limitations associated with high-frequency vibrations affecting IMU data. Additionally, incorporating memory mechanisms like topological or semantic maps could further enhance the system's reliability and efficiency in diverse settings.

Overall, the paper marks an important advance in robotics, showing how foundation models can be applied in practice to real-world problems in quadruped robot navigation. CAS lays solid groundwork for future exploration and optimization in this domain.