RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation (2505.01709v2)

Published 3 May 2025 in cs.RO, cs.AI, and cs.CV

Abstract: Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

RoBridge: Bridging Cognition and Execution for Robotic Manipulation

The paper "RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation" addresses a fundamental challenge in robotics: enabling robots to operate effectively in open-ended environments with diverse tasks. Despite advances in NLP and multimodal models enhancing robots' comprehension of complex instructions, the procedural and declarative skill dilemmas remain significant hurdles. Traditional methods usually involve compromises between cognitive and executive capabilities, limiting robots' proficiency in dynamic settings.

RoBridge proposes an architecture that integrates cognitive reasoning with physical execution via three main components: a high-level cognitive planner (HCP), an invariant operable representation (IOR), and a generalist embodied agent (GEA). The system bridges cognition and execution by preserving the declarative skills of vision-language models (VLMs) while drawing on procedural skills acquired through reinforcement learning.

Architecture and Methodology

RoBridge's architecture is articulated in three layers:

High-Level Cognitive Planner (HCP): This component uses a large vision-language model (VLM) alongside APIs to split tasks into primitive actions. It handles high-level planning, so the VLM generates intuitive symbolic representations rather than direct motor commands, which addresses the declarative skill dilemma.
Invariant Operable Representation (IOR): IOR is produced by the HCP and serves as an abstraction bridging cognitive and executive domains. It includes masked depth and third-view masks of involved objects, which are invariant under environmental changes. This representation allows generalization across different settings, alleviating procedural skill dilemmas.
Generalist Embodied Agent (GEA): The GEA converts IOR into execution actions through reinforcement and imitation learning, ensuring the completion of tasks despite external interruptions. The adaptive sampling DAgger and domain randomization reinforce its robustness across varied domains.
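
The data flow through the three layers can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `IOR`, the functions `mask_depth`, `plan`, and `act`, the fixed list of primitives, and the centroid-based "policy" are all hypothetical stand-ins chosen for illustration. A real HCP would query a VLM and segmentation APIs, and a real GEA would be a learned policy. The sketch only shows how a masked depth image and an object mask can act as the interface between planner and agent.

```python
from dataclasses import dataclass
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[int]]


@dataclass
class IOR:
    """Hypothetical container for one primitive action's representation."""
    primitive: str      # assumed label, e.g. "grasp"
    masked_depth: Grid  # depth kept only on the involved object
    mask: Mask          # third-view binary mask of that object


def mask_depth(depth: Grid, mask: Mask) -> Grid:
    """Zero out depth everywhere except on the masked object."""
    return [[d if m else 0.0 for d, m in zip(drow, mrow)]
            for drow, mrow in zip(depth, mask)]


def plan(instruction: str, depth: Grid, mask: Mask) -> List[IOR]:
    """Stand-in for the HCP: returns a fixed toy decomposition.

    The instruction is ignored here; a real planner would use a VLM
    to choose primitives and the objects they involve.
    """
    primitives = ["reach", "grasp", "lift"]  # assumed, not from the paper
    masked = mask_depth(depth, mask)
    return [IOR(p, masked, mask) for p in primitives]


def act(ior: IOR) -> Tuple[float, float, float]:
    """Stand-in for the GEA: target the object's centroid and mean depth."""
    pixels = [(r, c)
              for r, row in enumerate(ior.mask)
              for c, m in enumerate(row) if m]
    if not pixels:
        raise ValueError("mask selects no pixels")
    n = len(pixels)
    row_c = sum(r for r, _ in pixels) / n
    col_c = sum(c for _, c in pixels) / n
    depth_c = sum(ior.masked_depth[r][c] for r, c in pixels) / n
    return (row_c, col_c, depth_c)


depth_image = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
object_mask = [[0, 1, 1],
               [0, 0, 1]]
steps = plan("pick up the block", depth_image, object_mask)
targets = [act(step) for step in steps]
```

Because the agent in this sketch sees only the masked depth and the mask, changes to background pixels do not affect its output, which is the intuition behind calling the representation invariant.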

Results and Implications

RoBridge reports significant improvement over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This is a meaningful step toward robotic systems that integrate cognitive reasoning with physical execution.

The architecture's modularity and the generality of IOR suggest practical adaptability across diverse tasks and environments. Large-scale pre-trained models give the system better comprehension of the complex instructions that are common in real-world scenarios.

Future Perspectives

While RoBridge significantly advances robotic manipulation, enhancements in its visual understanding and action precision across modules will further refine it. Expanding the system's capabilities to interact with varied object shapes, sizes, and material properties will bolster its utility in real-world applications.

Overall, RoBridge presents a new paradigm in robotic manipulation that combines the strengths of multimodal models and reinforcement learning. This combination has the potential to change how robots perceive, plan, and act in complex environments.

Authors (7)

Kaidong Zhang
Rongtao Xu
Pengzhen Ren
Junfan Lin
Hefeng Wu
Liang Lin
Xiaodan Liang
