Interaction between high-level agents and low-level vision-language-action executors
Develop interaction methods between RoboMemory's high-level planning agent (the Closed-Loop Planning Module) and low-level vision-language-action executors (e.g., pi_0). These methods should use modalities beyond natural language to communicate fine-grained action details, such as grasp points, that are difficult to specify through language-only instructions.
References
A key unsolved problem in current hierarchical agent research for embodied tasks, including ours, is the interaction between high-level agents and low-level executors (e.g., VLAs). Most existing frameworks pass actions from the high-level agent to the executor exclusively as natural-language instructions. However, some action details are difficult to describe in language; other modalities (e.g., vision) can represent details such as grasp points more effectively.
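As a rough illustration of the kind of interface this problem asks for, the sketch below shows a planner emitting a structured action message that carries both a language instruction and a visual annotation (a pixel-space grasp point), plus a small helper that renders the grasp point into the observation so an image-conditioned executor can consume it without any API change. All names here (`MultimodalAction`, `LowLevelExecutor`, `annotate_observation`) are hypothetical; they are not RoboMemory or pi_0 APIs, just one possible shape of the design space.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Tuple

import numpy as np


@dataclass
class MultimodalAction:
    """Hypothetical action message from the high-level planner to the executor.

    Besides the usual language instruction, it can carry visual annotations
    (e.g., a pixel-space grasp point or a region-of-interest mask) that are
    hard to express in language alone.
    """
    instruction: str                                  # e.g., "pick up the red mug"
    image: np.ndarray                                 # current RGB observation (H, W, 3)
    grasp_point_px: Optional[Tuple[int, int]] = None  # (u, v) pixel the planner wants grasped
    roi_mask: Optional[np.ndarray] = None             # optional binary mask of the target object


class LowLevelExecutor(Protocol):
    """Interface a VLA-style executor would implement in this sketch."""

    def execute(self, action: MultimodalAction) -> bool:
        """Run the action to completion; return True on success."""
        ...


def annotate_observation(action: MultimodalAction) -> np.ndarray:
    """Render the planner's grasp point into the image.

    Drawing the intent directly into the pixels is one simple way to pass
    non-linguistic details to a policy that only accepts (image, text) input.
    """
    img = action.image.copy()
    if action.grasp_point_px is not None:
        u, v = action.grasp_point_px
        # Paint a small red square around the requested grasp point.
        img[max(v - 3, 0): v + 4, max(u - 3, 0): u + 4] = (255, 0, 0)
    return img


if __name__ == "__main__":
    obs = np.zeros((224, 224, 3), dtype=np.uint8)
    msg = MultimodalAction(
        instruction="pick up the red mug by its handle",
        image=obs,
        grasp_point_px=(120, 96),
    )
    annotated = annotate_observation(msg)
    print("marker pixels set:", int((annotated[..., 0] == 255).sum()))
```

Rendering the grasp point into the observation is only one option; alternatives include passing it as an extra conditioning input or keypoint channel to the executor, which is exactly the design space this problem leaves open.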