ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations (2505.10911v1)

Published 16 May 2025 in cs.RO

Abstract: We introduce ReWiND, a framework for learning robot manipulation tasks solely from language instructions without per-task demonstrations. Standard reinforcement learning (RL) and imitation learning methods require expert supervision through human-designed reward functions or demonstrations for every new task. In contrast, ReWiND starts from a small demonstration dataset to learn: (1) a data-efficient, language-conditioned reward function that labels the dataset with rewards, and (2) a language-conditioned policy pre-trained with offline RL using these rewards. Given an unseen task variation, ReWiND fine-tunes the pre-trained policy using the learned reward function, requiring minimal online interaction. We show that ReWiND's reward model generalizes effectively to unseen tasks, outperforming baselines by up to 2.4x in reward generalization and policy alignment metrics. Finally, we demonstrate that ReWiND enables sample-efficient adaptation to new tasks, beating baselines by 2x in simulation and improving real-world pretrained bimanual policies by 5x, taking a step towards scalable, real-world robot learning. See website at https://rewind-reward.github.io/.

Essay on "ReWiND: Language-Guided Rewards Teach Robot Policies without New Demonstrations"

The paper presents ReWiND, a framework that uses language-guided rewards to teach robot policies without new demonstrations. It targets robot manipulation, learning new tasks from language instructions alone and eliminating the need for per-task demonstrations. Standard reinforcement learning (RL) and imitation learning methods require expert supervision for every new task, either through hand-designed reward functions or through demonstrations. ReWiND instead starts from a small demonstration dataset, which it uses to learn a language-conditioned reward function and to pre-train a language-conditioned policy with offline RL.

The ReWiND framework consists of three phases. First, it trains a reward function on a small dataset of demonstrations; the function conditions on a language instruction and estimates a reward at each timestep. The reward model is trained to generalize to unseen tasks and to give dense, informative feedback even when the policy fails. A distinctive component is video rewinding, which generates failure trajectories from successful demonstrations and helps the reward model deliver consistent feedback during online policy adaptation.
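
The rewinding idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `rewind_video`, the parameters `split` and `k`, and the linear progress targets are all assumptions made for the example. It plays a successful demonstration forward to a chosen frame, then appends the preceding frames in reverse, so the clip looks as if the robot were undoing its progress.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def rewind_video(
    frames: Sequence[Frame], split: int, k: int
) -> Tuple[List[Frame], List[float]]:
    """Build a synthetic failure clip from a successful demonstration.

    The clip plays forward to index `split`, then backward for `k` frames.
    Each frame gets a progress target that is linear in its original index
    (an assumption of this sketch), so targets rise and then fall.
    """
    n = len(frames)
    if n < 2:
        raise ValueError("need at least two frames")
    if not 1 <= split <= n - 1:
        raise ValueError("split must be in [1, len(frames) - 1]")
    if not 1 <= k <= split:
        raise ValueError("k must be in [1, split]")

    forward = list(range(split + 1))                       # 0, 1, ..., split
    backward = list(range(split - 1, split - k - 1, -1))   # split-1, ..., split-k
    order = forward + backward

    rewound = [frames[i] for i in order]
    progress = [i / (n - 1) for i in order]
    return rewound, progress


if __name__ == "__main__":
    clip, targets = rewind_video(list("abcdef"), split=3, k=2)
    print(clip)
    print(targets)
```

In practice `split` and `k` would presumably be sampled at random for each demonstration, giving the reward model many examples in which progress decreases.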

In the second phase, ReWiND labels the demonstration dataset with this reward function and pre-trains a language-conditioned policy with offline RL. Pre-training requires no new task-specific demonstrations and uses expectile regression to fit the value functions.
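
For readers unfamiliar with expectile regression, the sketch below shows the standard asymmetric squared loss, the form used by implicit Q-learning. The essay does not say which offline RL algorithm or expectile value ReWiND uses, so the function name `expectile_loss` and the default `tau=0.7` are assumptions for illustration only. With `tau` above 0.5, under-estimates of the target are penalized more than over-estimates, which pushes the fitted value toward the upper range of the targets.

```python
from typing import Sequence


def expectile_loss(
    targets: Sequence[float], predictions: Sequence[float], tau: float = 0.7
) -> float:
    """Mean asymmetric squared error |tau - 1(u < 0)| * u**2 with u = target - prediction."""
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must be strictly between 0 and 1")
    if len(targets) != len(predictions) or len(targets) == 0:
        raise ValueError("targets and predictions must be non-empty and equal length")

    total = 0.0
    for target, prediction in zip(targets, predictions):
        u = target - prediction
        # Positive errors (prediction too low) get weight tau,
        # negative errors (prediction too high) get weight 1 - tau.
        weight = tau if u > 0 else 1.0 - tau
        total += weight * u * u
    return total / len(targets)


if __name__ == "__main__":
    print(expectile_loss([1.0], [0.0]))  # prediction too low
    print(expectile_loss([0.0], [1.0]))  # prediction too high
```

Setting `tau=0.5` recovers half of the ordinary mean squared error, which is a quick sanity check on any implementation.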

In the final phase, ReWiND fine-tunes the pre-trained policy on an unseen task, using rewards that the learned reward model assigns to online rollouts. This makes adaptation to new tasks sample-efficient, requiring substantially less online interaction than the baseline approaches.
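
The reward-labeling step in this phase can be pictured as follows. This is a hedged sketch rather than the paper's code: `label_rollout`, `RewardFn`, and `toy_reward` are hypothetical names, observations are reduced to single floats, and the toy reward simply stands in for the learned language-conditioned model. Each transition in a rollout is scored by the reward function from the frames seen so far and the instruction, then stored for the RL update.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: (observations so far, instruction) -> scalar reward.
RewardFn = Callable[[Sequence[float], str], float]
# (observation, action, learned reward, next observation)
Transition = Tuple[float, int, float, float]


def label_rollout(
    observations: Sequence[float],
    actions: Sequence[int],
    instruction: str,
    reward_fn: RewardFn,
) -> List[Transition]:
    """Attach a learned reward to every transition of one online rollout."""
    if len(observations) != len(actions) + 1:
        raise ValueError("expected one more observation than actions")

    labeled: List[Transition] = []
    for t, action in enumerate(actions):
        history = observations[: t + 2]  # everything up to and including the next observation
        reward = reward_fn(history, instruction)
        labeled.append((observations[t], action, reward, observations[t + 1]))
    return labeled


def toy_reward(history: Sequence[float], instruction: str) -> float:
    """Stand-in for the learned model: returns the latest observation and ignores the text."""
    return float(history[-1])


if __name__ == "__main__":
    for transition in label_rollout([0.0, 0.5, 1.0], [1, 1], "open the door", toy_reward):
        print(transition)
```

No environment reward or new demonstration appears anywhere in the loop; the only task-specific input is the instruction string.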

The results indicate that ReWiND performs well across several metrics. The reward model generalizes to unseen tasks, outperforming baselines by up to 2.4x in reward generalization and policy alignment metrics. In policy adaptation, ReWiND beats baselines by 2x in simulation (Meta-World) and improves real-world pre-trained bimanual policies by 5x.

Generating additional instructions with large language models (LLMs) makes ReWiND more robust to diverse language inputs. The reward model's outputs correlate with task progress, and training on augmented language instructions adds flexibility. By reducing dependence on labor-intensive supervision, the approach has implications for scalable real-world robot learning.

For the broader field of AI and robotics, ReWiND signals a methodological shift towards learning new tasks automatically, without extensive demonstrations as a prerequisite. Practically, this research can help deploy robots in variable environments more efficiently, reducing time and cost. Theoretically, it underscores the growing importance of integrating LLMs with RL to develop autonomous systems that can learn from limited human input.

Looking forward, potential developments building on ReWiND might involve extending the capabilities of the reward model by integrating broader datasets and pre-trained models to cover more diverse robotic applications. Additionally, integrating more advanced policy architectures capable of handling complex tasks could further enhance the practical deployment of ReWiND in dynamic and unpredictable environments.