Residual Off-Policy RL for Finetuning Behavior Cloning Policies (2509.19301v2)

Published 23 Sep 2025 in cs.RO and cs.LG

Abstract: Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world.

Summary

  • The paper introduces ResFiT, a residual RL method that fine-tunes immutable behavior cloning policies for enhanced robotic visuomotor control.
  • It combines behavior cloning with action chunking and off-policy RL to achieve superior sample efficiency and near-perfect success rates across tasks.
  • Experiments in simulation and real-world settings demonstrate significant performance gains in complex manipulation tasks, paving the way for autonomous robotic control.

Residual Off-Policy RL for Finetuning Behavior Cloning Policies

Introduction

This paper addresses the complementary challenges of behavior cloning (BC) and reinforcement learning (RL) for training visuomotor control policies on robots. BC requires extensive manual data collection and saturates in performance, while RL struggles with sample inefficiency and safety when trained directly in the real world, particularly on systems with many degrees of freedom (DoF). The authors combine the two through a residual learning framework, introducing the off-policy residual fine-tuning (ResFiT) method. ResFiT treats BC policies as frozen, black-box bases and applies lightweight, sample-efficient off-policy RL to learn per-step corrections, and it is demonstrated in both simulation and the real world, including real-world RL training on a humanoid robot with dexterous hands.

Methodology

The presented method, ResFiT, employs a two-phase process: an initial BC phase followed by off-policy RL-based fine-tuning. The approach involves:

  1. Behavior Cloning (BC) with Action Chunking: BC policies are trained on demonstration data to predict short sequences (chunks) of actions, which shortens the effective task horizon and mitigates compounding errors in imitation learning.
  2. Off-Policy Residual Reinforcement Learning (RL): The core novelty lies in training a residual policy that refines the base actions from the frozen BC policy rather than learning from scratch. Sample efficiency comes from adapting standard off-policy Q-learning to this residual setting (see the sketch after this list).
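As a minimal sketch of the residual correction, assuming a bounded additive residual on top of a black-box base action; the dimensions, network, and scale factor below are illustrative placeholders, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Lightweight per-step residual correction on top of a frozen, black-box BC base.

    obs_dim, act_dim, hidden sizes, and residual_scale are illustrative
    placeholders, not the paper's exact architecture or hyperparameters.
    """

    def __init__(self, obs_dim: int, act_dim: int, residual_scale: float = 0.1):
        super().__init__()
        self.residual_scale = residual_scale
        # The residual conditions on the observation and the base action it corrects.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),  # bounded correction in [-1, 1]
        )

    def forward(self, obs: torch.Tensor, base_action: torch.Tensor) -> torch.Tensor:
        delta = self.net(torch.cat([obs, base_action], dim=-1))
        # Executed action = black-box base action + small bounded correction.
        return base_action + self.residual_scale * delta
```

Because the base policy predicts an action chunk, such a correction can be applied per step to each base action as it is executed; bounding the correction keeps early exploration close to the base behavior, which matters for safety on real hardware.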

The paper includes a complete algorithm for ResFiT, detailing how the BC and RL phases interact during fine-tuning; a sketch of the fine-tuning loop is given below Figure 1.

Figure 1: Simulation tasks from Robomimic and DexMimicGen, spanning single-arm manipulation as well as bimanual coordination tasks.
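To make the interplay of the two phases concrete, the following is a hedged sketch of the phase-2 fine-tuning loop under a gym-style environment interface: the BC base stays frozen, transitions with sparse binary rewards are stored in a replay buffer, and an off-policy actor-critic update (e.g., SAC-style) is applied to the residual policy. The names `env`, `base_policy`, `residual_policy`, and `update_fn` are placeholders rather than the paper's actual API.

```python
import collections
import random

def finetune_residual(env, base_policy, residual_policy, update_fn,
                      num_steps=100_000, batch_size=256, warmup=1_000):
    """Phase-2 sketch: off-policy residual RL on top of a frozen BC base policy.

    update_fn is a placeholder that applies one off-policy actor-critic step
    (e.g., SAC-style) to the residual policy from a sampled batch of transitions.
    """
    replay = collections.deque(maxlen=1_000_000)
    obs = env.reset()
    for _ in range(num_steps):
        base_action = base_policy(obs)                 # black-box BC action (frozen)
        action = residual_policy(obs, base_action)     # base action + residual correction
        next_obs, reward, done, _ = env.step(action)   # sparse binary reward signal
        replay.append((obs, base_action, action, reward, next_obs, done))
        obs = env.reset() if done else next_obs

        if len(replay) >= warmup:
            batch = random.sample(replay, batch_size)
            update_fn(batch)                           # off-policy update of the residual
    return residual_policy
```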

Experimental Evaluation

Simulation Results

The authors conducted extensive experiments across tasks of varying complexity, from single-arm to bimanual manipulation, validating the efficacy of ResFiT. In these evaluations, ResFiT significantly outperformed existing methods, reaching near-perfect success rates on several challenging tasks. Key findings include:

  • Sample Efficiency: ResFiT demonstrated substantially better sample efficiency than on-policy methods such as PPO, reaching comparable performance with far fewer environment interactions, which makes it more viable for real-world training.
  • Task Performance: Across multiple tasks, ResFiT consistently achieved higher success rates than off-policy RL baselines such as RLPD and strong BC methods (Figure 2).

    Figure 2: Success rates of different approaches on the simulation tasks, showing ResFiT converging to high-performing policies on all tasks.

Real-World Results

In real-world testing on bimanual dexterous manipulation tasks with a humanoid robot equipped with dexterous hands, ResFiT substantially increased task success rates. Tasks such as picking up and handing over objects showcased the method's potential for practical deployment (Figure 3).

Figure 3: Real-world results of applying ResFiT, where the residual RL approach shows a significant boost in performance over the base BC policy.

Discussion and Future Work

The approach demonstrates a practical pathway for combining the strengths of BC and RL, enabling RL deployment in settings with sparse rewards and high-DoF systems. A noted limitation is that the frozen base policy constrains the strategies the residual policy can discover; the paper proposes future work on adaptive base policies and on distilling learned improvements back into the base policy. Fully autonomous real-world learning, without human oversight, also remains an open challenge.

Conclusion

The paper advances the field of robotics by presenting a scalable method to enhance base BC policies using residual RL in off-policy settings. The proposed \textit{ResFiT} method achieves state-of-the-art results in both simulation and real-world tasks, paving the way for more autonomous, real-world capable RL applications in robotics.
