Object-Centric Residual Reinforcement Learning
- Object-centric residual RL combines object-centric probabilistic movement primitives (ProMPs) with SAC-optimized residual policies to correct execution errors in complex manipulation tasks.
- It employs a variance-based gating mechanism that activates residual actions only in low-variance, task-critical segments to enhance precision and sample efficiency.
- Empirical evaluations on 7-DoF robots demonstrate that variance-guided residual correction leads to 60–80% success in high-precision insertion tasks within 50 episodes.
Object-centric residual reinforcement learning (OCRRL) augments object-centric skill representations with reinforcement learning-driven residual policies, enabling robust, high-precision robotic manipulation even under limited model fidelity or controller accuracy. By using object-centric probabilistic movement primitives (ProMPs) as base policies and learning corrective residuals via deep reinforcement learning, typically with off-policy algorithms such as Soft Actor-Critic (SAC), OCRRL achieves both sample efficiency and adaptability for complex, contact-rich, object-relative tasks.
1. Object-centric Probabilistic Movement Primitives
Object-centric ProMPs model expert demonstrations by representing distributions over entire trajectories relative to manipulated objects. Let $\tau = \{y_t\}_{t=1}^{T}$ denote a trajectory (in pose space, e.g., position or full Cartesian pose). Each trajectory is generated as

$$y_t = \Phi_t^\top w + \epsilon_y, \qquad \epsilon_y \sim \mathcal{N}(0, \Sigma_y),$$

where $\Phi_t$ is a vector of time-dependent basis functions, $w$ is the weight vector parameterizing the primitive, and $\epsilon_y$ captures sensorimotor noise.
ProMPs employ a Gaussian prior over $w$, $p(w) = \mathcal{N}(w \mid \mu_w, \Sigma_w)$, yielding a marginal trajectory distribution

$$p(\tau) = \mathcal{N}\!\big(\tau \mid \Psi \mu_w,\; \Psi \Sigma_w \Psi^\top + \Sigma_y\big),$$

where $\Psi = [\Phi_1, \ldots, \Phi_T]^\top$ stacks all time steps.
Skill conditioning (e.g., on a goal state $y^{*}$ with covariance $\Sigma^{*}$) is performed by Bayesian inference on the weights, resulting in updated means and covariances $(\mu_w^{\text{new}}, \Sigma_w^{\text{new}})$. In object-centric ProMPs, all frames are expressed relative to the key manipulated object, facilitating generalization over variable initial configurations without losing the geometric intent of the demonstrated skill.
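As a concrete illustration of this representation, the following minimal Python sketch fits a ProMP to demonstrations and conditions it on a goal observation. The RBF basis, the regularization constant, and the helper names (`rbf_basis`, `block_phi`, `fit_promp`, `condition_on_goal`) are illustrative assumptions, not the exact formulation used in any particular implementation.

```python
import numpy as np
from scipy.linalg import block_diag

def rbf_basis(t, n_basis=20, width=0.02):
    """Normalized radial basis features evaluated at phase t in [0, 1]."""
    centers = np.linspace(0.0, 1.0, n_basis)
    phi = np.exp(-((t - centers) ** 2) / (2.0 * width))
    return phi / phi.sum()

def block_phi(t, n_basis, dim):
    """Block-diagonal basis matrix mapping stacked weights to a dim-dimensional observation."""
    phi = rbf_basis(t, n_basis)[None, :]          # (1, n_basis)
    return block_diag(*([phi] * dim))             # (dim, n_basis * dim)

def fit_promp(demos, n_basis=20, reg=1e-6):
    """Estimate the weight mean/covariance from a list of (T, D) demonstration trajectories."""
    weights = []
    for traj in demos:
        T, D = traj.shape
        Phi = np.stack([rbf_basis(t, n_basis) for t in np.linspace(0.0, 1.0, T)])  # (T, n_basis)
        # Ridge regression of each dimension onto the basis; w has shape (n_basis, D)
        w = np.linalg.solve(Phi.T @ Phi + reg * np.eye(n_basis), Phi.T @ traj)
        weights.append(w.ravel(order="F"))        # stack per-dimension weights -> (n_basis * D,)
    W = np.stack(weights)
    return W.mean(axis=0), np.cov(W, rowvar=False)

def condition_on_goal(mu_w, Sigma_w, Phi_goal, y_goal, Sigma_goal):
    """Bayesian conditioning of the weight distribution on a desired observation y_goal."""
    S = Phi_goal @ Sigma_w @ Phi_goal.T + Sigma_goal
    K = Sigma_w @ Phi_goal.T @ np.linalg.inv(S)   # Kalman-style gain
    mu_new = mu_w + K @ (y_goal - Phi_goal @ mu_w)
    Sigma_new = Sigma_w - K @ Phi_goal @ Sigma_w
    return mu_new, Sigma_new
```

For a goal-conditioned skill, `Phi_goal = block_phi(1.0, n_basis, D)` conditions the final time step on the desired object-relative pose.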
2. Residual Reinforcement Learning atop Nominal Controllers
While ProMPs encode distributional expert priors, their execution on real robots, especially under low-gain impedance control, often falls short in high-precision portions of the trajectory (such as fine insertions or contact-rich manipulation), leading to systematic residual errors. Residual reinforcement learning (RRL) remedies this by learning a corrective policy $\pi_\theta$ that outputs residual Cartesian actions to adjust both translation and orientation, $a_t = (\Delta x_t, \Delta q_t) \sim \pi_\theta(\cdot \mid s_t)$, where $s_t$ is the robot's state in the object's reference frame.
The final desired set-point is then synthesized via

$$x_t^{\text{des}} = x_t^{\text{ProMP}} + g_t\,\Delta x_t, \qquad q_t^{\text{des}} = \begin{cases} \Delta q_t \otimes q_t^{\text{ProMP}} & \text{if } g_t = 1,\\ q_t^{\text{ProMP}} & \text{if } g_t = 0, \end{cases}$$

where $\otimes$ denotes quaternion multiplication and the gating function $g_t \in \{0, 1\}$ selectively enables the residual in critical low-variance segments.
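A minimal sketch of this set-point synthesis, assuming SciPy's `Rotation` class (xyzw quaternion convention) for the quaternion product and a small clipping bound on the translational residual; the function name and limits are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def compose_setpoint(x_promp, q_promp, dx, dq, gate, max_dx=0.005):
    """Apply the gated residual (dx, dq) to the nominal ProMP set-point (x_promp, q_promp)."""
    if not gate:
        return x_promp, q_promp                                 # nominal command passes through
    x_des = x_promp + np.clip(dx, -max_dx, max_dx)              # small, bounded translational correction
    q_des = (R.from_quat(dq) * R.from_quat(q_promp)).as_quat()  # quaternion product dq ⊗ q_promp
    return x_des, q_des
```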
The residual policy is optimized off-policy using SAC, maximizing the entropy-regularized objective

$$J(\pi_\theta) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi_\theta}} \big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi_\theta(\cdot \mid s_t)\big) \big],$$

where $\alpha$ controls entropy regularization.
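For concreteness, the entropy term enters the actor update roughly as in the following PyTorch-style fragment; the `policy.sample` interface (returning a reparameterized action and its log-probability) and the twin critics are assumptions about the surrounding implementation.

```python
import torch

def sac_actor_loss(policy, q1, q2, states, alpha):
    """SAC actor loss: minimize E[alpha * log pi(a|s) - min(Q1, Q2)(s, a)]."""
    actions, log_pi = policy.sample(states)                       # reparameterized sample + log-prob
    q_min = torch.min(q1(states, actions), q2(states, actions))   # clipped double-Q estimate
    return (alpha * log_pi - q_min).mean()
```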
3. Gating Residual Learning via Demonstration Variability
A distinctive element of object-centric residual RL is the use of demonstration variability to localize exploration. Given the trajectory covariance at time $t$,

$$\Sigma_t = \Phi_t^\top \Sigma_w \Phi_t + \Sigma_y,$$

the standard deviation $\sigma_{t,d} = \sqrt{[\Sigma_t]_{dd}}$ along each dimension $d$ encodes expert confidence. Define the gating

$$g_t = \begin{cases} 1 & \text{if } \max_d \sigma_{t,d} < \epsilon,\\ 0 & \text{otherwise}, \end{cases}$$

with a small threshold $\epsilon$ (e.g., 3 mm). Residual learning and exploration are suppressed ($g_t = 0$) in high-variance, unconstrained segments and activated only in low-variance, task-critical regions (e.g., near insertion), enhancing sample efficiency and preventing unnecessary interventions where the nominal ProMP is already reliable.
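A sketch of this gate, reusing the `block_phi` helper and the fitted covariance from the earlier ProMP snippet; thresholding only the translational standard deviations at 3 mm is an assumption consistent with the threshold quoted above.

```python
import numpy as np

def variance_gate(t, Sigma_w, n_basis, dim, sigma_y=1e-6, eps=0.003):
    """Return True when the ProMP's predictive standard deviation at phase t is below eps."""
    Phi_t = block_phi(t, n_basis, dim)                       # (dim, n_basis * dim)
    Sigma_t = Phi_t @ Sigma_w @ Phi_t.T + sigma_y * np.eye(dim)
    sigma = np.sqrt(np.diag(Sigma_t))                        # per-dimension std at time t
    return bool(np.all(sigma[:3] < eps))                     # assumed: gate on x, y, z only
```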
4. End-to-End Training Workflow
The object-centric residual RL paradigm is implemented via a two-stage process:
- Offline ProMP Fitting
- Demonstrated trajectories are regressed onto basis functions to estimate the weight mean $\mu_w$ and covariance $\Sigma_w$.
- Online Residual RL
- At each episode, the ProMP is conditioned on the randomized start state, producing a nominal trajectory.
- At every time step, the covariance-based gate decides whether to augment the nominal command with the RL residual.
- Residual actions, combined with nominal commands, are sent to the robot via low-level impedance control.
- Experiences are stored in a replay buffer; SAC is optimized incrementally as soon as the buffer's warm-up size is reached.
This regime ensures both data efficiency and safety, as the RL module explores exclusively within parts of the skill space where demonstration coverage is precise.
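Put together, the two-stage workflow might look roughly like the loop below. `SACAgent`, `env`, `promp_setpoint`, and the episode/step structure are placeholders standing in for whatever robot stack is actually used; only the ordering of conditioning, gating, composition, and incremental SAC updates is the point.

```python
import numpy as np

# Stage 1: offline ProMP fitting (see the Section 1 sketch)
mu_w, Sigma_w = fit_promp(demos, n_basis=N_BASIS)

agent = SACAgent(obs_dim=OBS_DIM, act_dim=7)      # hypothetical SAC agent (3 transl. + 4 quat. dims)
BUFFER_WARMUP = 1000

for episode in range(NUM_EPISODES):
    # Stage 2: condition the ProMP on this episode's randomized start/goal
    mu_ep, Sigma_ep = condition_on_goal(mu_w, Sigma_w,
                                        block_phi(1.0, N_BASIS, DIM), y_goal, Sigma_goal)
    state = env.reset()
    for t in np.linspace(0.0, 1.0, STEPS_PER_EPISODE):
        x_nom, q_nom = promp_setpoint(mu_ep, t)                      # nominal object-relative command
        gate = variance_gate(t, Sigma_ep, N_BASIS, DIM)
        if gate:
            dx, dq = np.split(agent.act(state), [3])                 # residual translation + quaternion
        else:
            dx, dq = np.zeros(3), np.array([0.0, 0.0, 0.0, 1.0])     # identity residual
        x_des, q_des = compose_setpoint(x_nom, q_nom, dx, dq, gate)
        next_state, reward, done = env.step_impedance(x_des, q_des)  # low-gain impedance control
        if gate:                                                     # learn only where the gate is open
            agent.buffer.add(state, np.concatenate([dx, dq]), reward, next_state, done)
        if len(agent.buffer) >= BUFFER_WARMUP:
            agent.update()                                           # incremental off-policy SAC update
        state = next_state
        if done:
            break
```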
5. Empirical Evaluation: 3D Block Insertion
Evaluation on a 7-DoF Franka Emika Panda, under a low-gain Cartesian impedance controller (1 kHz), demonstrates the strength of object-centric residual RL for high-precision insertion tasks. The state comprises the end-effector pose in the target block frame; actions set the desired Cartesian pose at 10 Hz.
A dense reward encourages proximity in both translation (Euclidean distance to the goal pose) and orientation (quaternion distance); a sketch of one such reward is given at the end of this section. The variance-based gating described above improves learning speed and ultimate performance compared to:
- ProMP-only (no residual): 0% success rate due to large tracking errors at low controller gains.
- Residual RL always active: Slow convergence, 20–30% success at 50 episodes.
- Distance-based gating: 0% success, indicating the necessity of a variance-based approach.
- Variance-based gating: Achieves 60–80% success within 50 episodes; final insertion precision under 3 mm.
This demonstrates that object-centric skill representations, combined with selectively gated residual policies, substantially advance sample-efficient learning in challenging manipulation domains.
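As a concrete (assumed) form of the dense reward used in this kind of evaluation, one can penalize the translational distance together with the relative rotation angle between the current and goal quaternions; the weights are illustrative.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def dense_reward(x, q, x_goal, q_goal, w_pos=1.0, w_rot=0.1):
    """Dense shaping reward: higher when closer to the goal pose in the target block frame."""
    pos_err = np.linalg.norm(x - x_goal)                                # Euclidean distance (m)
    rot_err = (R.from_quat(q_goal) * R.from_quat(q).inv()).magnitude()  # relative rotation angle (rad)
    return -(w_pos * pos_err + w_rot * rot_err)
```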
6. Methodological Insights and Design Choices
The critical insights underpinning OCRRL include the role of object-centric encoding: by referencing both nominal and corrective actions in the frame of the manipulated object, the system decouples task generalization from workspace specifics. Demonstration variance serves not only as a confidence metric but also as a principled mechanism to partition skill generalization from residual adaptation, thus limiting the RL search to regions where high accuracy is needed.
The RL residual is expressive over both Cartesian translation and orientation, and can compensate for execution errors or unmodeled dynamics—such as those arising from uncalibrated contacts—that cannot be covered by ProMP priors alone. This division of labor translates to higher precision and robustness.
7. Extensions and Related Paradigms
Subsequent work has explored object-centric residual RL in high-dimensional domains. For multi-fingered grasping (e.g., RESPRECT (Ceola et al., 26 Jan 2024)), a base policy is pre-trained in simulation across many objects; a residual SAC policy is then trained with the pre-trained policy's action and critic outputs as additional inputs, allowing sample-efficient adaptation to new grasps or new hardware while retaining object-centric features. The architecture exploits pre-trained Masked Autoencoders for vision and demonstrates successful sim-to-real transfer.
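The following fragment sketches the general pattern described here rather than the actual RESPRECT implementation: the residual agent observes the frozen base policy's output alongside the state, and the commanded action is the sum of base and (scaled) residual actions.

```python
import numpy as np

class ResidualOnBase:
    """Residual agent acting on top of a frozen, pre-trained base policy (illustrative sketch)."""

    def __init__(self, base_policy, residual_agent, act_scale=0.1):
        self.base = base_policy            # frozen policy pre-trained in simulation
        self.residual = residual_agent     # e.g., SAC agent, possibly with warm-started critics
        self.act_scale = act_scale         # keeps corrections close to the base behaviour

    def act(self, state):
        a_base = self.base.act(state)                     # nominal action from the base policy
        obs = np.concatenate([state, a_base])             # residual sees state + base action
        a_res = self.act_scale * self.residual.act(obs)   # bounded residual correction
        return a_base + a_res
```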
A plausible implication is that the object-centric formulation facilitates systematic generalization and swift online adaptation, whereas residualization anchors learning to physically plausible priors, promoting both stability and real-world feasibility. Warm-starting the critics from the pre-trained policy further accelerates learning by leveraging generic value structures.
OCRRL thus constitutes a principled avenue for combining geometric task structure, demonstration generalization, and deep RL adaptation for high-performance, data-efficient robotic manipulation.