Object-Centric Residual Reinforcement Learning

Updated 11 November 2025
  • Object-centric residual RL combines object-centric probabilistic movement primitives (ProMPs) with SAC-optimized residual policies to correct execution errors in complex manipulation tasks.
  • It employs a variance-based gating mechanism that activates residual actions only in low-variance, task-critical segments to enhance precision and sample efficiency.
  • Empirical evaluations on 7-DoF robots demonstrate that variance-guided residual correction leads to 60–80% success in high-precision insertion tasks within 50 episodes.

Object-centric residual reinforcement learning (OCRRL) is an approach that augments object-centric skill representations with reinforcement learning-driven residual policies, enabling robust, high-precision robotic manipulation even under limited model fidelity or controller accuracy. By leveraging object-centric probabilistic movement primitives (ProMPs) as base policies and learning corrective residuals via deep reinforcement learning—typically with off-policy algorithms such as Soft Actor-Critic (SAC)—OCRRL achieves both sample efficiency and adaptability for complex contact-rich, object-relative tasks.

1. Object-centric Probabilistic Movement Primitives

Object-centric ProMPs model expert demonstrations by representing distributions over entire trajectories relative to manipulated objects. Let $\tau=\{y_0,\dots,y_{T-1}\}$, with $y_t\in\mathbb{R}^d$, denote a trajectory in pose space (e.g., position or full Cartesian pose). Each trajectory is generated as

$$y_t = \Phi_t^\top w + \epsilon_t,\qquad \epsilon_t\sim\mathcal{N}(0,\Sigma_y)$$

where $\Phi_t\in\mathbb{R}^{N\times d}$ is a matrix of time-dependent basis functions, $w\in\mathbb{R}^N$ is the weight vector parameterizing the primitive, and $\Sigma_y$ captures sensorimotor noise.

ProMPs employ a Gaussian prior over $w$, $p(w\mid\Theta) = \mathcal{N}(w;\,\mu_w,\,\Sigma_w)$, yielding the marginal trajectory distribution

$$p(\tau) = \mathcal{N}\!\left(\tau;\; \bar\Phi \mu_w,\; \bar\Phi \Sigma_w \bar\Phi^\top + \mathbf{I}_T\otimes\Sigma_y\right)$$

where $\bar\Phi$ stacks the basis matrices over all time steps.

Skill conditioning (e.g., on a goal state $y_{t_0}$ with covariance $\bar\Sigma_y$) is performed by Bayesian inference on the weights, resulting in updated means and covariances. In object-centric ProMPs, all frames are expressed relative to the key manipulated object, facilitating generalization over variable initial configurations without losing the geometric intent of the demonstrated skill.
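
To make the trajectory model concrete, the following is a minimal 1-D sketch of a ProMP: an RBF basis, the marginal mean and standard deviation implied by $(\mu_w, \Sigma_w)$, and Gaussian conditioning on a via-point. The basis count, widths, and noise values are illustrative assumptions, not parameters taken from any specific paper.

```python
"""Minimal 1-D ProMP sketch (illustrative; basis count, lengthscale, and noise
values are assumptions, not taken from the referenced work)."""
import numpy as np

def rbf_basis(t, n_basis=15, width=0.02):
    """Normalized RBF features phi(t) in R^{n_basis} for phase t in [0, 1]."""
    centers = np.linspace(0.0, 1.0, n_basis)
    feats = np.exp(-(t - centers) ** 2 / (2.0 * width))
    return feats / feats.sum()

def promp_marginal(mu_w, Sigma_w, sigma_y2, T=100):
    """Mean and std of y_t = phi(t)^T w + eps at T evenly spaced phases."""
    mean, std = np.zeros(T), np.zeros(T)
    for k, t in enumerate(np.linspace(0.0, 1.0, T)):
        phi = rbf_basis(t, len(mu_w))
        mean[k] = phi @ mu_w
        std[k] = np.sqrt(phi @ Sigma_w @ phi + sigma_y2)
    return mean, std

def condition_on_point(mu_w, Sigma_w, t_star, y_star, sigma_obs2):
    """Gaussian conditioning of the weight distribution on y(t_star) = y_star."""
    phi = rbf_basis(t_star, len(mu_w))
    k = Sigma_w @ phi / (phi @ Sigma_w @ phi + sigma_obs2)   # Kalman-style gain
    mu_new = mu_w + k * (y_star - phi @ mu_w)
    Sigma_new = Sigma_w - np.outer(k, phi) @ Sigma_w
    return mu_new, Sigma_new

if __name__ == "__main__":
    n = 15
    mu_w, Sigma_w = np.zeros(n), 0.1 * np.eye(n)
    mu_w_c, Sigma_w_c = condition_on_point(mu_w, Sigma_w, t_star=1.0,
                                           y_star=0.05, sigma_obs2=1e-6)
    mean, std = promp_marginal(mu_w_c, Sigma_w_c, sigma_y2=1e-4)
    print(mean[-1], std[-1])  # conditioned trajectory ends near 0.05 with small std
```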

2. Residual Reinforcement Learning atop Nominal Controllers

While ProMPs encode distributional expert priors, their execution on real robots, especially under low-gain impedance control, often falls short in high-precision portions of the trajectory (such as fine insertions or contact-rich manipulation), leading to systematic residual errors. Residual reinforcement learning (RRL) remedies this by learning a corrective policy $\pi_\theta$ that outputs residual Cartesian actions $\Delta x_t$ adjusting both translation and orientation: $\Delta x_t = [\Delta p_t,\;\Delta \omega_t] \sim \pi_\theta(\cdot\mid s_t)$, where $s_t$ is the robot's state expressed in the object's reference frame.

The final desired set-point is then synthesized via

$$p_\text{des} = p_\text{nom}(t) + \beta(t)\,\Delta p_t,\qquad q_\text{des} = \operatorname{quat}\!\big(\beta(t)\,\Delta\omega_t\big)\circ q_\text{nom}(t)$$

where $\operatorname{quat}(\cdot)$ converts an axis-angle vector to a quaternion, $\circ$ denotes quaternion multiplication, and the gating function $\beta(t)\in\{0,1\}$ selectively enables the residual in critical low-variance segments.
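
A minimal sketch of this set-point composition follows, assuming SciPy's rotation utilities and an axis-angle (rotation-vector) parameterization of the orientation residual; the function and variable names are illustrative, not part of any published interface.

```python
"""Sketch of the set-point composition p_des, q_des from a nominal ProMP pose and
a gated residual action (uses scipy; names here are illustrative)."""
import numpy as np
from scipy.spatial.transform import Rotation as R

def compose_setpoint(p_nom, q_nom_xyzw, delta_p, delta_omega, beta):
    """Apply the residual only when the gate beta is 1.

    p_nom        : nominal position (3,)
    q_nom_xyzw   : nominal orientation, quaternion in (x, y, z, w) order
    delta_p      : translational residual (3,)
    delta_omega  : rotational residual as an axis-angle (rotation) vector (3,)
    beta         : gate in {0, 1} from the demonstration-variance criterion
    """
    p_des = p_nom + beta * delta_p
    # Convert the (gated) axis-angle residual to a rotation and left-compose it
    # with the nominal orientation, mirroring quat(beta * dw) o q_nom.
    q_res = R.from_rotvec(beta * delta_omega)
    q_des = (q_res * R.from_quat(q_nom_xyzw)).as_quat()
    return p_des, q_des

p_des, q_des = compose_setpoint(
    p_nom=np.array([0.4, 0.0, 0.2]),
    q_nom_xyzw=np.array([0.0, 0.0, 0.0, 1.0]),
    delta_p=np.array([0.002, -0.001, 0.0]),
    delta_omega=np.array([0.0, 0.0, 0.01]),
    beta=1,
)
```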

The residual policy is optimized off-policy using SAC, with critic loss

$$J_Q(\phi) = \mathbb{E}_{(s,a,r,s')}\!\left[\Big(Q_\phi(s,a) - \big(r + \gamma\,\mathbb{E}_{a'\sim\pi}\!\left[Q_{\bar\phi}(s',a')-\alpha\log\pi(a'\mid s')\right]\big)\Big)^{2}\right]$$

and policy loss

$$J_\pi(\theta) = \mathbb{E}_{s,a}\!\left[\alpha\log\pi(a\mid s) - Q_\phi(s,a)\right]$$

where $\alpha$ controls the entropy regularization.
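
For reference, the following is a compact PyTorch sketch of these two SAC losses for a residual policy, assuming a single critic, a fixed temperature $\alpha$, and a tanh-squashed Gaussian actor; the network sizes and the 6-D state/action dimensions are assumptions made for illustration.

```python
"""Minimal PyTorch sketch of the SAC critic and actor losses used for the residual
policy (single critic, fixed alpha; sizes and names are illustrative)."""
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * a_dim))
    def sample(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        a_raw = dist.rsample()                       # reparameterized sample
        a = torch.tanh(a_raw)                        # squash residual to [-1, 1]
        log_pi = (dist.log_prob(a_raw) - torch.log(1 - a.pow(2) + 1e-6)).sum(-1)
        return a, log_pi

def q_net(s_dim, a_dim, hidden=64):
    return nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, 1))

def sac_losses(policy, q, q_target, batch, gamma=0.99, alpha=0.2):
    s, a, r, s2, done = batch
    # Critic loss: squared TD error against the entropy-regularized target.
    with torch.no_grad():
        a2, log_pi2 = policy.sample(s2)
        target = r + gamma * (1 - done) * (
            q_target(torch.cat([s2, a2], -1)).squeeze(-1) - alpha * log_pi2)
    q_loss = (q(torch.cat([s, a], -1)).squeeze(-1) - target).pow(2).mean()
    # Actor loss: maximize Q plus entropy, i.e. minimize alpha*log_pi - Q.
    a_new, log_pi = policy.sample(s)
    pi_loss = (alpha * log_pi - q(torch.cat([s, a_new], -1)).squeeze(-1)).mean()
    return q_loss, pi_loss

# Example with random data (state = 6-D object-frame pose error, action = 6-D residual).
policy, q1, q1_t = GaussianPolicy(6, 6), q_net(6, 6), q_net(6, 6)
batch = (torch.randn(32, 6), torch.rand(32, 6) * 2 - 1, torch.randn(32),
         torch.randn(32, 6), torch.zeros(32))
q_loss, pi_loss = sac_losses(policy, q1, q1_t, batch)
```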

3. Gating Residual Learning via Demonstration Variability

A distinctive element of object-centric residual RL is the use of demonstration variability to localize exploration. Given the trajectory covariance at time $t$,

$$\Sigma_y(t) = \Phi_t^\top\Sigma_w\Phi_t + \Sigma_y,$$

the standard deviation along each dimension, $\sigma_i(t)$, encodes expert confidence. Define the gate

$$\beta(t) = \begin{cases} 1, & \exists\, i \ \text{s.t.}\ \sigma_i(t)\leq\epsilon \\ 0, & \text{otherwise} \end{cases}$$

with $\epsilon$ a small threshold (e.g., 3 mm). Residual learning and exploration are suppressed ($\beta=0$) in high-variance, unconstrained segments and activated only in low-variance, task-critical regions (e.g., near insertion), enhancing sample efficiency and preventing unnecessary interventions where the nominal ProMP is already reliable.
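
A short sketch of the gate computation from the ProMP marginal covariance; the shapes follow the notation above, while the threshold and example numbers are illustrative assumptions.

```python
"""Sketch of the variance-based gate beta(t) computed from the ProMP marginal
covariance (per-dimension std against a threshold; epsilon value is illustrative)."""
import numpy as np

def variance_gate(Phi_t, Sigma_w, Sigma_y, eps=0.003):
    """Return 1 if any dimension's demonstration std at phase t falls below eps.

    Phi_t   : basis matrix at time t, shape (n_basis, d)
    Sigma_w : weight covariance, shape (n_basis, n_basis)
    Sigma_y : observation-noise covariance, shape (d, d)
    eps     : std threshold, e.g. 0.003 m for a positional dimension
    """
    cov_t = Phi_t.T @ Sigma_w @ Phi_t + Sigma_y      # marginal trajectory covariance
    sigma = np.sqrt(np.diag(cov_t))                  # per-dimension std sigma_i(t)
    return int(np.any(sigma <= eps))

# Example: 10 basis functions, 3 positional dimensions.
rng = np.random.default_rng(0)
Phi_t = rng.random((10, 3))
gate = variance_gate(Phi_t, Sigma_w=1e-6 * np.eye(10), Sigma_y=1e-8 * np.eye(3))
```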

4. End-to-End Training Workflow

The object-centric residual RL paradigm is implemented via a two-stage process:

  1. Offline ProMP Fitting
    • Demonstrated trajectories are regressed onto the basis functions to estimate the weight means $\mu_w$ and covariances $\Sigma_w$.
  2. Online Residual RL
    • At each episode, the ProMP is conditioned on the randomized start state, producing a nominal trajectory.
    • At every time step, the covariance-based gate $\beta(t)$ decides whether to augment the nominal command with the RL residual.
    • Residual actions, combined with nominal commands, are sent to the robot via low-level impedance control.
    • Experiences are stored in a replay buffer; SAC updates begin as soon as the buffer reaches its warm-up size.

This regime ensures both data efficiency and safety, as the RL module explores exclusively within parts of the skill space where demonstration coverage is precise.
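
The workflow above can be summarized in the following structural sketch. The environment, agent, and ProMP helpers are passed in as callables, so every name here is a placeholder rather than a concrete API; it is intended only to show how the pieces fit together.

```python
"""Structural sketch of the two-stage OCRRL workflow (placeholders, not a real API)."""
import numpy as np

def train_ocrrl(env, agent, fit_promp, condition_promp, episodes=50,
                warmup=500, eps=0.003):
    """env: robot environment; agent: SAC agent with act()/update();
    fit_promp(): offline stage, returns ProMP parameters from demonstrations;
    condition_promp(params, start): returns a callable t -> (p_nom, q_nom, sigma_t)."""
    promp_params = fit_promp()                      # Stage 1: offline ProMP fitting
    buffer = []
    for _ in range(episodes):                       # Stage 2: online residual RL
        start = env.reset()                         # randomized start configuration
        nominal = condition_promp(promp_params, start)
        s, done = env.observe_object_frame(), False
        for t in np.linspace(0.0, 1.0, env.horizon):
            p_nom, q_nom, sigma_t = nominal(t)
            beta = int(np.any(sigma_t <= eps))      # variance-based gate
            a = agent.act(s) if beta else np.zeros(env.action_dim)
            s2, r, done = env.step_impedance(p_nom, q_nom, beta * a)
            buffer.append((s, a, r, s2, done))
            if len(buffer) >= warmup:               # SAC updates after warm-up
                agent.update(buffer)
            s = s2
            if done:
                break
    return agent
```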

5. Empirical Evaluation: 3D Block Insertion

Evaluation on a 7-DoF Franka Emika Panda, under a low-gain Cartesian impedance controller running at 1 kHz, demonstrates the strength of object-centric residual RL for high-precision insertion tasks. The state comprises the end-effector pose in the target block frame; actions set the desired Cartesian pose at 10 Hz.

A dense reward encourages proximity in both translation ($\|p_\text{des}-p_t\|$) and orientation (quaternion distance); a sketch of such a reward appears after the comparison below. The variance gating described above improves learning speed and final performance compared to:

  • ProMP-only (no residual): 0% success rate due to large tracking errors at low controller gains.
  • Residual RL always active: Slow convergence, 20–30% success at 50 episodes.
  • Distance-based gating: 0% success, indicating the necessity of a variance-based approach.
  • Variance-based gating: Achieves 60–80% success within 50 episodes; final insertion precision under 3 mm.
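
For concreteness, a dense reward of the kind described above might be implemented as follows; the goal-pose arguments and the weighting between translational and rotational terms are assumptions, not values reported in the evaluation.

```python
"""Illustrative dense reward combining translational distance and quaternion
distance; the weights w_p, w_q are assumptions, not values from the source."""
import numpy as np

def dense_reward(p_t, p_goal, q_t_xyzw, q_goal_xyzw, w_p=1.0, w_q=0.1):
    """Negative weighted sum of position error and orientation (geodesic) error."""
    pos_err = np.linalg.norm(p_goal - p_t)
    # Quaternion distance: angle of the relative rotation, via |<q1, q2>|.
    dot = np.clip(abs(np.dot(q_t_xyzw, q_goal_xyzw)), 0.0, 1.0)
    ang_err = 2.0 * np.arccos(dot)
    return -(w_p * pos_err + w_q * ang_err)

r = dense_reward(np.array([0.40, 0.00, 0.21]), np.array([0.40, 0.00, 0.20]),
                 np.array([0.0, 0.0, 0.0, 1.0]), np.array([0.0, 0.0, 0.05, 0.999]))
```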

This demonstrates that object-centric skill representations, combined with selectively gated residual policies, substantially advance sample-efficient learning in challenging manipulation domains.

6. Methodological Insights and Design Choices

The critical insights underpinning OCRRL include the role of object-centric encoding: by referencing both nominal and corrective actions in the frame of the manipulated object, the system decouples task generalization from workspace specifics. Demonstration variance serves not only as a confidence metric but also as a principled mechanism to partition skill generalization from residual adaptation, thus limiting the RL search to regions where high accuracy is needed.

The RL residual is expressive over both Cartesian translation and orientation, and can compensate for execution errors or unmodeled dynamics—such as those arising from uncalibrated contacts—that cannot be covered by ProMP priors alone. This division of labor translates to higher precision and robustness.

Subsequent work has explored object-centric residual RL in high-dimensional domains. For multi-fingered grasping (e.g., RESPRECT (Ceola et al., 26 Jan 2024)), a base policy $\pi_0$ is pre-trained in simulation across many objects; a residual SAC policy is then trained with the base policy's action and its critics as inputs, allowing sample-efficient adaptation to new grasps or hardware while retaining object-centric features. The architecture exploits pretrained Masked Autoencoders for vision and demonstrates successful sim-to-real transfer.

A plausible implication is that the object-centric formulation facilitates systematic generalization and swift online adaptation, whereas residualization anchors learning to physically plausible priors, promoting both stability and real-world feasibility. Warm-starting the critics from the pre-trained policy further accelerates learning by leveraging generic value structures.

OCRRL thus constitutes a principled avenue for combining geometric task structure, demonstration generalization, and deep RL adaptation for high-performance, data-efficient robotic manipulation.
