Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning

Published 19 May 2026 in cs.RO | (2605.19919v1)

Abstract: Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at https://manutdmoon.github.io/ZPRL/.


Authors (5)

Summary

The paper introduces ZPRL, a framework that steers robot policies via a task-aligned bottleneck latent, enhancing RL adaptation efficiency and safety.
It leverages a plug-and-play variational information bottleneck to maintain offline imitation quality while enabling precise latent-space residual adjustments.
Empirical results show a 33.7% improvement in average real-world success rate over the imitation base policies, along with smoother exploration than an action-space residual counterpart.

ZPRL: Bottleneck Latent Policy Steering for Efficient Robot Manipulation RL

Introduction and Motivation

Offline imitation learning (IL) has been highly successful in robot manipulation, enabling high-capacity policies—often built on transformers or generative models—to reproduce complex behaviors from demonstration data. However, deployment mismatch and error accumulation mean that even strong IL policies remain suboptimal in the real world. To address residual suboptimality, reinforcement learning (RL) is frequently used to adapt and finetune these offline-pretrained policies. The principal challenge is to determine the optimal interface for RL adaptation: should RL intervene in the policy’s weights, its output actions, or a more structured internal representation?

Traditional approaches involve either full policy finetuning (weight space), which is computationally expensive and often ill-suited to modern massive policy classes, or lightweight action-space residual methods, which adapt by learning a corrective action added to the pretrained output. The latter often leads to low-level, unstructured behavior that can be oscillatory or unsafe, especially in contact-rich or dynamic manipulation. Recent studies highlight the potential of steering latent, intermediate variables within the policy, but prior latent-space approaches typically exploit high-dimensional diffusion or noise spaces with limited task-relevance.

This paper introduces Z-Perturbation Reinforcement Learning (ZPRL) (2605.19919), a framework for RL policy adaptation via a compact bottleneck latent. ZPRL posits that a compressed, task-aligned latent, extracted by a variational information bottleneck (VIB), provides a semantically structured, low-dimensional interface through which RL can steer pretrained policies. Exploration through this interface is intended to stay aligned with the behavioral prior of the base policy, improving both adaptation efficiency and motion smoothness.

Methodology

Offline Phase: Policy Structure and Bottleneck Interface

The policy backbone follows a contemporary observation-encoder and flow-based action generator architecture, with image and proprioceptive features embedded and fed to a conditional flow model. To carve out a suitable steering interface, the method augments this observation embedding with a plug-and-play VIB module. The VIB is trained to maximize action-predictive informativeness while minimizing irrelevant variation, ensuring the bottleneck retains only task-relevant features. Training employs standard variational objectives and KL regularization over the latent.
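As a rough illustration of the "standard variational objectives and KL regularization" mentioned above, the following NumPy sketch computes a generic VIB-style loss: a predictive term plus a weighted KL divergence between a diagonal-Gaussian posterior and a standard normal prior. The linear encoder, the sizes (64-dimensional embedding, 16-dimensional latent), the weight `beta`, and the zero placeholder for the action-prediction loss are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def vib_encode(c, W_mu, W_logvar, rng):
    """Map embedding c to a diagonal-Gaussian posterior and sample z from it."""
    mu = c @ W_mu
    logvar = c @ W_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps  # reparameterized sample
    return z, mu, logvar

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def vib_loss(pred_loss, mu, logvar, beta=1e-3):
    """Predictive term plus beta-weighted KL regularizer, averaged over the batch."""
    return float(np.mean(pred_loss + beta * kl_to_standard_normal(mu, logvar)))

rng = np.random.default_rng(0)
c = rng.standard_normal((4, 64))                 # hypothetical batch of embeddings
W_mu = 0.1 * rng.standard_normal((64, 16))       # hypothetical linear VIB encoder
W_logvar = 0.1 * rng.standard_normal((64, 16))
z, mu, logvar = vib_encode(c, W_mu, W_logvar, rng)
pred_loss = np.zeros(4)                          # placeholder for the action-prediction loss
loss = vib_loss(pred_loss, mu, logvar)
```

A larger `beta` compresses the latent more aggressively, while a smaller one lets it retain more of the embedding; where the right trade-off lies is task dependent.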

This bottleneck is optimized in parallel with the main IL objective, but crucially, gradients from the VIB do not backpropagate into the observation encoder or flow model. As a result, the base policy's behavior cloning performance is unaffected, and the learned latent can later be used as a stable policy steering knob.

Online Phase: RL-Based Latent Steering

During RL-based adaptation, the base IL policy is frozen. ZPRL introduces an actor policy that predicts a residual perturbation in the bottleneck latent space. The process is:

Given observation $o$, the frozen encoder produces embedding $c$, and the VIB encoder samples bottleneck latent $z$.
A learned residual actor (trained with SAC) predicts $\Delta z$ conditioned on $(c, z)$, and the perturbed latent is $\hat{z} = z + \lambda \Delta z$, with $\lambda$ a tunable scale.
The VIB decoder transforms $\hat{z}$ into the conditioning vector for the frozen flow model, which produces the final action.
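The three steps above can be sketched as a single forward pass. This is a minimal NumPy illustration under strong assumptions: every module is a random linear map standing in for the real pretrained network, the flow-based action generator is replaced by a single `tanh` layer, the residual actor is deterministic (a SAC actor would be stochastic), and all names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, EMB, LAT, ACT = 10, 32, 8, 7  # hypothetical sizes

# Frozen stand-ins (random linear maps) for the pretrained modules.
W_enc = 0.1 * rng.standard_normal((OBS, EMB))
W_mu = 0.1 * rng.standard_normal((EMB, LAT))
W_logvar = 0.1 * rng.standard_normal((EMB, LAT))
W_dec = 0.1 * rng.standard_normal((LAT, EMB))
W_act = 0.1 * rng.standard_normal((EMB, ACT))
# Residual actor weights: the only part that RL would update.
W_res = 0.1 * rng.standard_normal((EMB + LAT, LAT))

def steer(o, lam, noise):
    c = np.tanh(o @ W_enc)                        # frozen encoder -> embedding c
    mu, logvar = c @ W_mu, c @ W_logvar
    z = mu + np.exp(0.5 * logvar) * noise         # VIB encoder samples z
    dz = np.tanh(np.concatenate([c, z]) @ W_res)  # residual actor: dz from (c, z)
    z_hat = z + lam * dz                          # perturbed latent
    cond = z_hat @ W_dec                          # VIB decoder -> conditioning vector
    action = np.tanh(cond @ W_act)                # stand-in for the frozen flow model
    return action, z, z_hat

o = rng.standard_normal(OBS)
noise = rng.standard_normal(LAT)
a_base, z, z_hat0 = steer(o, 0.0, noise)   # lam = 0 recovers the base policy
a_steer, _, z_hat = steer(o, 0.5, noise)   # lam > 0 steers through the latent
```

With `lam = 0.0` the perturbed latent equals the sampled latent, so the frozen policy's behavior is recovered exactly; this makes the scale a direct knob on how far RL can move away from the prior.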

RL updates only the residual actor and a critic that scores embedding-latent pairs; the frozen base policy is never updated. The critic is conditioned on both the current state embedding $c$ and the perturbed latent $\hat{z}$, which disambiguates state-action dependencies under stochastic bottleneck sampling.
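A small numerical example may help show why the critic input includes the perturbed latent. Because the bottleneck latent is sampled, the same embedding and the same residual can lead to different perturbed latents, and therefore to different actions. The sketch below assumes a unit-Gaussian posterior and hypothetical sizes; it only builds critic inputs and does not model the critic network itself.

```python
import numpy as np

rng = np.random.default_rng(1)
EMB, LAT = 32, 8  # hypothetical sizes

c = rng.standard_normal(EMB)                 # one fixed state embedding
mu, sigma = np.zeros(LAT), np.ones(LAT)      # assumed posterior parameters
z1 = mu + sigma * rng.standard_normal(LAT)   # two samples of the bottleneck latent
z2 = mu + sigma * rng.standard_normal(LAT)
dz = 0.1 * np.ones(LAT)                      # the same residual in both cases
lam = 1.0

z_hat1, z_hat2 = z1 + lam * dz, z2 + lam * dz

def critic_input(c, z_hat):
    """Concatenate embedding and perturbed latent, as described in the text."""
    return np.concatenate([c, z_hat])

# Conditioning on (c, dz) alone gives identical inputs for both samples ...
in_residual_only = (np.concatenate([c, dz]), np.concatenate([c, dz]))
# ... while conditioning on (c, z_hat) keeps them distinguishable.
in_perturbed = (critic_input(c, z_hat1), critic_input(c, z_hat2))
```

A critic that saw only the residual would have to assign one value to two situations that drive the frozen action generator differently.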

Key design choices such as latent dimensionality, perturbation scaling, and VIB compression are investigated systematically. The perturbation scale balances exploration and stability: excessive scaling can drive latents off the training manifold and degrade the action prior, while undersized scaling limits adaptation. Empirical guidelines for choosing $\lambda$ based on the relative RMS magnitudes of $z$ and $\Delta z$ are provided.
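One way to turn an RMS-based guideline into code is sketched below. It assumes the comparison is between sampled latents $z$ and raw actor outputs $\Delta z$, and it picks $\lambda$ so that the scaled perturbation has a chosen fraction of the latent's RMS magnitude. The function names and the `target_ratio` value are hypothetical and are not the paper's recommended settings.

```python
import numpy as np

def rms(x):
    """Root-mean-square magnitude over all entries."""
    return float(np.sqrt(np.mean(np.square(x))))

def perturbation_scale(z_samples, dz_samples, target_ratio=0.1):
    """Choose lam so that rms(lam * dz) is about target_ratio * rms(z)."""
    return target_ratio * rms(z_samples) / max(rms(dz_samples), 1e-8)

z_samples = np.full((100, 8), 2.0)    # toy latents with RMS 2.0
dz_samples = np.full((100, 8), 0.5)   # toy residuals with RMS 0.5
lam = perturbation_scale(z_samples, dz_samples, target_ratio=0.1)
```

In this toy case `lam` comes out at 0.4, since 0.1 × 2.0 / 0.5 = 0.4. A larger `target_ratio` widens exploration at the cost of pushing latents further from the region seen during offline training.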

Experimental Analysis

Simulation Benchmarks

ZPRL is evaluated on a cross-section of standard manipulation environments: Robomimic (can, square, transport), Adroit (door, hammer, pen), and Metaworld (box-close, push-wall). Policies receive image and proprioceptive input; for each task, policies are trained on fixed offline datasets.

Key findings:

Sample Efficiency: ZPRL achieves competitive or superior adaptation speed, reaching high success rates with fewer interactions versus residual action-space or full-policy finetuning baselines.
Asymptotic Performance: ZPRL matches or surpasses alternatives in final success rate on nearly all tasks. On tasks with smaller action spaces (e.g., Metaworld), the advantages are reduced, as action-space RL is less penalized.
Ablations: Steering on the bottleneck latent is superior to residuals in the raw observation embedding or action space. Moderate-dimensional (8–32) bottlenecks are empirically optimal; too small forfeits task information, too large loses compactness and optimization tractability.
Smoothness: ZPRL produces substantially smoother trajectories (reduced end-effector velocity/acceleration) than action residuals, as high-frequency jitter is suppressed by maintaining action generator regularity.
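The smoothness comparison above refers to end-effector velocity and acceleration. A simple way to compute such statistics from a logged trajectory is by finite differences, as in the sketch below. The exact metric used in the paper is not specified here, so the mean-norm definition, the sampling interval, and the synthetic trajectories are assumptions for illustration only.

```python
import numpy as np

def smoothness_metrics(positions, dt):
    """Mean end-effector speed and mean acceleration magnitude.

    positions: array of shape (T, 3), sampled every dt seconds.
    """
    vel = np.diff(positions, axis=0) / dt   # finite-difference velocity
    acc = np.diff(vel, axis=0) / dt         # finite-difference acceleration
    return (float(np.mean(np.linalg.norm(vel, axis=1))),
            float(np.mean(np.linalg.norm(acc, axis=1))))

dt = 0.02
t = np.linspace(0.0, 1.0, 51)[:, None]
smooth = t * np.array([[0.3, 0.0, 0.1]])    # straight-line motion
# Same motion with an alternating 2 mm offset, mimicking high-frequency jitter.
jitter = smooth + 0.002 * np.where(np.arange(51) % 2 == 0, 1.0, -1.0)[:, None]

speed_smooth, acc_smooth = smoothness_metrics(smooth, dt)
speed_jitter, acc_jitter = smoothness_metrics(jitter, dt)
```

The straight-line trajectory has near-zero acceleration, while the jittered one shows a large acceleration magnitude even though both follow the same overall path.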

Real-World Robot Manipulation

ZPRL is deployed on four hardware tasks: Place Orange (fruit placement), Flip Egg (dynamic contact manipulation), Open Box (bimanual coordination), and Insert Bills (deformable object insertion).

Numerical Results:

On Place Orange, Flip Egg, and Open Box, ZPRL outperforms the action-space residual baseline in final success rate (SR) and convergence speed. Averaged over the four real-world tasks, ZPRL improves SR by 33.7% over the imitation base policies. Gains are most pronounced in tasks demanding precise, dynamic interaction.
On Insert Bills, a highly challenging task, online RL with ZPRL raises the SR above that of the pretrained policy.
Robustness: ZPRL-trained policies show zero-shot robustness to disturbances (e.g., human intervention, out-of-distribution object initializations), with the highest robustness in visually perturbed but semantically aligned scenarios.

Theoretical and Practical Implications

ZPRL structurally decomposes the adaptation problem: RL becomes responsible not for open-ended action synthesis, but only for locally steering a strong action prior in a semantically meaningful latent manifold. This division yields several key properties:

Safety and Smoothness: Constraining RL interventions to the bottleneck latent mitigates out-of-distribution action risk and prevents exploration-induced motor instability.
Data Efficiency: By leveraging compressed and task-relevant representations, the RL search space is vastly reduced versus action- or noise-space methods.
Adaptation Bound: The capacity for RL adaptation is bounded by the support of the offline-trained latent manifold—ZPRL cannot generate entirely novel behaviors not encoded in the prior. Thus, policy quality in long-tailed or rare task modes remains contingent on underlying data coverage.

Limitations and Future Directions

While ZPRL offers a robust and efficient RL interface, key limitations persist:

Base Policy Dependence: Adaptation is strictly limited to behaviors representable by the pretrained latent/action prior. Out-of-support exploration is not reliably supported.
Interface Attachment: The current VIB latent requires joint training with the base IL policy; post-hoc fitting to arbitrary pretrained backbones is nontrivial.
Generalization to Entangled Architectures: As policy architectures incorporate multi-level cross-attention or distributed representations (e.g., VLA transformers), the identification and extraction of an effective, single steering latent is less obvious.
Beyond-RL Application: The framework could extend to active iterative imitation or online supervised adaptation, provided the latent structure is well-aligned.

Conclusion

ZPRL presents an approach to online RL post-training in robot manipulation that steers a compact, semantically rich bottleneck latent instead of intervening at the weight or action level. This approach provides a practical middle ground, balancing computational cost, sample efficiency, and safe, structured exploration. Empirical validation in both simulation and real-world robotic tasks demonstrates gains over strong baselines, especially in settings requiring smooth, coordinated behaviors. The results suggest that the interface for RL adaptation is a critical architectural choice in modern robot learning, with latent-space perturbations providing notable advantages. Future research should investigate latent interface attachment in decoupled pipelines, generalization to more complex policy backbones, and extension to broader adaptation scenarios in embodied AI.
