- The paper presents a novel framework (PPT) that combines ProMP-based trajectory parameterization, PPO learning in weight space, and an energy tank to achieve contact-safe robot manipulation.
- It demonstrates superior performance with lower jerk, reduced energy spikes, and higher contact continuity across simulated and real-world tasks.
- The framework enables robust generalization and seamless sim-to-real transfer while strictly enforcing energy constraints for safety.
Motivation and Context
Contact-rich robotic manipulation is challenging because of discontinuous dynamics, complex energy exchange, and stringent safety requirements. Left unaddressed, these factors lead to unstable interactions, excessive contact forces, and potentially unsafe behavior, especially under non-trivial contact conditions. Traditional RL approaches commonly act in robot joint space and lack trajectory-level smoothness and explicit energy-safety guarantees. Movement primitives, passivity-based control, and safe-RL techniques each address pieces of the problem, but no prior framework jointly delivers robust generalization, smoothness, and contact-aware safety under high-dimensional contact forces. This paper introduces a unified solution that integrates reinforcement learning, Probabilistic Movement Primitives (ProMPs), and energy-aware control, specifically a passivity-enforcing energy tank, for adaptive, safe, and robust contact-rich manipulation (2511.13459).
Methodological Framework
The proposed framework, termed PPT (ProMP PPO Energy-Tank), synthesizes three components:
- Trajectory Representation with ProMPs: Trajectories are encoded as distributions over Radial Basis Function (RBF) weights, providing a low-dimensional, smooth, and probabilistic motion model. Conditioning on via-points enables adaptation to geometry or task constraints, which is particularly effective in unseen environments or tasks with unexpected contact conditions (a minimal sketch of this representation and the residual policy action follows this list).
- Adaptive Policy Learning via PPO in ProMP Weight Space: Rather than issuing low-level torque or velocity commands, the policy operates in ProMP parameter space, outputting residual weight updates conditioned on current observations (including phase and contact information). Training uses Proximal Policy Optimization (PPO), which stabilizes policy updates and preserves the structural smoothness inherited from ProMPs.
- Passivity and Energy Safety through an Energy-Tank Mechanism: Physical safety is maintained by an energy-tank layer that dynamically rescales policy outputs to respect power (and thus energy) constraints. This keeps the system passive at all times, directly preventing damaging force bursts or uncontrolled energy injection during contact events. The tank is updated at every time step from the instantaneous power and the energy limits, enforcing hardware-safe operation regardless of exploration (a sketch of this update appears after the impedance-control paragraph below).
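To make the first two components concrete, the following is a minimal sketch (not the paper's implementation) of a single-DoF ProMP built from normalized Gaussian RBF features, Gaussian conditioning on one via-point, and a residual weight update standing in for the PPO action. All function names, basis counts, and noise values are illustrative assumptions.

```python
import numpy as np

def rbf_features(z, n_basis=10, width=0.02):
    """Normalized Gaussian RBF features over the phase z in [0, 1]."""
    centers = np.linspace(0.0, 1.0, n_basis)
    phi = np.exp(-(z - centers) ** 2 / (2.0 * width))
    return phi / phi.sum()

def promp_trajectory(weights, n_steps=100):
    """Mean trajectory y(z) = phi(z)^T w, evaluated over a phase grid."""
    return np.array([rbf_features(z) @ weights
                     for z in np.linspace(0.0, 1.0, n_steps)])

def condition_on_viapoint(mu_w, sigma_w, z_star, y_star, noise=1e-4):
    """Gaussian conditioning of the weight distribution on a via-point
    (z_star, y_star), e.g. a geometric or contact constraint."""
    phi = rbf_features(z_star)
    gain = sigma_w @ phi / (noise + phi @ sigma_w @ phi)
    mu_new = mu_w + gain * (y_star - phi @ mu_w)
    sigma_new = sigma_w - np.outer(gain, phi @ sigma_w)
    return mu_new, sigma_new

# Illustrative rollout: condition the prior on one via-point, then apply a
# residual weight update (a stand-in for the PPO action); the executed
# trajectory stays smooth because it is reconstructed through the RBF basis.
mu_w, sigma_w = np.zeros(10), np.eye(10)
mu_w, sigma_w = condition_on_viapoint(mu_w, sigma_w, z_star=0.5, y_star=0.1)
delta_w = 0.05 * np.random.randn(10)
trajectory = promp_trajectory(mu_w + delta_w)
```

Because the policy acts only on the weights, any residual it outputs is filtered through the same smooth basis, which is the mechanism behind the trajectory-level smoothness claimed above.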
This triad is executed within a Cartesian impedance control loop, translating the task-space trajectories generated by ProMPs into compliant end-effector wrenches. By decoupling geometric adaptation (via-point conditioning) from task performance-driven learning (policy residuals), the framework delivers adaptable yet smooth and safety-constrained trajectories.
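The execution-time interplay between the Cartesian impedance loop and the energy tank can be sketched as below; the stiffness and damping gains, tank budget, and power limit are assumed placeholder values, and the scalar-tank update is a simplified stand-in for the paper's passivity layer.

```python
import numpy as np

K = np.diag([800.0, 800.0, 400.0])  # assumed Cartesian stiffness [N/m]
D = np.diag([60.0, 60.0, 40.0])     # assumed Cartesian damping [Ns/m]

def impedance_wrench(x_des, x, x_dot):
    """Compliant end-effector force toward the ProMP-generated set-point."""
    return K @ (x_des - x) - D @ x_dot

class EnergyTank:
    """Scalar tank that attenuates the commanded wrench so that the energy
    drawn in one step never exceeds the stored budget or the power limit."""
    def __init__(self, e_init=2.0, e_min=0.05, p_max=5.0):
        self.e = e_init      # stored energy [J]
        self.e_min = e_min   # floor that must never be crossed
        self.p_max = p_max   # instantaneous power budget [W]

    def filter(self, wrench, x_dot, dt):
        p = float(wrench @ x_dot)  # power the command would inject
        if p <= 1e-9:
            # Dissipation (or rest) refills the tank; a real implementation
            # would also cap the refill at some e_max.
            self.e -= p * dt
            return wrench
        p_budget = min(self.p_max, max(self.e - self.e_min, 0.0) / dt)
        alpha = min(1.0, p_budget / p)   # attenuation factor, never above 1
        self.e -= alpha * p * dt         # withdraw only the scaled energy
        return alpha * wrench

# One control step: the ProMP/policy provides x_des, the impedance law turns
# it into a wrench, and the tank rescales that wrench before execution.
tank = EnergyTank()
x, x_dot = np.zeros(3), np.array([0.05, 0.0, 0.0])
f_cmd = tank.filter(impedance_wrench(np.array([0.1, 0.0, 0.0]), x, x_dot),
                    x_dot, dt=0.002)
```

The key design choice in this sketch is that the tank only attenuates the commanded wrench (alpha is at most 1), so it can never add energy, which is what keeps the closed loop passive even while the policy explores.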
Experimental Evaluation and Quantitative Results
Extensive empirical validation is conducted across simulation and real-world settings on two canonical contact-rich tasks: planar box pushing and 3D maze sliding.
Key experimental findings:
- Box Pushing: PPT (ProMP+PPO+Tank) shows rapid, stable learning curves, achieving high success rates while maintaining the lowest peak power during training. Step-wise PPO baselines exhibit high variance and pronounced jerk due to myopic action selection. The energy tank effectively moderates energy spikes from exploratory actions without significantly impeding task progress.
- 3D Maze Sliding: The model is trained only on straight corridors but generalizes robustly to previously unseen geometries (curves, 3D undulations) without reward retuning or architectural changes. PPT produces trajectories with lower jerk, reduced 95th-percentile wrench, fewer overload events, and nearly 50% higher contact continuity relative to step-wise PPO under the same power budget.
- Sim-to-Real Transfer: The structured prior and policy-residual decomposition, combined with the energy constraints, enable direct transfer from simulation to hardware. Success rates and contact-safety metrics are preserved despite unmodeled real-world friction and measurement noise.
The following table summarizes the comparative trends for maze sliding (higher is better for success rate, contact continuity, and Progress@T; lower is better for jerk, wrench, and overload):
| Metric | PPT | Step-wise PPO w/ Tank (ST) |
|---|---|---|
| Success rate (%) | Higher | Lower |
| Jerk RMS (m/s³) | Lower | Higher |
| Peak Wrench P95 (N) | Lower | Higher |
| Overload Ratio (%) | Lower | Higher |
| Contact Continuity | Higher | Lower |
| Progress@T | Slightly lower | Slightly higher |
Across these metrics, PPT consistently outperforms the step-wise baseline in stability, safety, and smoothness, occasionally trading task speed (Progress@T) for strict safety.
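As a rough illustration of how such episode-level metrics might be computed from logged data, the definitions below are plausible reconstructions rather than the paper's exact formulas; the force limit and the per-step contact flag are assumptions.

```python
import numpy as np

def jerk_rms(positions, dt):
    """RMS norm of the third finite difference of end-effector position."""
    jerk = np.diff(positions, n=3, axis=0) / dt ** 3
    return float(np.sqrt(np.mean(np.sum(jerk ** 2, axis=-1))))

def wrench_p95(wrench_norms):
    """95th percentile of the contact-wrench magnitude over an episode."""
    return float(np.percentile(wrench_norms, 95))

def overload_ratio(wrench_norms, f_limit=40.0):
    """Fraction of steps whose wrench exceeds an assumed force limit."""
    return float(np.mean(wrench_norms > f_limit))

def contact_continuity(in_contact):
    """Fraction of steps with sustained contact, from a per-step boolean flag."""
    return float(np.mean(in_contact))
```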
Implications and Theoretical Insights
The explicit division between geometric adaptation (via-point-conditioned ProMPs) and policy residual refinement optimizes both robustness and adaptability. Integrating policy learning in trajectory-parameter space enforces global smoothness, directly addressing high-frequency instability intrinsic to step-wise RL. The energy-tank mechanism, acting at execution time, ensures hard constraints on system passivity and energy exchange, functioning as a global safety filter. This approach permits safe real-world exploration without iterative retuning or reward hacking, crucial for scalable application in uncertain contact-dense domains.
Theoretically, the synthesis of trajectory-level priors and RL-driven exploration under a strict energy-aware regime demonstrates that contact-rich RL can achieve both high generalization and strong safety guarantees, which have been historically in tension. The framework's ability to transfer policies across tasks and hardware platforms without finetuning is a practical manifestation of this convergence.
Future Directions
While effective, a fixed-budget energy tank is conservative and can restrict performance, particularly in highly dynamic or long-horizon tasks where adaptive energy management is desirable. Future research could explore context-sensitive tank sizing, hierarchical or mixture-of-primitive trajectory priors for greater expressivity, and integration with multimodal sensing to further close the sim-to-real gap. Combining this paradigm with model-based RL, differentiable simulation, or hybrid residual-learning policies may also unlock new levels of autonomy and safety in high-DOF, multi-contact manipulation tasks.
Conclusion
This work presents a technically rigorous and practically validated framework for contact-safe RL, uniting ProMP-based trajectory parameterization, PPO-driven policy refinement in weight space, and energy-tank passivity enforcement. The results substantiate that structure-aware, energy-constrained RL policies achieve superior success rates, safety, and robustness in contact-rich environments, supporting both theoretical and practical advances in autonomous robot manipulation. The approach lays a solid foundation for scalable, safe, and generalizable deployment of RL in real-world contact-rich settings (2511.13459).