
RL-100: Performant Robotic Manipulation with Real-World Reinforcement Learning (2510.14830v1)

Published 16 Oct 2025 in cs.RO, cs.AI, and cs.LG

Abstract: Real-world robotic manipulation in homes and factories demands reliability, efficiency, and robustness that approach or surpass skilled human operators. We present RL-100, a real-world reinforcement learning training framework built on diffusion visuomotor policies trained by supervised learning. RL-100 introduces a three-stage pipeline. First, imitation learning leverages human priors. Second, iterative offline reinforcement learning uses an Offline Policy Evaluation procedure, abbreviated OPE, to gate PPO-style updates that are applied in the denoising process for conservative and reliable improvement. Third, online reinforcement learning eliminates residual failure modes. An additional lightweight consistency distillation head compresses the multi-step sampling process in diffusion into a single-step policy, enabling high-frequency control with an order-of-magnitude reduction in latency while preserving task performance. The framework is task-, embodiment-, and representation-agnostic and supports both 3D point clouds and 2D RGB inputs, a variety of robot platforms, and both single-step and action-chunk policies. We evaluate RL-100 on seven real-robot tasks spanning dynamic rigid-body control, such as Push-T and Agile Bowling, fluids and granular pouring, deformable cloth folding, precise dexterous unscrewing, and multi-stage orange juicing. RL-100 attains 100% success across evaluated trials for a total of 900 out of 900 episodes, including up to 250 out of 250 consecutive trials on one task. The method achieves time efficiency near or better than human teleoperation and demonstrates multi-hour robustness with uninterrupted operation lasting up to two hours.

Summary

  • The paper introduces a unified framework combining imitation learning, iterative offline RL, and online fine-tuning with diffusion-based policies for robust robotic manipulation.
  • It demonstrates deployment-grade reliability with 100% success over 900 real-world episodes and efficiency matching human teleoperation across diverse tasks.
  • Experimental results show strong generalization, fast adaptation to novel conditions, and a single-step consistency model that cuts inference latency by an order of magnitude.

RL-100: A Unified Framework for Real-World Robotic Manipulation via Reinforcement Learning and Diffusion Policies

RL-100 introduces a comprehensive framework for real-world robotic manipulation, integrating imitation learning (IL), iterative offline reinforcement learning (RL), and online RL fine-tuning atop diffusion-based visuomotor policies. The system is designed to achieve deployment-grade reliability, efficiency, and robustness across diverse manipulation tasks and robot embodiments, leveraging human priors while enabling autonomous policy improvement. This article summarizes RL-100's methodological innovations, experimental results, and implications for scalable robot learning.

(Figure 1)

Figure 1: Real-robot snapshots illustrating the diversity of the RL-100 task suite, spanning dynamic rigid-body control, fluids, deformable objects, and multi-stage manipulation.


Methodological Framework

Three-Stage Training Pipeline

RL-100 employs a three-stage pipeline:

  1. Imitation Learning (IL): Policies are initialized via behavior cloning on human teleoperated demonstrations, using conditional diffusion models to learn robust visuomotor mappings from RGB or 3D point cloud observations and proprioception to actions. The diffusion policy backbone supports both single-step and action-chunk control modes, enabling adaptation to task requirements.
  2. Iterative Offline RL: Policy improvement is performed on a growing buffer of rollouts, using a PPO-style objective applied across the denoising steps of the diffusion process. Offline Policy Evaluation (OPE) gates policy updates, ensuring conservative and monotonic improvement (see the sketch after this list). The value function is estimated via IQL, and the visual encoder is frozen for stability.
  3. Online RL Fine-Tuning: On-policy RL is used for final optimization, targeting rare failure modes and maximizing deployment metrics. Generalized Advantage Estimation (GAE) is used for advantage computation, and exploration is controlled via variance clipping in the DDIM sampler.
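
To make the OPE gate in stage 2 concrete, the following is a minimal sketch under stated assumptions: a small deterministic policy stands in for the diffusion policy, a toy fitted-Q evaluation stands in for the paper's IQL-based OPE, and the PPO-style offline update is stubbed out. Function names such as `ope_gated_update` and `fitted_q_evaluation` are illustrative, not taken from the paper's code.

```python
# Hypothetical sketch of the OPE-gated iterative offline RL loop described above.
import copy
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Toy deterministic policy standing in for the diffusion policy head."""
    def __init__(self, obs_dim=8, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
    def forward(self, obs):
        return self.net(obs)

def fitted_q_evaluation(policy, batch, gamma=0.99, iters=50):
    """Minimal fitted-Q evaluation: regress Q toward r + gamma * Q(s', pi(s'))."""
    obs, act, rew, next_obs = batch
    q = nn.Sequential(nn.Linear(obs.shape[1] + act.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(q.parameters(), lr=1e-3)
    for _ in range(iters):
        with torch.no_grad():
            next_a = policy(next_obs)
            target = rew + gamma * q(torch.cat([next_obs, next_a], -1)).squeeze(-1)
        loss = ((q(torch.cat([obs, act], -1)).squeeze(-1) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return q(torch.cat([obs, policy(obs)], -1)).mean().item()

def ope_gated_update(policy, batch, candidate_update_fn, margin=0.0):
    """Apply a candidate update only if its OPE value does not regress."""
    candidate = copy.deepcopy(policy)
    candidate_update_fn(candidate, batch)            # PPO-style offline update
    if fitted_q_evaluation(candidate, batch) >= fitted_q_evaluation(policy, batch) + margin:
        return candidate                             # accept: conservative improvement
    return policy                                    # reject: keep the current policy

# Toy usage with random data standing in for the rollout buffer.
obs, act = torch.randn(256, 8), torch.randn(256, 4)
rew, next_obs = torch.randn(256), torch.randn(256, 8)
def dummy_update(pi, batch):
    o, a, *_ = batch
    opt = torch.optim.Adam(pi.parameters(), lr=1e-3)
    loss = ((pi(o) - a) ** 2).mean()                 # placeholder for the PPO surrogate
    opt.zero_grad(); loss.backward(); opt.step()
policy = ope_gated_update(Policy(), (obs, act, rew, next_obs), dummy_update)
```

Only updates whose off-policy value estimate does not regress are kept, which is what makes each offline iteration conservative.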

A lightweight consistency distillation head compresses the multi-step diffusion policy into a single-step consistency model, enabling high-frequency control with minimal latency.
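
The distillation step can be pictured with a short sketch: a multi-step DDIM teacher (ε-prediction) maps an observation and initial noise to an action, and a one-step student is regressed onto the teacher's output for the same inputs. Network sizes, the 10-step toy schedule, and the plain L2 distillation loss are assumptions for illustration; the paper's consistency-distillation objective may differ in detail.

```python
import torch
import torch.nn as nn

ACT_DIM, OBS_DIM, K = 4, 8, 10  # action dim, observation dim, denoising steps

class EpsNet(nn.Module):
    """Teacher epsilon-prediction network conditioned on observation and step index."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ACT_DIM + OBS_DIM + 1, 128), nn.ReLU(),
                                 nn.Linear(128, ACT_DIM))
    def forward(self, a_k, obs, t):
        return self.net(torch.cat([a_k, obs, t], -1))

@torch.no_grad()
def ddim_sample(teacher, obs, noise, alpha_bars):
    """Deterministic multi-step DDIM sampling (eta = 0), starting from `noise`."""
    a = noise
    for k in reversed(range(1, K)):
        t = torch.full((obs.shape[0], 1), k / K)
        eps = teacher(a, obs, t)
        a0 = (a - (1 - alpha_bars[k]).sqrt() * eps) / alpha_bars[k].sqrt()
        a = alpha_bars[k - 1].sqrt() * a0 + (1 - alpha_bars[k - 1]).sqrt() * eps
    return a

teacher = EpsNet()
student = nn.Sequential(nn.Linear(ACT_DIM + OBS_DIM, 128), nn.ReLU(),
                        nn.Linear(128, ACT_DIM))
alpha_bars = torch.linspace(0.99, 1e-3, K)  # toy cumulative-alpha schedule (clean -> noisy)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    obs = torch.randn(64, OBS_DIM)                          # stand-in observations
    noise = torch.randn(64, ACT_DIM)                        # shared initial noise
    target = ddim_sample(teacher, obs, noise, alpha_bars)   # multi-step teacher action
    pred = student(torch.cat([noise, obs], -1))             # single forward pass
    loss = ((pred - target) ** 2).mean()                    # regress student onto teacher
    opt.zero_grad(); loss.backward(); opt.step()
```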

(Figure 2)

Figure 2: RL-100 pipeline: IL pretraining, iterative offline RL with data expansion, and online RL fine-tuning.

Diffusion Policy and Consistency Model

The policy backbone is a conditional diffusion model over actions, parameterized for either ε-prediction or x₀-prediction. The denoising process is embedded as a sub-MDP within each environment step, and policy gradients are computed over the chain of denoising steps. Consistency distillation enables deployment of a single-step policy, matching the performance of the multi-step diffusion policy while achieving an order-of-magnitude reduction in inference latency.
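
A hedged sketch of how a PPO-style clipped surrogate can be computed over the denoising chain: each recorded denoising transition is scored under a Gaussian centered at the policy's predicted mean, per-step log-probs are summed, and a single environment-level advantage weights the clipped ratio. The Gaussian transition model, fixed per-step variances, and shared advantage are simplifying assumptions rather than the paper's exact formulation.

```python
import torch
import torch.distributions as D

def denoise_chain_logprob(mean_fn, obs, chain, sigmas):
    """Sum log-probs of denoising transitions a_{k-1} ~ N(mean_fn(a_k), sigma_k^2).

    chain[k] is the action at noise level k: chain[K] is the initial noise,
    chain[0] is the action actually executed on the robot.
    """
    logp = 0.0
    for k in range(len(chain) - 1, 0, -1):
        mu = mean_fn(chain[k], obs, k)                     # predicted mean for a_{k-1}
        logp = logp + D.Normal(mu, sigmas[k]).log_prob(chain[k - 1]).sum(-1)
    return logp

def ppo_denoising_loss(new_logp, old_logp, advantage, clip=0.2):
    """Standard clipped surrogate; one advantage per environment step."""
    ratio = torch.exp(new_logp - old_logp)
    return -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1 - clip, 1 + clip) * advantage).mean()

# Toy usage with random tensors standing in for a recorded rollout.
B, ACT, OBS, K = 32, 4, 8, 5
obs = torch.randn(B, OBS)
chain = [torch.randn(B, ACT) for _ in range(K + 1)]        # recorded denoising chain
sigmas = [0.1] * (K + 1)
net = torch.nn.Linear(ACT + OBS + 1, ACT)
def mean_fn(a_k, o, k):
    t = torch.full((a_k.shape[0], 1), k / K)
    return net(torch.cat([a_k, o, t], -1))
new_logp = denoise_chain_logprob(mean_fn, obs, chain, sigmas)
old_logp = new_logp.detach() + 0.01 * torch.randn(B)       # pretend old-policy log-probs
advantage = torch.randn(B)                                  # GAE advantage per env step
ppo_denoising_loss(new_logp, old_logp, advantage).backward()
```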

Representation-Agnostic Design

RL-100 is agnostic to input modality (2D RGB or 3D point clouds), robot embodiment, and control regime. The visual encoder is self-supervised and regularized for stability during RL fine-tuning. Action heads are adapted for single-step or chunked control, supporting both fast reactive tasks and precision coordination.
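
To make the two control regimes concrete, here is a minimal sketch of an action-chunk head executed open-loop between re-planning steps, with a toy environment standing in for the robot. `ChunkPolicy`, `DummyEnv`, and the gym-style `step` interface are illustrative assumptions; setting `horizon=1` recovers single-step control.

```python
import torch
import torch.nn as nn

class ChunkPolicy(nn.Module):
    """Outputs H consecutive actions per observation (H=1 gives single-step control)."""
    def __init__(self, obs_dim=8, act_dim=4, horizon=8):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, horizon * act_dim))
    def forward(self, obs):
        return self.net(obs).view(-1, self.horizon, self.act_dim)

class DummyEnv:
    """Toy stand-in environment with an 8-D observation and 4-D action."""
    def reset(self):
        return torch.zeros(8).numpy()
    def step(self, action):
        return torch.randn(8).numpy(), 0.0, False, {}

def rollout(policy, env, steps=32):
    """Predict a chunk, execute it open-loop, then re-plan from the new observation."""
    obs = env.reset()
    for _ in range(steps // policy.horizon):
        chunk = policy(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))[0]
        for a in chunk:                       # execute the chunk before re-planning
            obs, reward, done, info = env.step(a.detach().numpy())
            if done:
                return

rollout(ChunkPolicy(), DummyEnv(), steps=32)
```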


Experimental Results

Real-World Task Suite

RL-100 is evaluated on seven real-robot tasks, including dynamic Push-T, agile bowling, pouring (granular and fluid), dexterous unscrewing, dual-arm soft-towel folding, and multi-stage orange juicing. The suite covers rigid-body, deformable, and fluid manipulation, with randomized initial conditions and challenging physical variations.

(Figure 3)

Figure 3: Rollout trajectories for seven real-world tasks, visualized in point clouds.

Reliability, Efficiency, and Robustness

  • Reliability: RL-100 achieves 100% success across 900 real-world evaluation episodes (900/900), including 250/250 consecutive trials on dual-arm folding. Iterative offline RL raises average success from 70.6% (imitation baseline) to 91.1%, and online RL eliminates residual failures.
  • Efficiency: RL-100 matches or surpasses human teleoperation in time-to-completion and throughput, with shorter episode lengths and lower wall-clock times. The consistency model variant enables high-frequency control (up to 378 Hz in simulation), removing inference bottlenecks.
  • Robustness: Policies generalize zero-shot to novel dynamics (mean 92.5% success) and adapt few-shot to substantial task variations (mean 86.7% after 1–3 hours of additional training). RL-100 demonstrates robust recovery from physical disturbances (mean 95% success under perturbations).

(Figure 4)

Figure 4: RL-100 generalizes zero-shot to novel dynamics and adapts few-shot to new task variations.

Simulation Benchmarks

RL-100 outperforms state-of-the-art diffusion/flow-based RL methods (DPPO, ReinFlow, DSRL) on MuJoCo locomotion, Adroit dexterous manipulation, and Meta-World precision tasks. It achieves higher asymptotic returns, faster convergence, and lower variance across seeds. Ablations confirm the benefits of 3D point cloud input, variance clipping, consistency distillation, and ε-prediction parameterization for exploration and stability.

(Figure 5)

Figure 5: RL-100 achieves superior learning curves and inference speed compared to baselines in simulation.


Implementation Considerations

  • Computational Requirements: RL-100's training pipeline is efficient, with most data collected during iterative offline RL and minimal online RL budget. The consistency model enables real-time deployment, with inference latency limited by perception hardware.
  • Scaling: The framework is compatible with large-scale datasets and multi-task, multi-robot settings. The modular design supports extension to vision-language-action models and cross-embodiment transfer.
  • Deployment: RL-100 policies are robust to distribution shift, physical disturbances, and long-horizon operation. Conservative OPE-gated updates and variance clipping ensure safety and stability during real-world training.
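
As an illustration of the variance clipping mentioned above, the sketch below clamps the injected noise scale of one stochastic DDIM-style step to a fixed range so that online exploration stays bounded. The clip range, the η = 1 noise scale, and the function name are assumptions, not the paper's exact settings.

```python
import torch

def clipped_ddim_step(a_k, eps_hat, alpha_bar_k, alpha_bar_prev,
                      sigma_min=0.01, sigma_max=0.1):
    """One stochastic DDIM step with the injected noise scale clamped."""
    a0_hat = (a_k - (1 - alpha_bar_k).sqrt() * eps_hat) / alpha_bar_k.sqrt()
    sigma = ((1 - alpha_bar_prev) / (1 - alpha_bar_k)).sqrt() \
            * (1 - alpha_bar_k / alpha_bar_prev).sqrt()
    sigma = sigma.clamp(sigma_min, sigma_max)          # bound exploration noise
    mean = alpha_bar_prev.sqrt() * a0_hat + (1 - alpha_bar_prev - sigma ** 2).sqrt() * eps_hat
    return mean + sigma * torch.randn_like(a_k)

# Toy usage on random tensors.
a_k, eps_hat = torch.randn(16, 4), torch.randn(16, 4)
a_prev = clipped_ddim_step(a_k, eps_hat, torch.tensor(0.5), torch.tensor(0.8))
```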

Implications and Future Directions

RL-100 demonstrates that unified IL+RL training atop diffusion policies yields deployment-grade reliability, efficiency, and robustness for real-world robotic manipulation. The framework's generality across tasks, embodiments, and representations, combined with high-frequency control via consistency distillation, addresses key bottlenecks in practical robot learning.

Future work should extend RL-100 to more complex, cluttered, and partially observable environments, scale to multi-task and multi-robot VLA models, and develop autonomous reset and recovery mechanisms. Investigating scaling laws for data/model size versus sample efficiency, and aligning large VLA priors with RL-100's unified objective, will further advance the deployment of autonomous manipulators in unstructured settings.


Conclusion

RL-100 provides a unified, deployment-centric framework for real-world robot learning, integrating imitation and reinforcement learning under a single objective and leveraging diffusion-based policies for expressive, robust control. The system achieves perfect reliability, human-level or better efficiency, and strong generalization across diverse manipulation tasks and embodiments. RL-100 represents a significant advance toward practical, scalable, and robust robot learning systems suitable for homes and factories.
