TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

Published 12 May 2026 in cs.RO, cs.AI, and cs.LG | (2605.12236v1)

Abstract: Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.


Authors (4)

Summary

The paper introduces a novel diffusion-based pretraining method that modulates context to balance exploration and exploitation in RL.
It demonstrates that dynamically controlling context noise during RL finetuning substantially improves sample efficiency and out-of-distribution performance.
Empirical evaluations on high-dimensional robotics and vision-language benchmarks validate TMRL’s stability, robust adaptation, and superior action coverage.

Diffusion Timestep-Modulated Pretraining for Enhanced Policy Finetuning in RL

Introduction and Motivation

The paper "TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning" (2605.12236) addresses a persistent challenge in robotics: the sample inefficiency of RL policy finetuning when initialized from behavior-cloned (BC) policies. Standard BC, while effective within the support of the demonstration data, yields narrow conditional action distributions $p(a | c)$ that generalize poorly and provide little exploratory coverage in out-of-distribution (OOD) contexts. This impedes RL signal propagation and hampers online adaptation, because actions outside the demonstrated behavior and task variations are rarely explored.

Addressing these issues, the authors propose a unified pretraining and finetuning framework. Their method, Timestep-Modulated Reinforcement Learning (TMRL), leverages Context-Smoothed Pretraining (CSP) via diffusion-based context noising. The approach enables explicit and dynamic control over action coverage by smoothly interpolating between the tight conditional regime of BC and a broad marginal distribution $p(a)$. The central claim is that CSP delivers a steerable base policy whose exploration-exploitation balance can be adaptively tuned throughout RL, resulting in substantially improved adaptation and sample efficiency across high-dimensional, vision-conditioned, and real-world tasks.

Methodological Framework

Context-Smoothed Pretraining (CSP)

CSP introduces controlled noise into the context (policy input) $c$ during pretraining, rather than adding noise in the action space. The context is corrupted using a forward diffusion process parameterized by a noise variable or diffusion timestep $o$. At zero noise, the policy recovers standard BC; at maximal noise, the context contains no information, yielding the marginal $p(a)$. Intermediate noise levels enable structured aliasing, allowing the policy to "borrow" coherent action sequences from similar but distinct contexts, explicitly expanding its action support while maintaining trajectory-level coherence.
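The two endpoints described above can be illustrated with a minimal sketch. It assumes a variance-preserving Gaussian kernel with a linear signal schedule and a noise level normalized to $[0, 1]$; the function name `noise_context` and the schedule are illustrative assumptions, not details from the paper.

```python
import numpy as np

def noise_context(c, o, rng):
    """Corrupt context c at noise level o in [0, 1].

    Assumed variance-preserving kernel (not taken from the paper):
    o = 0 returns c unchanged (standard BC conditioning);
    o = 1 returns pure Gaussian noise carrying no information about c.
    """
    o = float(np.clip(o, 0.0, 1.0))
    alpha = 1.0 - o  # hypothetical linear signal schedule
    eps = rng.standard_normal(np.shape(c))
    return np.sqrt(alpha) * np.asarray(c, dtype=float) + np.sqrt(1.0 - alpha) * eps

rng = np.random.default_rng(0)
c = np.array([0.5, -1.0, 2.0])        # toy context vector
clean = noise_context(c, 0.0, rng)    # identical to c
partial = noise_context(c, 0.3, rng)  # partially corrupted context
blank = noise_context(c, 1.0, rng)    # independent of c
```

During pretraining, a policy conditioned on such corrupted contexts (and on `o` itself) would see many nearby contexts mapped to overlapping inputs, which is the aliasing effect the paragraph describes.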

Formally, for context $c$, the corrupted context $\check{c}$ is sampled via a diffusion kernel $q_{o}(\check{c}|c)$, and the policy is trained to maximize the likelihood of the action under the noised context as an explicit function of $o$. Unlike approaches such as Gaussian action-space noise or classifier-free guidance, this method provides smooth, targeted action coverage expansion without execution incoherence.
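One way to write this objective is sketched below. The symbols $\pi_{\theta}$ (the policy), $\mathcal{D}$ (the demonstration dataset), and $p(o)$ (the distribution from which noise levels are drawn during training) are notation assumed here for illustration and may differ from the paper's.

$$
\max_{\theta}\; \mathbb{E}_{(c,a)\sim\mathcal{D},\; o\sim p(o),\; \check{c}\sim q_{o}(\cdot\mid c)}\left[\log \pi_{\theta}(a \mid \check{c}, o)\right]
$$

When $o = 0$ and $q_{0}$ leaves the context unchanged, this reduces to the standard BC log-likelihood $\mathbb{E}_{(c,a)\sim\mathcal{D}}[\log \pi_{\theta}(a \mid c)]$. For diffusion or flow policies, the log-likelihood term would typically be replaced by the usual denoising surrogate loss.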

Timestep-Modulated RL (TMRL)

Post-pretraining, the RL agent is given explicit control over the context-smoothing noise level $o$ during online finetuning. A high-level RL policy modulates $o$ alongside standard latent variables (e.g., the latent noise in generative control policies), selecting both the exploration regime and the action. This enables dynamic adjustment of exploration and exploitation: higher $o$ facilitates broad exploration in OOD or uncertain contexts, while lower $o$ allows focused exploitation when the context is in-distribution. Notably, the framework is fully compatible with generative policies over various inputs (states, point clouds, vision-language embeddings), making it broadly applicable.
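The control flow of a single action selection can be sketched as follows. Everything here is a toy stand-in under stated assumptions: `base_policy` replaces the frozen CSP network with a fixed function, `high_level_policy` replaces the RL-trained steering policy with a hand-set parameter dictionary, and the latent noise name `w` and all parameter names are hypothetical.

```python
import numpy as np

def noise_context(c, o, rng):
    # Assumed variance-preserving corruption; o = 0 leaves c untouched.
    return np.sqrt(1.0 - o) * c + np.sqrt(o) * rng.standard_normal(c.shape)

def base_policy(c_noisy, o, w):
    # Hypothetical frozen CSP policy: maps (noised context, noise level,
    # latent noise) to a bounded action. A fixed function stands in for the network.
    return np.tanh(c_noisy[: w.size] + (1.0 + o) * w)

def high_level_policy(c, params, rng):
    # Hypothetical steering policy: outputs the noise level o (squashed into
    # [0, 1] by a sigmoid) and the latent noise w passed to the base policy.
    o = 1.0 / (1.0 + np.exp(-(params["o_bias"] + params["o_weight"] @ c)))
    w = params["w_mean"] + params["w_std"] * rng.standard_normal(params["w_mean"].shape)
    return float(o), w

def tmrl_act(c, params, rng):
    # The high-level policy picks the exploration regime; the base policy acts.
    o, w = high_level_policy(c, params, rng)
    c_noisy = noise_context(c, o, rng)
    return base_policy(c_noisy, o, w), o

rng = np.random.default_rng(0)
params = {"o_bias": -1.0, "o_weight": np.zeros(4), "w_mean": np.zeros(2), "w_std": 1.0}
action, o = tmrl_act(np.ones(4), params, rng)
```

In this arrangement only the high-level policy's parameters would be updated by RL, so exploration breadth is changed by moving `o` rather than by perturbing the action directly.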

Theoretical Insights

The authors provide a formal analysis demonstrating that context smoothing increases the overlap between a policy's action distribution at different contexts, yielding provable reduction in total variation distance and guaranteeing improvement in demonstrator action coverage. This substantiates that CSP policies address the support-collapse pathology of BC and are inherently more amenable to effective RL finetuning.
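The overlap claim can be illustrated numerically on a toy problem. This is a simplified stand-in rather than the paper's analysis: it assumes five contexts on a line, a one-hot BC policy (context `i` always takes action `i`), and Gaussian kernel weights of width `s` in place of the diffusion kernel.

```python
import numpy as np

def smoothed_policy(s, n=5):
    """Row i is the action distribution at context i under smoothing width s.

    The unsmoothed policy is one-hot (context i always takes action i).
    Smoothing mixes in neighbouring contexts' actions with Gaussian weights.
    """
    if s == 0.0:
        return np.eye(n)
    x = np.arange(n, dtype=float)
    w = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * s**2))
    return w / w.sum(axis=1, keepdims=True)

def tv(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(p - q).sum()

for s in (0.0, 0.5, 2.0):
    pi = smoothed_policy(s)
    # Columns: smoothing width, TV between contexts 0 and 1,
    # probability that context 0 takes context 1's action.
    print(s, round(tv(pi[0], pi[1]), 3), round(pi[0, 1], 3))
```

In this toy setting the total variation distance between neighbouring contexts starts at 1 with no smoothing and shrinks as `s` grows, while the probability that context 0 assigns to context 1's action rises from zero.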

Empirical Evaluation

Action Coverage and Zero-shot Performance

CSP outperforms both standard BC and Gaussian action-noise-based approaches (e.g., PostBC) in action coverage, as measured via zero-shot success rates on OOD robotic navigation and manipulation benchmarks (OGBench). The advantage is particularly stark in sparse-coverage or high-dimensional settings, where standard BC and PostBC can exhibit zero success rate at all sample sizes in the OOD region, while CSP maintains meaningful coverage.

RL Finetuning Efficiency and Stability

TMRL achieves substantial gains in sample efficiency and final policy robustness across a spectrum of challenging RL tasks. In the pointmaze-giant and cube-single settings, TMRL achieves near-100% success, with more than 200% relative gain in sample efficiency over leading baselines such as DSRL and PostBC. Crucially, the method exhibits consistently low-variance learning dynamics, enabling stable adaptation even when starting from policies with sparse coverage.

Generalization to Rich Policy Inputs

The authors extend CSP and TMRL to vision-language-action (VLA) policies and pointcloud-conditioned policies:

VLM Embedding Conditioning: Noising VLM embeddings prior to action prediction in image-conditioned robotic manipulation enables borrowed action chunks across tasks, allowing rapid adaptation even in long-horizon, compositional tasks (Libero-90 benchmark), where alternatives fail to make progress.
Pointcloud-based Dexterous Manipulation: Context smoothing across point clouds in high-DOF dexterous grasping leads to rapid adaptation to unseen objects and 2.5$\times$ higher final success rates relative to policy-steering methods.

Real-world Robotic Adaptation

TMRL is demonstrated on real robotic platforms (WidowX 250, Franka Panda) using large-scale VLA policies. Only TMRL successfully enables efficient RL adaptation within an hour of real-world interaction for tasks where the pre-trained policy performs at chance and diffusion policy steering [2] fails. This highlights the real-world applicability and low sample complexity of the proposed approach.

Comparative Analyses and Ablations

Classifier-Free Guidance (CFG) Baseline: TMRL surpasses CFG-based methods, which fail to extrapolate effectively to OOD contexts due to their dependency on $c$ even in the marginal regime.
Exploration Dynamics Visualization: TMRL's exploration is structurally broader and more coherent compared to policies with action-space noise or steering-only methods, supporting more efficient policy improvement.
Dynamic Modulation of Smoothing: TMRL learns to modulate context noise adaptively within trajectories, employing higher exploration noise early in an episode and reducing it as task-relevant context becomes more certain, thereby shifting from exploration toward exploitation over the course of an episode.
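For reference on the CFG comparison in the list above, the standard classifier-free guidance rule combines a conditional and an unconditional prediction. The noise-prediction network $\epsilon_{\theta}$, the noisy action $a_t$, the guidance weight $\lambda$, and the null context $\varnothing$ are generic notation used here for illustration, not symbols taken from the paper.

$$
\hat{\epsilon}_{\theta}(a_t, c) = \epsilon_{\theta}(a_t, \varnothing) + \lambda\,\bigl(\epsilon_{\theta}(a_t, c) - \epsilon_{\theta}(a_t, \varnothing)\bigr)
$$

For any $\lambda \neq 0$ the guided prediction still evaluates the conditional branch at the exact context $c$, so it only interpolates between two fixed regimes. Context smoothing, by contrast, changes what the policy is conditioned on. This is one plausible reading of why CFG extrapolates poorly to OOD contexts.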

Implications and Future Directions

The proposed CSP-TMRL pipeline fundamentally reframes pretraining for RL finetuning as the construction of a steerable, smoothing-aware policy with adaptive action coverage. This enables sample-efficient, robust adaptation and addresses a central limitation of BC-based warm starts in RL. The realization that context noising—rather than action perturbation—enables structured exploration without loss of behavioral coherence has both theoretical and practical implications for policy learning in robotics and beyond.

Key implications:

Theoretical: The context smoothing approach is supported by explicit coverage and generalization guarantees, providing a constructive path to prevent support collapse and improve OOD finetuning reliability.
Practical: The architecture-agnostic nature of CSP enables straightforward integration with any context-conditioned generative policy, from VLMs to high-DOF controllers.
Safety Considerations: The broadened action distribution may increase the risk of unsafe actions in the real world; supplementing TMRL with safety filters or learned world models becomes essential in safety-critical deployment.
Sample Efficiency: While outperforming alternatives, TMRL still faces sample efficiency barriers in especially challenging domains, motivating future work on improved high-level steering algorithms and adaptive corruption kernels.

Future Developments

Adaptive or Curriculum-Based Smoothing: Automatic scheduling of smoothing parameters via learned intrinsic objectives or uncertainty estimators could further enhance sample efficiency.
Integration with Model-Based Methods: Combining context-smoothed policies with model-based RL or uncertainty-aware policy optimization may further improve safety and adaptation capability.
Scaling Beyond Robotics: The general framework is directly applicable to any conditional generative policy in domains with limited demonstrator data and OOD deployment; language modeling, multi-task learning, and complex decision processes are natural extensions.

Conclusion

This work establishes a principled and empirically validated framework for RL policy finetuning via context-smoothed pretraining and explicit diffusion-timestep modulation. By leveraging diffusion-based context noising, TMRL policies interpolate smoothly between broad action coverage and precise exploitation, enabling efficient, stable, and robust adaptation in both simulation and real-world settings. The framework offers a practical route toward scalable, generalizable policy learning for any context-conditioned generative policy (2605.12236).
