Reusing Trajectories in Policy Gradients Enables Fast Convergence (2506.06178v1)

Published 6 Jun 2025 in cs.LG

Abstract: Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. These methods learn the parameters of parametric policies via stochastic gradient ascent, typically using on-policy trajectory data to estimate the policy gradient. However, such reliance on fresh data makes them sample-inefficient. Indeed, vanilla PG methods require $O(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse off-policy information from past iterations, such as previous gradients or trajectories. While gradient reuse has received substantial theoretical attention, leading to improved rates of $O(\epsilon^{-3/2})$, the reuse of past trajectories remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that extensive reuse of past off-policy trajectories can significantly accelerate convergence in PG methods. We introduce a power mean correction to the multiple importance weighting estimator and propose RPG (Retrospective Policy Gradient), a PG algorithm that combines old and new trajectories for policy updates. Through a novel analysis, we show that, under established assumptions, RPG achieves a sample complexity of $\widetilde{O}(\epsilon^{-1})$, the best known rate in the literature. We further validate empirically our approach against PG methods with state-of-the-art rates.

Summary

  • The paper introduces a power mean (PM) corrected importance weighting estimator that reuses off-policy trajectories to achieve a sample complexity of Õ(1/ε) and faster convergence.
  • It leverages both past off-policy trajectories and new on-policy data to update policy parameters while effectively controlling variance.
  • The results provide practical insights for continuous control tasks, demonstrating improved convergence speed and robustness.

Efficient Convergence in Policy Gradients Through Trajectory Reuse

The paper "Reusing Trajectories in Policy Gradients Enables Fast Convergence," authored by Alessandro Montenegro et al., presents a significant stride in the field of reinforcement learning, emphasizing the role of trajectory reuse in accelerating policy gradient methods (PGs). Policy gradient methods have demonstrated intrinsic robustness and effectiveness in continuous control problems, but their sample inefficiency remains a considerable challenge. This paper addresses this inefficiency by introducing an innovative approach to reusing off-policy trajectories, proposing the \algnameshort (\algname) algorithm, and rigorously proving its convergence rates.

Theoretical Contributions

The authors present a series of theoretical contributions that advance the understanding of trajectory reuse in PG methods, diverging from the prevalent focus on gradient reuse. The theoretical foundation establishes the benefit of using past trajectories for gradient estimation, leveraging a new variant of the multiple importance weighting (MIW) estimator based on a power mean (PM) correction. This estimator lets trajectory data from multiple previous iterations contribute effectively to gradient computation, achieving a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-1})$. This efficiency marks a substantial improvement over existing PG methods and is the best-known rate in the literature under standard assumptions.
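To make the trajectory-reuse idea concrete, the sketch below pools trajectories gathered under several past behavior policies and reweights each one toward the current policy with a multiple-importance-weighting scheme. The power mean in the denominator is a hedged stand-in for the paper's PM correction, whose exact form is not reproduced here; with exponent s = 1 it reduces to the standard balance-heuristic mixture.

```python
import numpy as np

# Hedged sketch of a multiple-importance-weighted gradient estimate that pools
# trajectories from several past behavior policies. The power-mean denominator
# is an assumption standing in for the paper's PM correction, which may differ.

def power_mean(values, weights, s):
    """Weighted power mean with exponent s (s = 1 gives the arithmetic mixture)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (weights @ values**s) ** (1.0 / s)

def miw_pg_estimate(target_logp, behavior_logps, scores, returns, counts, s=1.0):
    """
    target_logp[i]      : log-density of trajectory i under the current policy
    behavior_logps[i,j] : log-density of trajectory i under past policy j
    scores[i]           : score function (grad of log-density) of trajectory i
    returns[i]          : cumulative reward of trajectory i
    counts[j]           : number of trajectories collected under past policy j
    """
    counts = np.asarray(counts, dtype=float)
    props = counts / counts.sum()
    grads = []
    for i in range(len(returns)):
        denom = power_mean(np.exp(behavior_logps[i]), props, s)
        w = np.exp(target_logp[i]) / denom           # importance weight toward current policy
        grads.append(w * scores[i] * returns[i])
    return np.mean(grads, axis=0)
```

Mixing all behavior policies in the denominator is what keeps the weights, and hence the estimator's variance, under control even as old trajectories are reused many times.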

Methodology and Results

The RPG algorithm is designed to fully exploit past trajectories alongside new on-policy samples when updating policy parameters. This approach is thoroughly analyzed, showing that the PM estimator bounds the estimation error effectively without incurring high variance, a common drawback of importance weighting methods. The resulting sample efficiency is validated empirically across various environments, demonstrating the algorithm's capacity to match or outperform state-of-the-art PG methods in terms of both speed and accuracy of convergence.
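The overall update loop can be organized as outlined below: every iteration collects a small fresh batch, appends it to a persistent buffer, and takes a gradient-ascent step using an estimator that reweights the entire buffer. Names such as sample_batch and estimate_gradient are hypothetical placeholders (for example, a wrapper around the MIW sketch above); this is an outline of the retrospective reuse pattern, not the authors' implementation.

```python
import numpy as np

# Outline of a retrospective-style policy-gradient loop that reuses every
# stored trajectory at each update. `sample_batch` and `estimate_gradient`
# are hypothetical callables supplied by the user; they are assumptions,
# not the paper's code.

def retrospective_pg(theta0, sample_batch, estimate_gradient,
                     n_iters=100, batch_size=10, step_size=0.01):
    theta = np.array(theta0, dtype=float)
    buffer, counts = [], []                           # all past trajectories, per-iteration batch sizes
    for _ in range(n_iters):
        buffer.extend(sample_batch(theta, batch_size))            # fresh on-policy trajectories
        counts.append(batch_size)
        grad = estimate_gradient(theta, buffer, np.array(counts)) # reuse old and new data
        theta = theta + step_size * grad              # stochastic gradient ascent step
    return theta
```

Note that the buffer grows with every iteration; this storage cost is revisited in the future-directions discussion below.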

Practical and Theoretical Implications

The implications of these findings are twofold. Practically, RPG provides a more efficient framework for training policies in reinforcement learning scenarios, especially where repeated interactions with the environment are costly or infeasible. Theoretically, the work opens new avenues for exploring variance reduction techniques that extend beyond gradient reuse. The improvement in sample complexity without reliance on large batch sizes or complex variance reduction strategies is particularly appealing for applications in robotics and other domains involving continuous control.

Future Directions

While RPG presents a notable advancement, its reliance on knowledge of variance bounds and its requirement to store previously collected trajectories may pose challenges in certain applications. Future work could explore adaptive techniques for the PM coefficients that depend on actual divergence measures rather than predetermined constants. Additionally, extending the approach to settings without the variance bound assumption and exploring dimension-free sample complexity would further enhance its generality and applicability.

The paper significantly contributes to the reinforcement learning field, offering a clear path toward more efficient PG methods through thoughtful reuse of trajectory data. By achieving the best-known sample complexity with demonstrable improvements in convergence rate and overall performance, RPG stands as a promising foundation for future work and application in diverse continuous control tasks.
