REPPO: Relative Entropy Pathwise Policy Optimization
- REPPO is an on-policy reinforcement learning algorithm that fuses pathwise policy gradients with KL-divergence control for low-variance, robust updates.
- It leverages recent on-policy data with multi-step TD(λ) to train action-conditioned Q-value models, enhancing efficiency and stability.
- Adaptive dual updates for entropy and KL penalties ensure controlled exploration and reliable performance in continuous control and robotic applications.
Relative Entropy Pathwise Policy Optimization (REPPO) is an on-policy reinforcement learning algorithm that fuses pathwise policy gradient techniques with explicit relative entropy (KL-divergence) control to achieve efficient, stable, and robust policy improvement. By leveraging direct value-gradient information from action-conditioned Q-value models—trained solely on recent on-policy data—REPPO achieves low-variance updates without relying on large off-policy replay buffers, while integrating mechanisms for controlled exploration and targeted policy regularization. This synthesis enables REPPO to combine the efficiency and stability characteristic of off-policy value-gradient methods with the simplicity and memory footprint of standard on-policy policy gradient pipelines (Voelcker et al., 15 Jul 2025).
1. Algorithmic Foundations and Design
REPPO is built on the insight that pathwise (deterministic) policy gradients can yield substantially lower variance than classic score-function (likelihood ratio) estimators, provided that the Q-function model is sufficiently accurate on-policy. In contrast to off-policy variants, REPPO enforces an on-policy learning regime in which Q-values are trained exclusively from the most recent trajectory rollouts using multi-step Temporal Difference (TD) methods (notably, TD(λ)).
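To make the variance contrast concrete, the following self-contained toy comparison (not taken from the paper; the quadratic objective, Gaussian policy, and all constants are illustrative assumptions) estimates the same gradient with both a score-function and a pathwise estimator:

```python
import torch

# Toy comparison of the score-function (likelihood-ratio) and pathwise (reparameterized)
# gradient estimators for a 1-D Gaussian "policy" and a smooth surrogate objective.
torch.manual_seed(0)
mu = torch.tensor(0.0, requires_grad=True)   # policy mean
std = 1.0                                    # fixed policy standard deviation
f = lambda a: -(a - 2.0) ** 2                # stand-in for a differentiable Q-function

def grad_samples(estimator, n=5000):
    grads = []
    for _ in range(n):
        if estimator == "score":
            a = torch.normal(mu.detach(), std)                    # sample with no gradient path
            logp = torch.distributions.Normal(mu, std).log_prob(a)
            g, = torch.autograd.grad(f(a) * logp, mu)             # REINFORCE-style estimate
        else:
            eps = torch.randn(())
            g, = torch.autograd.grad(f(mu + std * eps), mu)       # pathwise estimate
        grads.append(g)
    return torch.stack(grads)

print("score-function variance:", grad_samples("score").var().item())
print("pathwise variance:      ", grad_samples("pathwise").var().item())
```

Both estimators are unbiased here, but the pathwise estimate exploits the derivative of the objective and exhibits substantially lower variance, which is the property REPPO relies on when its Q-model is accurate on the current policy's data.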
The central policy improvement step employs the pathwise gradient estimator

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi}}\!\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q_\phi(s, a)\big|_{a = \pi_\theta(s)}\right],$$

where $\pi_\theta$ is the current deterministic (or near-deterministic) policy and $Q_\phi$ is the learned state-action value model. To ensure stable policy improvement, REPPO augments the policy loss with a forward KL-divergence penalty enforcing proximity to the previous policy,

$$\beta\, \mathbb{E}_{s \sim d^{\pi}}\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\right)\right],$$

with separate, jointly updated Lagrange multipliers $\alpha$ (entropy) and $\beta$ (KL) balancing exploration and conservatism.
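As a concrete illustration, the combined policy objective (pathwise value term, entropy bonus, KL penalty) can be written compactly. This PyTorch sketch assumes a reparameterizable Gaussian policy and uses illustrative names (`policy`, `old_policy`, `q_model`, `alpha`, `beta`); it is not the reference implementation:

```python
import torch

def reppo_policy_loss(policy, q_model, old_policy, states, alpha, beta):
    """Pathwise policy loss with entropy bonus and KL penalty (illustrative sketch).

    Assumes `policy(states)` returns a reparameterizable distribution (e.g. an
    Independent Normal) and `q_model(states, actions)` returns a [B]-shaped tensor.
    """
    dist = policy(states)
    actions = dist.rsample()                    # reparameterized sample: gradients flow into actions
    q_values = q_model(states, actions)         # dQ/da drives the pathwise policy gradient
    entropy = dist.entropy()                    # per-state policy entropy

    with torch.no_grad():
        old_dist = old_policy(states)           # frozen snapshot of the previous policy
    kl = torch.distributions.kl_divergence(old_dist, dist)   # forward KL: D_KL(pi_old || pi_theta)

    # Maximize Q + alpha * H - beta * KL, i.e. minimize its negation.
    loss = -(q_values + alpha * entropy - beta * kl).mean()
    return loss, entropy.detach(), kl.detach()
```

Because actions are drawn with `rsample`, the gradient of `q_values` with respect to the policy parameters passes through $\nabla_a Q_\phi$, realizing the pathwise estimator above.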
Notably, the action-conditioned value function $Q_\phi$ is trained from the current on-policy data using stable multi-step TD(λ) regression toward the recursive targets

$$G_t^{\lambda} = r_t + \gamma\,(1 - d_t)\left[(1-\lambda)\, Q_\phi(s_{t+1}, a_{t+1}) + \lambda\, G_{t+1}^{\lambda}\right],$$

with $d_t$ indicating termination.
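The recursion can be computed backward over a rollout; the sketch below assumes tensors of shape `[T, N]` for `T` steps across `N` parallel environments, and its bootstrapping conventions may differ from the released code:

```python
import torch

def td_lambda_targets(rewards, q_next, dones, gamma=0.99, lam=0.95):
    """Multi-step TD(lambda) value targets via backward recursion (illustrative sketch).

    rewards, dones: [T, N] tensors; q_next: Q(s_{t+1}, a_{t+1}) estimates of shape [T, N]
    computed with actions from the current policy.
    """
    T = rewards.shape[0]
    targets = torch.zeros_like(rewards)
    next_target = q_next[-1]                     # final step bootstraps purely from Q
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        blended = (1.0 - lam) * q_next[t] + lam * next_target
        targets[t] = rewards[t] + gamma * not_done * blended
        next_target = targets[t]
    return targets
```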
2. On-Policy Value-Gradient Learning
A key technical contribution is the demonstration that accurate surrogate Q-function models can be reliably trained from purely on-policy data using multi-step, bootstrapped value targets (TD(λ)), bolstered by auxiliary self-supervised representation losses (such as self-prediction on latent activations) and robust regression objectives (e.g., categorical HL-Gauss cross-entropy).
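For the categorical HL-Gauss objective specifically, a minimal sketch (with placeholder bin range, bin count, and smoothing width rather than the paper's settings) projects each scalar TD(λ) target onto value bins by integrating a Gaussian over each bin, then applies cross-entropy against the predicted bin logits:

```python
import torch
import torch.nn.functional as F

def hl_gauss_loss(logits, targets, v_min=-10.0, v_max=10.0, num_bins=101, sigma=0.75):
    """Categorical value regression with Gaussian histogram (HL-Gauss) targets.

    logits: [B, num_bins] bin logits from the Q-network head; targets: [B] scalar
    TD(lambda) targets. Bin range, bin count, and sigma are illustrative assumptions.
    """
    edges = torch.linspace(v_min, v_max, num_bins + 1, device=targets.device)       # bin boundaries
    cdf = torch.distributions.Normal(targets[:, None], sigma).cdf(edges[None, :])   # [B, num_bins + 1]
    probs = cdf[:, 1:] - cdf[:, :-1]                                                # Gaussian mass per bin
    probs = probs / probs.sum(dim=-1, keepdim=True).clamp_min(1e-8)                 # renormalize clipped tails
    return -(probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()              # cross-entropy
```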
This architecture avoids the instability typical in prior attempts to deploy pathwise gradients with strictly on-policy data, which were frequently undermined by insufficiently robust value function estimates and high gradient variance. REPPO's approach enables tight coupling of the value target distribution to the current policy state-visitation distribution, mitigating historical data drift.
3. Policy Regularization: Maximum Entropy and KL Control
REPPO integrates a maximum entropy objective into the policy optimization workflow,

$$J(\theta) = \mathbb{E}_{s \sim d^{\pi}}\!\left[Q_\phi\big(s, \pi_\theta(s)\big) + \alpha\, \mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\right],$$

where $\alpha$ is a tunable entropy reward weight and $\mathcal{H}(\pi_\theta(\cdot \mid s))$ is the policy entropy. This encourages persistent exploration and prevents premature collapse to deterministic action selection, a common failure mode in high-dimensional control and sparse reward regimes (Ahmed et al., 2018). Jointly, the KL-divergence penalty

$$\beta\, D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\right)$$

explicitly bounds the deviation of updates, protecting against instability due to value function misspecification.
REPPO employs adaptive gradient-based updates for both $\alpha$ and $\beta$ (the entropy and KL multipliers), driving the policy entropy toward a target $\bar{\mathcal{H}}$ and keeping the KL divergence near a bound $\bar{D}_{\mathrm{KL}}$:

$$\alpha \leftarrow \alpha + \eta_\alpha\big(\bar{\mathcal{H}} - \mathcal{H}(\pi_\theta)\big), \qquad \beta \leftarrow \beta + \eta_\beta\big(D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}\,\|\,\pi_\theta) - \bar{D}_{\mathrm{KL}}\big).$$

This adaptive dual update ensures both constraints are actively enforced throughout optimization.
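In practice these dual updates can be taken as gradient steps on log-parameterized multipliers; the sketch below uses illustrative targets and learning rate, not values from the paper:

```python
import torch

# Log-parameterization keeps both multipliers positive.
log_alpha = torch.zeros((), requires_grad=True)
log_beta = torch.zeros((), requires_grad=True)
dual_optim = torch.optim.Adam([log_alpha, log_beta], lr=3e-4)

def update_multipliers(entropy, kl, target_entropy, target_kl):
    """One dual gradient step: alpha rises when entropy falls below its target,
    beta rises when the KL to the previous policy exceeds its bound."""
    alpha_loss = log_alpha * (entropy.detach() - target_entropy)
    beta_loss = log_beta * (target_kl - kl.detach())
    dual_optim.zero_grad()
    (alpha_loss + beta_loss).backward()
    dual_optim.step()
    return log_alpha.exp().detach(), log_beta.exp().detach()
```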
4. Empirical Performance and Practical Impact
Extensive experiments across more than thirty GPU-parallelized continuous control environments confirm that REPPO learns rapidly (requiring fewer samples), converges robustly, and outperforms tuned PPO baselines in both wall-clock time and total environment interactions. Like FastTD3 (a high-throughput off-policy baseline), REPPO achieves high sample and computational efficiency, but with a substantially reduced memory footprint, as it requires no persistent replay buffer.
A distinguishing feature is consistent hyperparameter robustness: the joint adaptation of $\alpha$ and $\beta$, along with representation normalization and auxiliary tasks, yields reliable performance across domains without extensive per-environment adjustment. This property is particularly valuable for large-scale application in robotics and simulation-based RL.
5. Technical Formulation
The principal components of REPPO can be summarized as follows:
| Component | Mathematical Formulation |
|---|---|
| Pathwise Policy Gradient | $\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi}}\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q_\phi(s, \pi_\theta(s))\right]$ |
| Value Learning (TD(λ) Target) | $G_t^{\lambda} = r_t + \gamma (1 - d_t)\left[(1-\lambda)\, Q_\phi(s_{t+1}, a_{t+1}) + \lambda\, G_{t+1}^{\lambda}\right]$ |
| Maximum Entropy RL Objective | $\mathbb{E}_{s \sim d^{\pi}}\left[Q_\phi(s, \pi_\theta(s)) + \alpha\, \mathcal{H}(\pi_\theta(\cdot \mid s))\right]$ |
| KL-Constrained Policy Objective | $\mathbb{E}_{s \sim d^{\pi}}\left[Q_\phi(s, \pi_\theta(s)) + \alpha\, \mathcal{H}(\pi_\theta(\cdot \mid s)) - \beta\, D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\Vert\, \pi_\theta(\cdot \mid s))\right]$ |
| Adaptive Multiplier Updates | $\alpha \leftarrow \alpha + \eta_\alpha(\bar{\mathcal{H}} - \mathcal{H}(\pi_\theta)), \quad \beta \leftarrow \beta + \eta_\beta(D_{\mathrm{KL}}(\pi_{\theta_{\mathrm{old}}} \Vert \pi_\theta) - \bar{D}_{\mathrm{KL}})$ |
6. Applications and Related Directions
REPPO is well suited to domains where the stability and sample efficiency of value-gradient learning are crucial but the memory constraints or non-stationarity of classic off-policy learning are prohibitive. Concrete application areas include:
- Robotic control and manipulation, where real-time adaptation and bounded memory resources are critical.
- Game-playing agents in simulated or physical settings, benefiting from low-variance, exploration-robust policy updates.
- Reinforcement learning-based fine-tuning of LLMs, which requires controlled exploration and stable policy regularization over high-dimensional sequential action spaces.
REPPO’s architecture and algorithmic strategy directly align with recent trends in reinforcement learning that advocate the principled integration of information-theoretic regularization (via relative entropy), robust value-learning, and efficient on-policy optimization.
7. Connections to Broader RL Literature
REPPO builds upon and extends the methodological lineage of relative entropy regularization in RL, as explored in Maximum a Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018), Relative Entropy Regularized Policy Iteration (Abdolmaleki et al., 2018), and REPS-style convex duality (Pacchiano et al., 2021). It inherits the trust-region conceptual framework of controlling policy divergence via KL penalties, while addressing the variance–bias trade-off with contemporary pathwise gradient estimators, now rendered stable by on-policy value learning.
Significantly, REPPO diverges from fully off-policy algorithms by tightly coupling the Q-value learning and policy update to the current data distribution, yielding improved robustness under stochastic transitions, non-stationary objective landscapes, and limited data (Voelcker et al., 15 Jul 2025). This positions REPPO as a flexible, theoretically grounded reinforcement learning solution with broad practical impact.