LA3P: Loss-Adjusted Actor Prioritized Replay
- The paper introduces a decoupled sampling and loss-adjustment framework that directs high-uncertainty samples to the critic while ensuring stable actor updates.
- It combines a uniform sampling phase with prioritized (critic-only) and inverse-prioritized (actor-only) phases, applying the Prioritized Approximation Loss (PAL) in the uniform phase and the Huber loss to prioritized critic training to mitigate bias from large TD errors.
- Empirical evaluations on continuous control benchmarks show that LA3P accelerates convergence, reduces variance, and outperforms PER and uniform replay in return performance.
Loss-Adjusted Approximate Actor Prioritized Experience Replay (LA3P) is a deep reinforcement learning (RL) experience replay algorithm designed to address the limitations of Prioritized Experience Replay (PER) in continuous control settings, particularly when used with off-policy actor-critic methods such as TD3 and SAC. LA3P introduces a decoupled sampling and loss-adjustment framework that directs high-uncertainty samples toward the critic, while constraining the actor’s updates to reliable, low-error transitions. Empirical evaluations demonstrate that LA3P significantly outperforms both standard PER and uniform replay, achieving state-of-the-art results in standard continuous-control benchmarks (Saglam et al., 2022).
1. Background: TD Error and Prioritized Experience Replay
In off-policy RL, the critic network $Q_\theta$ is trained to minimize the one-step temporal-difference (TD) error over sampled transitions $(s, a, r, s')$, with target networks $Q_{\theta'}$, $\pi_{\phi'}$ and TD target

$$y = r + \gamma\, Q_{\theta'}\big(s', \pi_{\phi'}(s')\big).$$

The TD error is defined as

$$\delta = y - Q_\theta(s, a).$$

Standard PER assigns sampling probabilities proportional to $|\delta_i|^\alpha$ (plus a small constant to guarantee nonzero probability), followed by importance-sampling (IS) corrections:

$$P(i) = \frac{(|\delta_i| + \epsilon)^\alpha}{\sum_k (|\delta_k| + \epsilon)^\alpha}, \qquad w_i = \frac{\big(N \cdot P(i)\big)^{-\beta}}{\max_j \big(N \cdot P(j)\big)^{-\beta}}.$$

With this non-uniform sampling, the critic loss is modified to incorporate the IS weights:

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_i w_i\, \delta_i^2,$$

where $w_i$ are the normalized IS weights.
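As a concrete illustration, the following NumPy sketch computes the PER quantities above: priorities, sampling probabilities, and normalized IS weights. The function name and array shapes are illustrative, not drawn from any reference implementation.

```python
import numpy as np

def per_probabilities_and_weights(abs_td_errors, alpha=0.4, beta=0.4, eps=1e-6):
    """Illustrative PER sampling probabilities and normalized IS weights.

    abs_td_errors: 1-D array of |delta_i| for every transition in the buffer.
    """
    priorities = (abs_td_errors + eps) ** alpha   # p_i = (|delta_i| + eps)^alpha
    probs = priorities / priorities.sum()         # P(i) proportional to p_i
    n = len(abs_td_errors)
    weights = (n * probs) ** (-beta)              # w_i = (N * P(i))^(-beta)
    weights /= weights.max()                      # normalize so the largest weight is 1
    return probs, weights

# Example: sample a prioritized batch; the critic loss would then average
# weights[batch_idx] * delta[batch_idx]**2 over this batch.
abs_td = np.abs(np.random.randn(1000))
probs, weights = per_probabilities_and_weights(abs_td)
batch_idx = np.random.choice(len(abs_td), size=256, p=probs)
```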
Under standard PER, transitions with large TD error — corresponding to high uncertainty — are overrepresented in updates for both actor and critic.
2. Theoretical Motivation: Actor-Critic Gradient Divergence
LA3P is motivated by the observation that actor networks are adversely affected by transitions with high TD error. A large TD error indicates significant critic estimation error regarding either the current value $Q_\theta(s, a)$ or the bootstrapped future value of the policy, so large TD errors reliably localize critic inaccuracies. This error propagates directly to the policy gradient, introducing substantial bias.

For a deterministic policy (as in TD3), the policy gradient employed by the actor is

$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim \mathcal{B}}\Big[\nabla_a Q_\theta(s, a)\big|_{a = \pi_\phi(s)}\, \nabla_\phi \pi_\phi(s)\Big].$$

Because this gradient depends on the critic only through $\nabla_a Q_\theta(s, a)$, error in $Q_\theta$ corrupts the update direction. Consequently, training the actor on transitions with large TD error leads to unreliable and potentially detrimental policy updates (Saglam et al., 2022).
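A minimal PyTorch sketch of this actor update follows, assuming a TD3-style deterministic actor and a single Q-network; all module names and dimensions are illustrative. It makes explicit that the update direction comes entirely from the critic, which is why critic error on the sampled states biases the actor.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Minimal Q-network: Q_theta(s, a) -> scalar value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                                 nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def actor_update(actor, critic, actor_optimizer, states):
    """Deterministic policy-gradient step: the update direction depends on the
    critic only through grad_a Q_theta(s, a), so critic error on these states
    biases the actor directly."""
    actions = actor(states)                       # a = pi_phi(s)
    actor_loss = -critic(states, actions).mean()  # ascend Q_theta(s, pi_phi(s))
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return actor_loss.item()

# Example usage with toy dimensions.
state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = Critic(state_dim, action_dim)
opt = torch.optim.Adam(actor.parameters())
actor_update(actor, critic, opt, torch.randn(256, state_dim))
```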
3. LA3P Framework: Priority Weighting and Loss Adjustment
LA3P systematically decouples critic and actor sampling, using distinct priority schemes and loss corrections (a code sketch of these rules follows the list):
- Uniform Sampling ($\lambda N$ transitions): Both actor and critic are trained together on a uniformly sampled fraction $\lambda$ of the batch, with the critic using the Prioritized Approximation Loss (PAL) to avoid bias from outlier TD errors.
- Prioritized Critic-Only Sampling ($(1 - \lambda) N$ transitions): The critic is trained on transitions sampled proportionally to priority, $P(i) = p_i / \sum_k p_k$ with $p_i = \max(|\delta_i|^\alpha, 1)$, using the Huber loss (threshold 1) as in the LAP scheme.
- Inverse-Prioritized Actor-Only Sampling ($(1 - \lambda) N$ transitions): The actor is trained on the transitions with the lowest priorities (i.e., lowest TD error), sampled with probability $\tilde{P}(i) \propto 1 / P(i)$, normalized over the buffer.

The decoupled strategy ensures that the critic can focus on difficult (uncertain) transitions, accelerating error reduction, while the actor benefits from reliable gradients derived from transitions for which the critic is accurate.
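The sketch below illustrates the LAP-style priorities and the PAL loss used in the uniform phase, following the LAP/PAL formulation of Fujimoto et al. (2020) on which LA3P builds; the exact normalization in the authors' implementation may differ, and the function names are illustrative.

```python
import torch

def lap_priorities(td_errors, alpha=0.4):
    """LAP-style priorities for prioritized critic sampling:
    p_i = max(|delta_i|^alpha, 1), so small errors never vanish from the buffer."""
    return torch.clamp(td_errors.abs() ** alpha, min=1.0)

def pal_loss(td_errors, alpha=0.4):
    """Prioritized Approximation Loss for a uniformly sampled batch.

    Quadratic for |delta| <= 1 and |delta|^(1+alpha)/(1+alpha) beyond that,
    rescaled by the batch mean of max(|delta|^alpha, 1) so that the expected
    gradient matches LAP-style prioritized sampling under a uniform batch.
    """
    abs_td = td_errors.abs()
    elementwise = torch.where(
        abs_td <= 1.0,
        0.5 * td_errors ** 2,
        abs_td ** (1.0 + alpha) / (1.0 + alpha),
    )
    scale = lap_priorities(td_errors, alpha).mean().detach()
    return elementwise.mean() / scale

# Example: critic TD errors for a uniform batch -> scalar PAL loss.
td = torch.randn(256)
loss = pal_loss(td)
```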
4. Algorithmic Workflow
The LA3P procedure proceeds as follows (a sampling sketch in code is given after the summary table below):
- Initialization: Actor $\pi_\phi$, critic $Q_\theta$, target networks $\pi_{\phi'}$ and $Q_{\theta'}$, replay buffer $\mathcal{B}$, initial priorities set to 1.
- Experience Collection: Store each transition $(s, a, r, s')$ in $\mathcal{B}$ with maximal priority.
- Training Iteration: Each training step involves:
- Uniform Phase: Sample $\lambda N$ transitions uniformly; train actor and critic using PAL; refresh priorities.
- Prioritized Critic Phase: Sample $(1 - \lambda) N$ transitions by priority; update critic (Huber loss); refresh priorities.
- Inverse-Prioritized Actor Phase: Sample $(1 - \lambda) N$ transitions with inverse priority; update actor; no priority refresh.
- Target Soft Updates: Polyak averaging updates for target network parameters occur after each phase as required.
Priorities are updated as

$$p_i \leftarrow \max\big(|\delta_i|^\alpha, 1\big)$$

after each critic update.
| Phase | Sampling Distribution | Loss Function |
|---|---|---|
| Uniform (Actor & Critic) | Uniform over buffer | PAL (critic) + policy gradient (actor) |
| Prioritized (Critic) | Proportional to $p_i = \max(\lvert\delta_i\rvert^\alpha, 1)$ | Huber (critic only) |
| Inverse (Actor) | Proportional to $1 / P(i)$ | Policy gradient (actor only) |
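To make the phase structure concrete, the following NumPy sketch draws the three index sets used in one LA3P training step from the buffer priorities; the function and variable names are illustrative, and the network updates and priority refreshes themselves are omitted.

```python
import numpy as np

def la3p_sample_indices(priorities, batch_size=256, lam=0.5):
    """Draw the three LA3P index sets from the buffer priorities p_i.

    Returns (uniform_idx, critic_idx, actor_idx):
      - uniform_idx: lam * batch_size indices, uniform over the buffer
        (actor + critic update, PAL loss);
      - critic_idx: remaining indices drawn proportionally to p_i
        (critic-only update, Huber loss, priorities refreshed afterwards);
      - actor_idx: the same number of indices drawn proportionally to 1 / P(i)
        (actor-only update on low-TD-error transitions, no priority refresh).
    """
    n = len(priorities)
    n_uniform = int(lam * batch_size)
    n_prioritized = batch_size - n_uniform

    probs = priorities / priorities.sum()      # P(i) for the critic phase
    inverse = 1.0 / probs
    inverse_probs = inverse / inverse.sum()    # inverse-prioritized distribution for the actor phase

    uniform_idx = np.random.randint(0, n, size=n_uniform)
    critic_idx = np.random.choice(n, size=n_prioritized, p=probs)
    actor_idx = np.random.choice(n, size=n_prioritized, p=inverse_probs)
    return uniform_idx, critic_idx, actor_idx

# Example with LAP-style priorities p_i = max(|delta_i|^alpha, 1):
priorities = np.maximum(np.abs(np.random.randn(10_000)) ** 0.4, 1.0)
u_idx, c_idx, a_idx = la3p_sample_indices(priorities)
```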
5. Hyperparameterization and Scheduling
Key parameters for LA3P include (collected in the configuration sketch at the end of this section):
- Priority exponent $\alpha$: 0.4 (controls the impact of the TD error on the priority).
- IS-correction exponent $\beta$: annealed from 0.4 to 1.0.
- Uniform fraction $\lambda$: 0.5.
- Huber loss threshold: 1.
- Learning rates (actor and critic): inherited from the underlying TD3/SAC baselines.
- Polyak update rate $\tau$: 0.005.
- Batch size $N$: 256.
- Initial priority: 1.
- Discount factor $\gamma$: 0.99.
- Exploration steps: 25,000 initial random actions.
- TD3 target-policy noise $\sigma$: 0.2, with noise clipping (SAC instead uses automatic entropy tuning).
Hyperparameter analysis indicates that $\lambda = 0.5$ balances stability and sample efficiency; extremes ($\lambda = 0.1$ or $\lambda = 0.9$) revert performance to PER-like or purely uniform baselines.
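For convenience, these defaults can be collected into a single configuration object, as sketched below; the field names are illustrative rather than taken from the authors' code, and learning rates are deferred to the underlying TD3/SAC implementation.

```python
from dataclasses import dataclass

@dataclass
class LA3PConfig:
    """Illustrative bundle of the LA3P defaults listed above."""
    alpha: float = 0.4               # priority exponent
    beta_start: float = 0.4          # IS-correction exponent, annealed ...
    beta_end: float = 1.0            # ... to 1.0 over training
    uniform_fraction: float = 0.5    # lambda: share of uniformly sampled transitions
    huber_threshold: float = 1.0
    tau: float = 0.005               # Polyak averaging rate for target networks
    batch_size: int = 256
    initial_priority: float = 1.0
    gamma: float = 0.99              # discount factor
    exploration_steps: int = 25_000  # initial random-action steps
    td3_policy_noise: float = 0.2    # TD3 target-policy smoothing (SAC: entropy tuning)

config = LA3PConfig()
```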
6. Empirical Results and Evaluation
LA3P was evaluated using TD3 and SAC across eight continuous-control environments (MuJoCo: Ant, HalfCheetah, Hopper, Humanoid, Walker2d; Box2D: Swimmer, BipedalWalker, LunarLanderContinuous), each for 1 million steps over 10 random seeds. Principal findings:
- Final Performance: LA3P outperforms the baseline methods on nearly all benchmarks. In HalfCheetah it attains substantially higher return than both PER and uniform replay, and in Swimmer it is the only method that makes consistent progress, finishing well above the uniform baseline.
- Learning Curves: LA3P demonstrates accelerated convergence, higher final returns, and reduced variance; across environments, uniform sampling generally outperforms PER, while LA3P exceeds both.
- Ablation Study: Omitting any LA3P component (inverse prioritization, the shared uniform batch, or the PAL/LAP loss) substantially degrades performance. Sensitivity analysis of $\lambda$ confirms that balanced uniform sampling is essential for stability and effectiveness.
The empirical data support the theoretical premise that decoupling actor and critic update distributions and adjusting loss functions according to sample uncertainty leads to significant improvements in both the efficiency and efficacy of off-policy actor-critic algorithms (Saglam et al., 2022).
7. Context and Implications
LA3P introduces a new branch of prioritized sampling strategies for continuous action RL, overcoming longstanding deficits of PER in actor-critic settings. These findings challenge earlier assumptions that broad prioritization is uniformly beneficial, instead highlighting the necessity to tailor transition selection to the specific requirements of actor versus critic learning processes.
A plausible implication is that further granularity in sample selection, potentially leveraging state- or task-dependent measures of uncertainty, may offer additional gains. The paradigm also underscores the importance of robust loss design (e.g., PAL) and carefully scheduled uniform mixing to maintain learning stability when replay-based prioritization is employed.
LA3P’s framework sets a precedent for experience replay approaches that explicitly recognize and account for the different sources of error and learning objectives present in actor-critic architectures.