
LA3P: Loss-Adjusted Actor Prioritized Replay

  • The paper introduces a decoupled sampling and loss-adjustment framework that directs high-uncertainty samples to the critic while ensuring stable actor updates.
  • It combines uniform sampling with prioritized and inverse-prioritized phases, using PAL for actor updates and Huber loss for critic training to mitigate bias from high TD errors.
  • Empirical evaluations on continuous control benchmarks show that LA3P accelerates convergence, reduces variance, and outperforms PER and uniform replay in return performance.

Loss-Adjusted Approximate Actor Prioritized Experience Replay (LA3P) is a deep reinforcement learning (RL) experience replay algorithm designed to address the limitations of Prioritized Experience Replay (PER) in continuous control settings, particularly when used with off-policy actor-critic methods such as TD3 and SAC. LA3P introduces a decoupled sampling and loss-adjustment framework that directs high-uncertainty samples toward the critic, while constraining the actor’s updates to reliable, low-error transitions. Empirical evaluations demonstrate that LA3P significantly outperforms both standard PER and uniform replay, achieving state-of-the-art results in standard continuous-control benchmarks (Saglam et al., 2022).

1. Background: TD Error and Prioritized Experience Replay

In off-policy RL, the critic network $Q_{\theta}$ is trained to minimize the one-step temporal-difference (TD) error over sampled transitions $\tau=(s,a,r,s')$, with target action $a'\sim\pi_{\phi'}(s')$ and TD target

$$y(\tau)=r+\gamma\,Q_{\theta'}(s',a').$$

The TD error is defined as

$$\delta_{\theta}(\tau)=y(\tau)-Q_{\theta}(s,a).$$

Standard PER assigns sampling probabilities $p(\tau_i)$ proportional to $|\delta_{\theta}(\tau_i)|^{\alpha}$ (plus a constant $\mu$ to ensure nonzero probability), followed by importance-sampling (IS) corrections:

$$p(\tau_i)=\frac{|\delta_{\theta}(\tau_i)|^{\alpha}+\mu}{\sum_{j=1}^{M}\left(|\delta_{\theta}(\tau_j)|^{\alpha}+\mu\right)}.$$

With this non-uniform sampling, the critic loss is modified to incorporate IS weights:

$$\mathcal{L}_{\mathrm{PER}}(\theta)=\frac{1}{N}\sum_{i\in\mathcal{B}}\widehat{w}_i\,\big[\delta_{\theta}(\tau_i)\big]^2,$$

where $\widehat{w}_i$ are the normalized IS weights.
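
A minimal sketch of these two steps (priority-proportional sampling, then the IS-weighted critic loss), using a flat NumPy priority array rather than the sum-tree used in practical implementations; the weight form $w_i=(M\,p_i)^{-\beta}$, normalized by its maximum, is the standard PER convention rather than anything specific to this paper:

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.4, beta=0.4, mu=1e-6, seed=None):
    """Sample transition indices proportional to (|delta_i|^alpha + mu)
    and return the normalized importance-sampling weights."""
    rng = np.random.default_rng(seed)
    priorities = np.abs(td_errors) ** alpha + mu        # |delta_i|^alpha + mu
    probs = priorities / priorities.sum()               # p(tau_i)
    idx = rng.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[idx]) ** (-beta)  # w_i = (M * p_i)^(-beta)
    return idx, weights / weights.max()                 # normalized IS weights

def per_critic_loss(td_errors_batch, weights):
    """IS-weighted squared TD error: (1/N) * sum_i w_i * delta_i^2."""
    return np.mean(weights * td_errors_batch ** 2)
```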

Under standard PER, transitions with large TD error — corresponding to high uncertainty — are overrepresented in updates for both actor and critic.

2. Theoretical Motivation: Actor-Critic Gradient Divergence

LA3P is motivated by the observation that actor networks are adversely affected by high TD error transitions. Large errors suggest significant critic estimation error regarding either the current or future value of the policy. This propagates directly to the policy gradient, introducing substantial bias.

Given

$$\delta_{\theta}(\tau_t) = r_t + \gamma\, Q_{\theta}(s_{t+1},a_{t+1}) - Q_{\theta}(s_t,a_t)$$

and

$$|\delta_{\theta}(\tau_t)| \propto |Q_{\theta}(s_i,a_i)-Q^{\pi}(s_i,a_i)| \quad \text{for } i=t \text{ or } t+1,$$

large TD errors reliably localize critic inaccuracies.

The policy gradient employed by the actor is

$$\nabla_{\phi} J \propto \mathbb{E}_{(s,a)\sim\mathcal{D}} \left[\frac{\partial\log\pi_{\phi}(a\mid s)}{\partial\phi}\; Q_{\theta}(s,a)\right].$$

The error in $Q_\theta$ corrupts the gradient:

$$\big|\nabla_\phi^{\mathrm{approx}}-\nabla_\phi^{\mathrm{true}}\big| \propto |\delta_{\theta}(\tau_i)|.$$

Consequently, training the actor on transitions with large TD error leads to unreliable and potentially detrimental policy updates (Saglam et al., 2022).
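
To see the connection in one step, subtract the gradient evaluated with the true action-value function $Q^{\pi}$ from the gradient evaluated with the learned critic $Q_\theta$ on the same sampled pairs; only the critic's estimation error, which the relation above ties to the TD error, remains inside the expectation:

$$\nabla_\phi^{\mathrm{approx}}-\nabla_\phi^{\mathrm{true}} \propto \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\frac{\partial\log\pi_{\phi}(a\mid s)}{\partial\phi}\,\big(Q_{\theta}(s,a)-Q^{\pi}(s,a)\big)\right].$$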

3. LA3P Framework: Priority Weighting and Loss Adjustment

LA3P systematically decouples critic and actor sampling, using distinct priority schemes and loss corrections:

  • Uniform Sampling ($\lambda N$ transitions): Both actor and critic are trained together on uniformly sampled transitions ($\lambda=0.5$), using the Prioritized Approximate Loss (PAL) to avoid outlier bias:

$$\mathcal{L}_{\mathrm{PAL}}(\delta_i)=
\begin{cases}
\tfrac{1}{2}\,\delta_i^2 & |\delta_i|\le 1 \\
\dfrac{|\delta_i|^{1+\alpha}}{1+\alpha} & |\delta_i|>1
\end{cases}$$

  • Prioritized Critic-Only Sampling ($(1-\lambda)N$ transitions): The critic is trained on transitions sampled proportionally to priority,

$$p(\tau_i)=\max\!\left(|\delta_i|^{\alpha},\,1\right),$$

with the Huber loss ($\kappa=1$).

  • Inverse-Prioritized Actor-Only Sampling ($(1-\lambda)N$ transitions): The actor is trained on the transitions with the lowest priorities (i.e., lowest TD error), sampled with probability

$$\widetilde{p}(\tau_i) = \frac{1/p(\tau_i)}{\sum_j 1/p(\tau_j)}.$$

The decoupled strategy ensures that the critic can focus on difficult (uncertain) transitions, accelerating error reduction, while the actor benefits from reliable gradients derived from transitions for which the critic is accurate.
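
A minimal sketch of the two loss functions and the priority rule above, written in NumPy; the function names are illustrative, not taken from the authors' implementation:

```python
import numpy as np

def pal_loss(delta, alpha=0.4):
    """Prioritized Approximate Loss: quadratic for |delta| <= 1,
    |delta|^(1+alpha) / (1+alpha) for |delta| > 1."""
    a = np.abs(delta)
    return np.where(a <= 1.0, 0.5 * a ** 2, a ** (1.0 + alpha) / (1.0 + alpha))

def huber_loss(delta, kappa=1.0):
    """Huber loss with threshold kappa, used in the prioritized critic-only phase."""
    a = np.abs(delta)
    return np.where(a <= kappa, 0.5 * a ** 2, kappa * (a - 0.5 * kappa))

def priority(delta, alpha=0.4):
    """Transition priority p(tau_i) = max(|delta_i|^alpha, 1)."""
    return np.maximum(np.abs(delta) ** alpha, 1.0)

def inverse_priority_probs(p):
    """Actor-only sampling distribution, proportional to 1/p(tau_i)."""
    inv = 1.0 / p
    return inv / inv.sum()
```

Because priorities are clipped below at 1, all transitions with $|\delta_i|^{\alpha}\le 1$ receive the same, maximal weight under the inverse distribution.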

4. Algorithmic Workflow

The LA3P procedure proceeds as follows:

  1. Initialization: Actor $\pi_\phi$, critic $Q_\theta$, target networks $\phi',\theta'$, replay buffer $R$, priorities $p_i=1$.
  2. Experience Collection: Store each transition $(s,a,r,s')$ in $R$ with $p=1$.
  3. Training Iteration: Each training step involves:
    • Uniform Phase: Sample $\lambda N$ transitions uniformly; train actor and critic using PAL; refresh priorities.
    • Prioritized Critic Phase: Sample $(1-\lambda)N$ transitions by priority; update critic (Huber loss); refresh priorities.
    • Inverse-Prioritized Actor Phase: Sample $(1-\lambda)N$ transitions with inverse priority; update actor; no priority refresh.
    • Target Soft Updates: Polyak averaging updates for target network parameters occur after each phase as required.

Priorities are updated as

$$p(\tau_i)=\max\!\left(|\delta_{\theta}(\tau_i)|^{\alpha},\,1\right)$$

after each critic update.

Phase                   Sampling Distribution             Loss Function
Uniform (Both)          Uniform over buffer               PAL (actor & critic)
Prioritized (Critic)    Proportional to $p(\tau_i)$       Huber (critic only)
Inverse (Actor)         Proportional to $1/p(\tau_i)$     Policy Gradient (actor)
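
The table above maps directly onto how a single batch is assembled. A minimal sketch of that index selection, assuming priorities are held in a flat NumPy array (real implementations use sum-tree structures for efficiency):

```python
import numpy as np

def la3p_batch_indices(td_errors, batch_size=256, lam=0.5, alpha=0.4, seed=None):
    """Return the index sets drawn in one LA3P step: a shared uniform batch
    (actor + critic, PAL), a prioritized critic-only batch (Huber), and an
    inverse-prioritized actor-only batch (policy gradient)."""
    rng = np.random.default_rng(seed)
    p = np.maximum(np.abs(td_errors) ** alpha, 1.0)  # p(tau_i) = max(|delta_i|^alpha, 1)

    n_uniform = int(lam * batch_size)
    n_rest = batch_size - n_uniform

    uniform_idx = rng.integers(0, len(p), size=n_uniform)
    critic_idx = rng.choice(len(p), size=n_rest, p=p / p.sum())
    inv = 1.0 / p
    actor_idx = rng.choice(len(p), size=n_rest, p=inv / inv.sum())
    return uniform_idx, critic_idx, actor_idx
```

Priorities would then be refreshed from the new TD errors for `uniform_idx` and `critic_idx`, but left untouched for `actor_idx`, matching the workflow above.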

5. Hyperparameterization and Scheduling

Key parameters for LA3P include:

  • Priority exponent $\alpha$: 0.4 (controls the impact of the TD error on priority).
  • IS-correction exponent $\beta$: annealed from 0.4 to 1.0.
  • Uniform fraction $\lambda$: 0.5.
  • Huber threshold $\kappa$: 1.
  • Learning rates $\eta_\pi,\eta_Q$: $3\times10^{-4}$.
  • Polyak update rate $\zeta$: 0.005.
  • Batch size $N$: 256.
  • Initial priority $p_{\mathrm{init}}$: 1.
  • Discount $\gamma$: 0.99.
  • Exploration steps: 25,000 initial random actions.
  • TD3 policy noise $\sigma_N$: 0.2, clipped to $[-0.5,0.5]$ (SAC uses entropy tuning).

Hyperparameter analysis indicates that $\lambda=0.5$ balances stability and sample efficiency; extreme values ($0.1$ or $0.9$) revert performance to PER-like or purely uniform baselines.
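
For reference, the same settings collected as a single configuration sketch (key names are illustrative, not taken from the released code):

```python
LA3P_CONFIG = {
    "alpha": 0.4,              # priority exponent
    "beta": (0.4, 1.0),        # IS-correction exponent, annealed from 0.4 to 1.0
    "lambda": 0.5,             # uniform fraction of each batch
    "kappa": 1.0,              # Huber threshold
    "lr_actor": 3e-4,          # eta_pi
    "lr_critic": 3e-4,         # eta_Q
    "polyak": 0.005,           # zeta, target-network soft-update rate
    "batch_size": 256,         # N
    "initial_priority": 1.0,   # p_init
    "gamma": 0.99,             # discount factor
    "exploration_steps": 25_000,
    "td3_policy_noise": 0.2,   # clipped to [-0.5, 0.5]; SAC uses entropy tuning instead
}
```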

6. Empirical Results and Evaluation

LA3P was evaluated using TD3 and SAC across eight continuous-control environments (MuJoCo: Ant, HalfCheetah, Hopper, Humanoid, Swimmer, Walker2d; Box2D: BipedalWalker, LunarLanderContinuous), each for 1 million steps over 10 random seeds. Principal findings:

  • Final Performance: LA3P outperforms baseline methods in nearly all benchmarks. In HalfCheetah, it yields ${\sim}50\%$ greater return than PER and ${\sim}10\%$ greater than uniform replay. In Swimmer, LA3P is the only method that makes consistent progress, achieving roughly $2\times$ the return of uniform sampling.
  • Learning Curves: LA3P demonstrates accelerated convergence, higher final returns, and reduced variance, with uniform sampling outperforming PER and LA3P exceeding both.
  • Ablation Study: Omitting any LA3P component (inverse prioritization, shared uniform batch, or PAL/LAP loss) substantially degrades performance. Sensitivity to $\lambda$ confirms that balanced uniform sampling is essential for stability and effectiveness.

The empirical data support the theoretical premise that decoupling actor and critic update distributions and adjusting loss functions according to sample uncertainty leads to significant improvements in both the efficiency and efficacy of off-policy actor-critic algorithms (Saglam et al., 2022).

7. Context and Implications

LA3P introduces a new branch of prioritized sampling strategies for continuous action RL, overcoming longstanding deficits of PER in actor-critic settings. These findings challenge earlier assumptions that broad prioritization is uniformly beneficial, instead highlighting the necessity to tailor transition selection to the specific requirements of actor versus critic learning processes.

A plausible implication is that further granularity in sample selection, potentially leveraging state- or task-dependent measures of uncertainty, may offer additional gains. The paradigm also underscores the importance of robust loss design (e.g., PAL) and carefully scheduled uniform mixing to maintain learning stability when replay-based prioritization is employed.

LA3P’s framework sets a precedent for experience replay approaches that explicitly recognize and account for the different sources of error and learning objectives present in actor-critic architectures.

References

Saglam, B., Mutlu, F. B., Cicek, D. C., & Kozat, S. S. (2022). Actor Prioritized Experience Replay.
