- The paper introduces PQAC, which filters out noisy TD errors via nonlinear gradient updates and pseudo quantization.
- It details a sigmoid-based optimality model combined with multi-level quantization to suppress outlier gradients in deep RL.
- Empirical evaluations on continuous control tasks demonstrate PQAC’s superior stability, performance, and computational efficiency.
Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error
Introduction and Motivation
This work introduces the Pseudo-Quantized Actor-Critic (PQAC) algorithm, a reinforcement learning (RL) framework designed to enhance robustness to noise in Temporal Difference (TD) errors. The fundamental issue addressed is the instability of learning in deep RL caused by noisy bootstrapped TD errors, especially with nonlinear function approximators. Methods such as target networks and ensemble models are widely adopted to mitigate unstable learning dynamics, but they come at a price: lower sample efficiency and significantly higher computational cost. PQAC is derived via the control as inference framework, with learning objectives and update rules that inherently filter out high-noise TD signals without the overhead and inefficiency of those heuristics.
Theoretical Contributions
Control as Inference Revisited
PQAC builds on the control as inference framework, wherein the RL objective is cast as a probabilistic inference problem, leveraging a random optimality variable and defining probabilistic value and action-value functions. Conventionally, this perspective uses an exponential mapping (often softened by temperature) for the optimality variable, but this model is not always a proper probability distribution and is sensitive to misestimation of bounds.
This work proposes, instead, a sigmoid-based model for the optimality probability, which is always bounded in [0,1] and degrades gracefully under estimation error. Formally,
p(O=1 | s) = σ(λ_O (V(s) − μ_O));  p(O=1 | s, a) = σ(λ_O (Q(s,a) − μ_O)),
where λ_O controls the sharpness of the sigmoid, and μ_O is set according to (estimated) upper and lower bounds of the possible returns.
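The sigmoid optimality model can be illustrated with a few lines of NumPy. This is only a minimal sketch: the function names are hypothetical, and placing the center at the midpoint of the return bounds is an assumption made for illustration, since the summary only states that the center is set from the estimated bounds.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function via tanh: sigma(x) = (1 + tanh(x/2)) / 2.
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

def optimality_prob(value, lam, mu):
    # p(O=1 | .) = sigma(lam * (value - mu)); value can be V(s) or Q(s, a).
    return sigmoid(lam * (np.asarray(value, dtype=float) - mu))

def midpoint_center(return_min, return_max):
    # ASSUMPTION: center the sigmoid halfway between the estimated return bounds.
    return 0.5 * (return_min + return_max)

mu = midpoint_center(-10.0, 10.0)
probs = optimality_prob(np.array([-1e6, -10.0, 0.0, 10.0, 1e6]), lam=0.5, mu=mu)
print(probs)
```

Even when a value estimate falls far outside the assumed bounds, the output stays inside [0,1], which is the property the exponential mapping lacks.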
A major strength of the proposed approach is the direct derivation of gradient update rules for both policy and value parameters, based on minimizing either the forward or reverse Kullback–Leibler (KL) divergence between the model and optimality distributions. Compared to classical actor-critic updates, the PQAC update employs a nonlinear weighting of the TD error:
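- For reverse KL (RKL): The weight is λ_O σ_V (1 − σ_V) δ, writing σ_V for the sigmoid optimality probability of the current value estimate. It approaches zero when the value estimate sits near either extreme, because σ_V (1 − σ_V) → 0 as σ_V saturates at 0 or 1.
- For forward KL (FKL): The weight is the difference σ_{V+δ} − σ_V, which saturates because both terms lie in [0,1], so that large, likely-noisy TD errors have little effect on learning.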
The analysis demonstrates that these nonlinearities induce implicit outlier rejection: when a TD error is implausibly large, the gradient with respect to that sample vanishes, reducing the impact of noise or estimation errors.
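The two weightings above can be compared numerically with the linear weight of a classical actor-critic. The sketch below assumes that σ_V denotes σ(λ_O (V − μ_O)) and σ_{V+δ} denotes σ(λ_O (V + δ − μ_O)); the function names are hypothetical and the code is not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

def vanilla_weight(td_error):
    # Classical actor-critic: the TD error enters the gradient linearly.
    return np.asarray(td_error, dtype=float)

def rkl_weight(value, td_error, lam, mu):
    # Reverse-KL style weight: lam * sigma_V * (1 - sigma_V) * delta.
    s = sigmoid(lam * (value - mu))
    return lam * s * (1.0 - s) * td_error

def fkl_weight(value, td_error, lam, mu):
    # Forward-KL style weight: sigma_{V+delta} - sigma_V, bounded in (-1, 1).
    return sigmoid(lam * (value + td_error - mu)) - sigmoid(lam * (value - mu))

deltas = np.array([0.1, 1.0, 10.0, 100.0])
print("vanilla:", vanilla_weight(deltas))
print("RKL    :", rkl_weight(0.0, deltas, lam=1.0, mu=0.0))
print("FKL    :", fkl_weight(0.0, deltas, lam=1.0, mu=0.0))
```

For small TD errors the FKL difference reduces to the RKL weight (a first-order Taylor expansion of the sigmoid), so the two rules differ mainly in how they treat large errors and saturated value estimates.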
Pseudo-Quantization via Multi-Level Optimality
The paper further extends robustness by decomposing the optimality variable into L>1 levels, effectively introducing a pseudo-quantization of the TD error through a multiplicative stack of sigmoids, each centered at different values. This approach causes the saturation (and thus rejection) property to be activated across several partitions of the TD spectrum, not just at the extremes, offering more fine-grained noise suppression. Concretely, for L+1 quantization levels, the overall gradient becomes an average over L nonlinearly weighted TD errors, each centered differently, further reducing the influence of outlier targets.
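One way to picture the pseudo-quantization effect is to average an FKL-style sigmoid difference over L differently centered sigmoids. The sketch below is an illustrative reading of the description above, not the paper's exact update: the evenly spaced centers, the shared sharpness, and the function names are all assumptions.

```python
import numpy as np

def sigmoid(x):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(x, dtype=float)))

def level_centers(v_min, v_max, n_levels):
    # ASSUMPTION: place the L centers evenly strictly inside (v_min, v_max).
    return np.linspace(v_min, v_max, n_levels + 2)[1:-1]

def multilevel_fkl_weight(value, td_error, lam, centers):
    # Average over levels of sigma_l(V + delta) - sigma_l(V).
    value = np.asarray(value, dtype=float)[..., None]
    td_error = np.asarray(td_error, dtype=float)[..., None]
    upper = sigmoid(lam * (value + td_error - centers))
    lower = sigmoid(lam * (value - centers))
    return np.mean(upper - lower, axis=-1)

centers = level_centers(0.0, 1.0, n_levels=3)  # centers at 0.25, 0.5, 0.75
deltas = np.array([0.1, 0.3, 0.6, 0.8, 10.0])
print(multilevel_fkl_weight(np.zeros_like(deltas), deltas, lam=1000.0, centers=centers))
```

With a sharp sigmoid, the weight behaves like a staircase: it counts how many centers lie between V and V + δ and divides by L, so the TD error is effectively rounded to one of L + 1 levels and any error beyond the last center contributes nothing extra.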
Jensen-Shannon Divergence for Balanced Gradient Saturation
To unify and leverage the different gradient-vanishing properties of forward and reverse KL divergences, the paper derives update rules based on the symmetric Jensen-Shannon (JS) divergence between model and optimality distributions. The resulting gradients inherit saturation from both KL types, and, after an empirically justified rescaling, present a balanced and stable nonlinear weighting over TD errors. Analytical and empirical contour plots show that these updates combine the outlier rejection capacity of RKL and FKL, but relax their overly strong nonlinearities, resulting in stable and robust updates.
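For reference, the divergences involved can be written down for two Bernoulli distributions, which is the natural setting for a binary optimality variable. This sketch only shows the standard definitions; it does not reproduce the paper's gradient derivation or its rescaling, and the function names and the clipping constant are assumptions.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-12):
    # KL(Bern(p) || Bern(q)); probabilities are clipped to avoid log(0).
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0 - eps)
    return p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))

def bernoulli_js(p, q):
    # Jensen-Shannon divergence: average KL of each distribution to their mixture.
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * bernoulli_kl(p, m) + 0.5 * bernoulli_kl(q, m)

print("KL(0.1 || 0.5):", bernoulli_kl(0.1, 0.5))
print("KL(0.5 || 0.1):", bernoulli_kl(0.5, 0.1))
print("JS(0.1, 0.5)  :", bernoulli_js(0.1, 0.5))
```

Unlike either KL direction, the JS divergence is symmetric in its arguments and bounded above by ln 2, which is consistent with the milder, more balanced saturation described above.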
Empirical Evaluation
Benchmark and Setup
Extensive simulation studies are performed on continuous control tasks (HalfCheetah, Pusher, Ant, Swimmer from MuJoCo/Gymnasium), with ablations and rigorous baselines, including vanilla actor-critic (linear TD error), MME (max-min entropy), and variants using RKL, FKL, Jeffreys, or JS update rules. Key implementation details (network architectures, normalization, optimizers, buffer, and ensemble use) are explicitly controlled. Evaluation is performed over 28 random seeds, and robustness is assessed via performance profile diagrams.
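A performance profile is commonly computed as the fraction of runs whose final score exceeds each threshold in a sweep. The sketch below assumes that common definition; the summary does not specify which variant the paper uses, and the function name and the example scores are purely illustrative.

```python
import numpy as np

def performance_profile(scores, thresholds):
    # For each threshold tau, return the fraction of runs with score > tau.
    scores = np.asarray(scores, dtype=float).ravel()
    thresholds = np.asarray(thresholds, dtype=float)
    return (scores[None, :] > thresholds[:, None]).mean(axis=1)

# Hypothetical normalized scores from four seeds (not results from the paper).
example_scores = [0.2, 0.5, 0.9, 1.0]
print(performance_profile(example_scores, thresholds=[0.0, 0.5, 1.0]))
```

A curve that stays high as the threshold grows indicates that most seeds reach strong performance, which makes the profile a more informative robustness summary than a mean over seeds.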
Ablation: Role of Quantization Level
The experiments confirm that raising the quantization level L beyond trivial binary saturation is important. For both HalfCheetah and Pusher, a multi-level setting (L > 1) yields a significant performance improvement over the binary case (L = 1), but further increases provide diminishing returns and can reduce performance due to computational inefficiency and excessive gradient suppression.
Divergence Choices and Fusion
- RKL-based learning excels in tasks where suppressing the negative effect of overestimated TD error is critical (e.g., HalfCheetah).
- FKL-based learning is superior where robust underestimation is beneficial (e.g., Pusher).
- The Jeffreys divergence, which symmetrizes RKL and FKL by combining them directly, instead exhibits the weaknesses of both, due to symmetrical cancellation.
- The JS-based approach achieves the best characteristics of RKL and FKL, leading to consistently strong or task-adaptive performance.
Robustness to Noisy TD Signals
PQAC demonstrates strong robustness in ablations that remove stabilization heuristics (ensemble, robust targets) or corrupt the reward function with noise. In challenging scenarios, such as noisy reward imitation (guided RL) and high-dimensional action spaces, PQAC either matches or outperforms state-of-the-art robust baselines (like MME), and always significantly outperforms vanilla actor-critic, whose instability and error amplification under noise are pronounced.
Notably, in high-dimensional continuous control, where value estimation is especially difficult, PQAC provides stability and enables success in Ant and Swimmer variants with large action spaces.
Practical and Theoretical Implications
PQAC offers theoretically principled, computationally efficient, and empirically verified robust RL under noisy or adversarial TD errors. The framework obviates the need for target/ensemble heuristics in many situations, which is crucial for deployment on systems with limited resources, such as robotic platforms or embedded learning tasks.
The approach's outlier rejection and robustness properties have critical implications for:
- Preference-based RL or RL from human feedback, where reward labels are inherently noisy or inconsistent.
- High-dimensional continuous control, such as advanced robotics and manipulation, where approximation error and reward ambiguity are rampant.
- Scenarios requiring large-scale exploration, suggesting future synergies with active exploration policies.
By implicitly rejecting gradients from highly noisy data, PQAC not only enhances stability but also opens avenues for more aggressive exploration and learning from weak/noisy supervision, as often encountered in real-world RL deployments.
Conclusion
The PQAC algorithm, grounded in a control-as-inference framework with robust nonlinear TD error transformation and pseudo-quantization, achieves stable and efficient learning under severe TD noise. By leveraging both multi-level gradient saturation and Jensen-Shannon divergence-based updates, PQAC generalizes and improves upon conventional heuristics, yielding robust performance across tasks with noisy value targets and high-dimensional actions. The results position PQAC as a promising base for RL in domains where noise, reward ambiguity, or computational constraint would otherwise hinder learning, with future work expected in improved upper/lower bound estimation, adaptive quantization, and integration with advanced exploration strategies.
Reference: "Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error" (2604.01613)