Truncated Quantile Critics (TQC) Algorithm

Updated 6 August 2025
  • TQC is a distributional reinforcement learning algorithm that uses quantile regression, tail truncation, and ensembling to address overestimation bias in continuous control tasks.
  • It computes value targets by systematically discarding the highest quantile estimates from pooled ensemble critics to ensure stable learning.
  • Empirical results demonstrate up to 30% performance improvements and robust handling of stochastic, non-linear dynamics across various applications.

Truncated Quantile Critics (TQC) is a distributional, off-policy reinforcement learning (RL) method developed to tackle overestimation bias in value-based RL algorithms, particularly in continuous control domains. TQC combines distributional value estimation via quantile regression, systematic truncation of value distribution tails, and ensemble learning to produce robust and accurate value targets for policy improvement. The approach outperforms previous state-of-the-art algorithms on a range of control benchmarks and has demonstrated strong empirical performance in both simulated and real-world applications.

1. Theoretical Foundations and Motivation

The TQC algorithm is motivated by two principal observations in RL: (1) standard value-based algorithms, such as Q-learning and its deep extensions, tend to suffer from overestimation bias, which can decrease performance and stability, especially in the presence of stochasticity or function approximation; and (2) capturing the full return distribution with quantile-based critics provides a richer model of uncertainty and is amenable to fine-tuned bias correction.

Rather than estimating a scalar expected return $Q(s,a)$, TQC approximates the return distribution $Z^\pi(s,a)$ as a finite mixture of quantiles. For a continuous-action RL agent, the Q-function is represented by a set of $M$ quantile locations per critic (also called "atoms"). This quantile representation enables both aleatoric uncertainty modeling and flexible manipulation of the value distribution's tails for bias control.
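
The following is a minimal PyTorch sketch of this representation: an ensemble of $N$ critics, each emitting $M$ atoms $\theta_n^m(s,a)$ per state-action pair. Class names, layer sizes, and defaults are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """One critic: maps (state, action) to M quantile atoms theta^m(s, a)."""
    def __init__(self, state_dim: int, action_dim: int, n_atoms: int = 25, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_atoms),  # M atoms per state-action pair
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))  # shape (batch, M)

class CriticEnsemble(nn.Module):
    """N independent critics; together they supply the N x M pooled atoms."""
    def __init__(self, state_dim: int, action_dim: int, n_critics: int = 5, n_atoms: int = 25):
        super().__init__()
        self.critics = nn.ModuleList(
            [QuantileCritic(state_dim, action_dim, n_atoms) for _ in range(n_critics)]
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Stack per-critic outputs into shape (batch, N, M).
        return torch.stack([c(state, action) for c in self.critics], dim=1)
```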

2. Core Algorithmic Components

TQC blends three mechanisms to ensure precise overestimation control:

  • Distributional Representation: Each critic $n = 1, \ldots, N$ predicts $M$ quantile values (atoms) $\{\theta_n^m(s,a)\}_{m=1}^M$; the joint ensemble therefore provides $N \times M$ atoms per state-action pair.
  • Tail Truncation: To reduce overestimated value targets, TQC pools all $N \times M$ target atoms, sorts them, and discards the largest ones, with a truncation budget of $d$ atoms per critic. Only the $kN$ smallest atoms ($k = M - d$) are retained before target computation. The degree of truncation provides direct control over optimism in value estimation.
  • Ensembling: Instead of simply taking the minimum over multiple critics (as in TD3), TQC aggregates all atoms, sorts them, and performs truncation across the pooled ensemble. This aggregation further dampens erratic individual predictions and improves robustness.
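
The pooling-and-truncation operator that ties these mechanisms together can be sketched in a few lines. The helper below (PyTorch; the function name and tensor shapes are assumptions) keeps the $kN$ smallest of the pooled $N \times M$ target atoms.

```python
import torch

def truncate_pooled_atoms(target_atoms: torch.Tensor, drop_per_critic: int) -> torch.Tensor:
    """Pool, sort, and truncate target atoms as in TQC.

    target_atoms:    (batch, N, M) atoms produced by the N target critics.
    drop_per_critic: d, the truncation budget per critic.
    Returns the kN smallest pooled atoms, k = M - d, with shape (batch, kN).
    """
    batch, n_critics, n_atoms = target_atoms.shape
    pooled = target_atoms.reshape(batch, n_critics * n_atoms)   # pool all N*M atoms
    sorted_atoms, _ = torch.sort(pooled, dim=-1)                # ascending order
    keep = (n_atoms - drop_per_critic) * n_critics              # kN atoms retained
    return sorted_atoms[:, :keep]                               # drop the dN largest
```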

The mathematical formulation is as follows:

  1. Value Distribution:

$$Z_{\psi_n}(s, a) = \frac{1}{M}\sum_{m=1}^M \delta\big(\theta_n^m(s, a)\big)$$

where $\delta(\cdot)$ is the Dirac delta function and $\theta_n^m(s, a)$ are the quantile predictions.

  2. Target Computation (see the code sketch after this list):

    • Pool all $N \times M$ target atoms from the $N$ target critics at the next state-action pair $(s', a')$, sort them, and discard the $dN$ largest atoms ($d$ per critic).
    • For each remaining atom, the target is

    $$y_i(s,a) = r(s, a) + \gamma \Big[z_{(i)}(s', a') - \alpha \log \pi_{\phi}(a' \mid s')\Big]$$

    for $i = 1, \ldots, kN$, where $z_{(i)}$ are the sorted target atoms.

  3. Critic Loss:

    • Quantile regression with the asymmetric Huber loss $\rho_\tau^H$:

    $$J_Z(\psi_n) = \mathbb{E}_{\mathcal{D},\pi}\left[\frac{1}{kNM}\sum_{m=1}^M\sum_{i=1}^{kN} \rho_{\tau_m}^H\big(y_i(s,a) - \theta_n^m(s, a)\big)\right]$$

    • Here $\mathcal{D}$ is the replay buffer, $\tau_m$ are the quantile fractions, and $\psi_n$ are the parameters of critic $n$.

  4. Actor Update:

    • The policy is updated by minimizing $J_\pi(\phi)$, i.e., maximizing the mean over all $N \times M$ (non-truncated) critic atoms together with an entropy bonus:

    $$J_\pi(\phi) = \mathbb{E}_{\mathcal{D},\pi}\left[\alpha \log \pi_\phi(a \mid s) - \frac{1}{NM}\sum_{n=1}^N\sum_{m=1}^M \theta_n^m(s,a)\right]$$
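
A hedged sketch of how these update equations could be implemented in plain PyTorch is shown below; it reuses `truncate_pooled_atoms` from the earlier sketch, and the actor/critic interfaces, helper names, and temperature handling are assumptions rather than the paper's code.

```python
import torch

# Assumes truncate_pooled_atoms(...) from the earlier sketch is in scope, and that
# actor.sample(s) returns (action, log_prob) with log_prob of shape (batch,),
# as in SAC-style agents; r and done are float tensors of shape (batch,).

def quantile_huber_loss(atoms: torch.Tensor, targets: torch.Tensor,
                        taus: torch.Tensor, kappa: float = 1.0) -> torch.Tensor:
    """Asymmetric Huber quantile-regression loss rho_tau^H for one critic.

    atoms:   (batch, M)  quantile predictions theta^m(s, a)
    targets: (batch, kN) truncated target atoms y_i(s, a)
    taus:    (M,)        quantile fractions tau_m, e.g. (2m - 1) / (2M)
    """
    # Pairwise TD errors delta_{i,m} = y_i - theta^m, shape (batch, kN, M).
    td = targets.unsqueeze(-1) - atoms.unsqueeze(1)
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))
    # Asymmetric quantile weighting |tau_m - 1{delta < 0}|.
    weight = (taus.view(1, 1, -1) - (td.detach() < 0).float()).abs()
    return (weight * huber).mean()

def tqc_losses(critics, target_critics, actor, batch, gamma, alpha, d, taus):
    """Critic and actor losses for one minibatch (sketch only)."""
    s, a, r, s_next, done = batch

    # Target: truncated mixture of target-critic atoms with entropy correction.
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        next_atoms = target_critics(s_next, a_next)       # (batch, N, M)
        z = truncate_pooled_atoms(next_atoms, d)          # (batch, kN) smallest atoms
        y = r.unsqueeze(-1) + gamma * (1.0 - done.unsqueeze(-1)) * (
            z - alpha * logp_next.unsqueeze(-1))          # targets y_i(s, a)

    # Critic loss: quantile Huber regression of every critic onto the targets.
    atoms = critics(s, a)                                 # (batch, N, M)
    critic_loss = sum(quantile_huber_loss(atoms[:, n], y, taus)
                      for n in range(atoms.shape[1]))

    # Actor loss: entropy term minus the mean over all N*M non-truncated atoms.
    a_pi, logp_pi = actor.sample(s)
    q_pi = critics(s, a_pi).mean(dim=(1, 2))              # (batch,)
    actor_loss = (alpha * logp_pi - q_pi).mean()
    return critic_loss, actor_loss
```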

3. Performance Evaluation and Empirical Results

TQC has demonstrated performance improvements over prior RL algorithms on continuous control tasks, notably outperforming competing techniques on MuJoCo and other control benchmarks. Empirical findings include:

  • Significant improvements in final episodic return, up to 30% in certain environments.
  • Superior performance on the challenging Humanoid benchmark (25% improvement over state-of-the-art), nearly doubling "reward per step until the agent falls."
  • Reduced bias and variance in single-state MDP settings versus TD3 (minimum operator) and average-based methods.
  • Robustness established through ablation studies, with both truncation and ensembling independently contributing to stability and return.

TQC’s ability to tune the degree of overestimation systematically via the truncation parameter $d$ allows flexible trade-offs between under- and overestimation, a level of control not available when simply ensembling critics or taking minima across them. For instance, with $N = 5$ critics of $M = 25$ atoms each, dropping $d = 2$ atoms per critic retains $kN = 115$ of the $125$ pooled atoms.

4. Practical Implementation Considerations

TQC has been applied to a variety of settings, including quantum control (Perret et al., 2023), automated experimental optics (Richtmann et al., 24 May 2024), and high-dimensional robotics (Dorka et al., 2021). Key practical considerations include:

  • Computational cost: The use of multiple critics and quantiles increases memory and computational requirements, but these costs are offset by increased sample efficiency and stability in learning.
  • Hyperparameter tuning: The core hyperparameter is the number of quantiles to drop per critic ($d$); this may require environment-specific tuning, though automated adaptive mechanisms (see ACC below) can address this.
  • Replay buffer and off-policy data usage: Standard off-policy replay buffers and target networks are maintained, with typical hyperparameters (e.g., learning rate $3\times10^{-4}$, buffer size $10^6$, batch size $256$, $\gamma = 0.99$).
  • Actor update uses the full, non-truncated set of critic outputs for policy improvement, so truncation is applied during target computation, not for actor loss calculation.
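
For a quick start, the `sb3_contrib` package (Stable-Baselines3 contrib) ships a TQC implementation; a training run with the hyperparameters listed above might look like the sketch below. Argument names should be verified against the installed version, and the Pendulum task is only a placeholder environment.

```python
import gymnasium as gym
from sb3_contrib import TQC

env = gym.make("Pendulum-v1")  # placeholder continuous-control task

model = TQC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,                 # typical value cited above
    buffer_size=1_000_000,              # replay buffer of 1e6 transitions
    batch_size=256,
    gamma=0.99,
    top_quantiles_to_drop_per_net=2,    # the truncation parameter d
    policy_kwargs=dict(n_critics=5, n_quantiles=25),  # N critics, M atoms each
    verbose=1,
)
model.learn(total_timesteps=100_000)
```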

TQC demonstrates enhanced robustness to partial observability, actuator imprecision, and noise (e.g., in optical and quantum systems) due to its uncertainty-aware evaluation and conservative value estimation.

5. Extensions and Adaptive Calibration

A notable extension, Adaptively Calibrated Critics (ACC), removes the need for fixed manual selection of $d$. ACC uses recent unbiased on-policy rollout returns to calibrate $d$ automatically during training (Dorka et al., 2021):

  • The number of dropped atoms per critic is set adaptively as $d = d_\text{max} - \beta$, with $\beta$ updated via

$$\beta \leftarrow \beta + \alpha \, \frac{C}{\mathrm{ma}}$$

where $C$ is the discrepancy between the observed on-policy returns and the current Q-estimates, aggregated over a batch, and $\mathrm{ma}$ is a moving-average normalizer (a code sketch follows this list).

  • ACC–TQC achieves state-of-the-art results on continuous control tasks (OpenAI Gym, Meta-World) without the need for per-environment hyperparameter search.
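
A minimal sketch of this calibration step is given below; the clipping range, the moving-average update, and the sign convention (positive $C$ when observed returns exceed the Q-estimates, which reduces truncation) are assumptions consistent with the description above, not the reference ACC implementation.

```python
def acc_update_beta(beta, q_estimates, rollout_returns, ma,
                    alpha=0.1, d_max=5.0, ma_decay=0.99):
    """One ACC calibration step for the truncation parameter (sketch).

    q_estimates:     Q-value estimates at states visited in recent rollouts
    rollout_returns: unbiased on-policy returns observed from those states
    Returns the updated (beta, d, ma).
    """
    # Aggregated discrepancy C: positive when the critic underestimates returns.
    diffs = [g - q for q, g in zip(q_estimates, rollout_returns)]
    c = sum(diffs) / len(diffs)

    # Moving-average normalizer keeps the update roughly scale-free.
    ma = ma_decay * ma + (1.0 - ma_decay) * (sum(abs(x) for x in diffs) / len(diffs))

    # beta <- beta + alpha * C / ma, kept inside [0, d_max].
    beta = min(max(beta + alpha * c / max(ma, 1e-8), 0.0), d_max)

    # Fewer atoms are dropped (more optimism) when returns exceed the estimates;
    # in practice d would be rounded to an integer atom count.
    d = d_max - beta
    return beta, d, ma
```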

This adaptation further increases the practical deployability of TQC in heterogeneous or changing environments.

6. Applications Across Domains

TQC and its variants have been applied beyond classical control benchmarks:

  • Quantum control: Preparation of cavity Fock state superpositions, requiring exploitation of measurement back-action and managing stochastic, nonlinear quantum dynamics (Perret et al., 2023).
  • Autonomous experimental control in optics: Direct alignment of lasers with highly nonlinear, partially observable, and noisy feedback, achieving coupling performance comparable to human experts (Richtmann et al., 24 May 2024).
  • Energy-aware underwater vehicle control: Fully end-to-end 6-DOF AUV control with and without explicit energy (power) minimization, with TQC controllers showing both higher performance and significant power reductions relative to PID controllers (Boré et al., 25 Feb 2025).

These applications indicate TQC’s capability to handle uncertainty, partial observability, actuator imprecision, and environment-specific bias requirements.

7. Limitations and Future Directions

TQC’s computational complexity, tied to the number of critics and quantile outputs, may be a limiting factor for certain real-time or edge-compute deployments. The current approach to bias control via truncation or adaptive calibration could be further refined, particularly to decouple bias control from ensembling or to reduce reliance on on-policy data for calibration.

Proposed directions for further research include:

  • Enhanced theoretical analysis of the interplay between aleatoric uncertainty and overestimation bias.
  • Reducing computational overhead of ensembling while maintaining bias control.
  • Application of TQC to high-dimensional, real-world domains, such as multi-agent systems or vision-based control.
  • Hybridization with model-based, PID, or traditional control schemes to combine the advantages of learning-based and classical approaches.

Summary Table: Core Aspects of TQC

| Aspect | Description | Paper Reference |
|---|---|---|
| Value representation | Quantile-based, via ensemble critics | (Kuznetsov et al., 2020) |
| Tail truncation | Discards the $d$ largest atoms per critic during target computation | (Kuznetsov et al., 2020) |
| Ensembling | Pooled atoms from $N$ critics, sorted and truncated | (Kuznetsov et al., 2020) |
| Adaptive calibration | Online $\beta$ adaptation (ACC) for the number of dropped quantiles | (Dorka et al., 2021) |
| Application domains | Control, robotics, quantum, optics, energy-constrained AUVs | (Richtmann et al., 24 May 2024; Boré et al., 25 Feb 2025) |

TQC represents a significant development in distributional reinforcement learning, rigorously addressing the overestimation bias that affects many actor-critic algorithms. The algorithm's design (distributional critics, quantile truncation, and ensembling) enables robust performance in the face of stochasticity and nonlinearity, with demonstrated benefits across a spectrum of challenging real-world RL tasks.