
Truncated Quantile Critic (TQC) Overview

Updated 3 October 2025
  • TQC is a deep reinforcement learning algorithm designed to reduce overestimation bias by truncating overly optimistic critic predictions through an ensemble approach.
  • It employs distributional quantile regression to model the full return distribution, capturing aleatoric uncertainty in rewards and transitions.
  • Empirical results show TQC outperforms traditional actor-critic methods on high-dimensional continuous control tasks with robust and sample-efficient performance.

Truncated Quantile Critic (TQC) is a model-free deep reinforcement learning algorithm designed to address overestimation bias in value-based off-policy continuous control. It unifies three key mechanisms: a distributional representation of the value function via quantile regression, aggressive truncation of overly optimistic critic predictions, and the use of an ensemble of independent critics. TQC’s synergy of distributional learning and controlled quantile truncation yields highly robust policy improvement, outperforming prevailing actor-critic methods across standard high-dimensional benchmarks and enabling new applications beyond the original continuous control domain.

1. Theoretical Foundations: Distributional Critic and Quantile Regression

TQC employs a distributional perspective on the state–action value function: rather than learning the expected value $Q(s,a)$ directly, the critic parameterizes the full distribution of returns $Z^\pi(s,a)$ via quantile regression. Each critic network in the ensemble outputs $M$ quantile values ("atoms") for a given $(s,a)$ pair. This distributional parameterization is formalized as

$$Z_\psi(s,a) = \frac{1}{M} \sum_{m=1}^{M} \delta\!\left(\theta^{m}_{\psi}(s,a)\right),$$

where $\theta_\psi^m(s,a)$ denotes the $m$-th quantile prediction of the critic. The algorithm trains each critic to minimize the quantile Huber loss between predicted and target quantiles:

$$J_Z(\psi_n) = \mathbb{E}_{(s,a)} \left[ \frac{1}{kNM} \sum_{m=1}^{M} \sum_{i=1}^{kN} \rho_{\tau_m}^{H}\big(y_i(s,a) - \theta^m_{\psi_n}(s,a)\big) \right],$$

where $k$ and $N$ are the truncation and ensemble parameters (see below), and $\rho_\tau^H$ denotes the Huberized quantile regression loss.

This distributional formulation captures aleatoric uncertainty in rewards and transitions, and enables granular control over which parts of the value distribution inform policy updates.
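
To make the parameterization concrete, the following is a minimal sketch (not the reference implementation), assuming PyTorch: a single critic network mapping a state–action pair to $M$ atom locations, together with the fixed quantile fractions $\tau_m = (2m-1)/2M$ used by the quantile Huber loss. All names, layer sizes, and the default $M = 25$ are illustrative assumptions.

```python
# A minimal sketch of one quantile critic, assuming PyTorch; state_dim, action_dim,
# hidden sizes, and the default M = 25 atoms are illustrative assumptions, not the
# reference implementation.
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """Maps a (state, action) pair to M quantile atoms of Z_psi(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, n_atoms: int = 25, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_atoms),  # theta^1 .. theta^M: the atom locations
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Shape (batch, M); each row is the discrete approximation of Z_psi(s, a).
        return self.net(torch.cat([state, action], dim=-1))

def quantile_fractions(n_atoms: int) -> torch.Tensor:
    # Midpoint fractions tau_m = (2m - 1) / (2M), m = 1..M, used by the quantile Huber loss.
    return (torch.arange(n_atoms, dtype=torch.float32) + 0.5) / n_atoms
```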

2. Truncation Scheme: Reducing Overestimation Bias

The crux of TQC’s bias mitigation is the truncation of extreme critic predictions during the target update. At each Bellman backup, the algorithm pools the $N \times M$ quantiles ("atoms") of all critics for the next state–action pair $(s',a')$, producing the set

$$\mathcal{Z}(s',a') := \left\{\theta_{\psi_n}^m(s', a') \mid n = 1, \ldots, N;\; m = 1, \ldots, M \right\}.$$

After sorting all $NM$ atoms in ascending order, only the $kN$ smallest are used to define the target distribution:

$$y_i(s,a) = r(s,a) + \gamma \left[ z_{(i)}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right], \quad i = 1, \ldots, kN,$$

$$Y(s,a) = \frac{1}{kN} \sum_{i=1}^{kN} \delta\big(y_i(s,a)\big).$$

By discarding a controlled fraction (approximately 8% in the original work) of the most optimistic quantile targets, TQC achieves precise regulation of value-estimation optimism, offering a continuum from no truncation (fully optimistic) to aggressive underestimation.
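
A minimal sketch of this target construction, assuming PyTorch and illustrative tensor shapes: `next_atoms` holds the $N \times M$ atoms predicted by the target critics at $(s', a')$, and `drop_per_critic` plays the role of $d = M - k$, so $kN$ atoms survive the truncation. The entropy term follows the SAC-style backup in the target definition above.

```python
# A minimal sketch of the truncated target construction, assuming PyTorch.
# Shapes and names are illustrative: next_atoms is (batch, N, M) from the N target
# critics at (s', a'); reward, not_done, and next_log_prob are (batch, 1);
# drop_per_critic plays the role of d = M - k, so kN atoms survive overall.
import torch

def truncated_targets(reward, not_done, next_atoms, next_log_prob,
                      gamma: float = 0.99, alpha: float = 0.2, drop_per_critic: int = 2):
    batch, n_critics, n_atoms = next_atoms.shape
    keep = (n_atoms - drop_per_critic) * n_critics  # kN atoms retained in total

    # Pool all N*M atoms, sort ascending, and discard the largest (most optimistic) ones.
    pooled, _ = torch.sort(next_atoms.reshape(batch, -1), dim=-1)
    truncated = pooled[:, :keep]

    # Distributional Bellman backup with the SAC-style entropy term, as in the
    # target definition above.
    target = reward + not_done * gamma * (truncated - alpha * next_log_prob)
    return target.detach()  # shape (batch, kN)
```

As a point of arithmetic, the roughly 8% figure quoted above is consistent with, for example, dropping 2 of 25 atoms per critic.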

3. Ensemble of Independent Critics

TQC leverages an ensemble of $N$ independently parameterized critic networks. This ensemble serves two purposes:

  • Variance Reduction and Robustness: By aggregating quantile predictions across different critics, variance in target estimates is reduced, with outlier predictions (e.g., from "deviant" critics) more readily identified and truncated.
  • Enhanced Bias Control: The truncation process applies to the pooled quantile atoms from all critics, not per-critic, offering a stronger regularization effect.

This design decouples control of estimation bias (via truncation) from the benefits of ensembling; performance gains are observed even in the single-critic case and are further enhanced by increasing $N$.
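
A minimal sketch of how the ensemble can be organized in code, reusing the hypothetical `QuantileCritic` from the earlier sketch: $N$ independently initialized critics whose atom outputs are stacked into the pooled tensor consumed by the truncation step.

```python
# A minimal sketch of the ensemble, reusing the hypothetical QuantileCritic above:
# N independently initialized critics whose atoms are stacked into the pooled
# tensor consumed by the truncation step.
import torch
import torch.nn as nn

class CriticEnsemble(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, n_critics: int = 5, n_atoms: int = 25):
        super().__init__()
        # Independent parameterization: each critic has its own randomly initialized weights.
        self.critics = nn.ModuleList(
            QuantileCritic(state_dim, action_dim, n_atoms) for _ in range(n_critics)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Shape (batch, N, M): the per-critic atom sets, pooled downstream for truncation.
        return torch.stack([critic(state, action) for critic in self.critics], dim=1)
```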

4. Mathematical Model and Loss Structure

The key mathematical structures in TQC are:

  • Return Distribution Approximation:

$$Z_{\psi_n}(s,a) = \frac{1}{M} \sum_{m=1}^{M} \delta\big(\theta_{\psi_n}^m(s,a)\big)$$

  • Target Construction via Truncation:

$$\mathcal{Z}(s', a') = \left\{\theta_{\psi_n}^m(s', a') : n = 1, \ldots, N,\; m = 1, \ldots, M \right\}, \quad \{z_{(i)}\}_{i=1}^{NM} = \operatorname{sort}\big(\mathcal{Z}(s',a')\big)$$

$$\{y_i(s,a)\}_{i=1}^{kN} = r(s,a) + \gamma\left[z_{(i)}(s',a') - \alpha \log \pi_\phi(a' \mid s')\right]$$

  • Critic Loss:

$$J_Z(\psi_n) = \mathbb{E}\left[ \frac{1}{kNM} \sum_{m=1}^{M} \sum_{i=1}^{kN} \rho_{\tau_m}^H\big(y_i(s,a) - \theta_{\psi_n}^m(s,a)\big) \right]$$

  • Policy Loss (no truncation):

$$J_\pi(\phi) = \mathbb{E} \left[ \alpha \log \pi_\phi(a \mid s) - \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \theta_{\psi_n}^m(s,a) \right]$$

This explicit separation of the critic and policy objectives, with truncation applied only in the value update, avoids "double truncation" and aligns with the algorithm's bias-control purpose.
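
Both objectives can be sketched as follows, again assuming PyTorch and the illustrative shapes from the earlier sketches; the asymmetric weighting $|\tau_m - \mathbf{1}\{u < 0\}|$ implements $\rho_{\tau_m}^H$, and the policy loss averages over all $NM$ atoms without truncation.

```python
# A minimal sketch of both objectives, assuming PyTorch and the illustrative shapes
# above: atoms is (batch, N, M) from the online critics, targets is (batch, kN) from
# the truncated backup, log_prob is (batch, 1) for actions sampled from the policy.
import torch

def quantile_huber_loss(atoms: torch.Tensor, targets: torch.Tensor, kappa: float = 1.0):
    # Averages the per-critic losses J_Z(psi_n) over all N critics at once.
    batch, n_critics, n_atoms = atoms.shape
    tau = (torch.arange(n_atoms, dtype=torch.float32, device=atoms.device) + 0.5) / n_atoms

    # Pairwise TD errors u = y_i - theta^m_n, shape (batch, N, M, kN).
    u = targets.unsqueeze(1).unsqueeze(2) - atoms.unsqueeze(-1)
    huber = torch.where(u.abs() <= kappa, 0.5 * u ** 2, kappa * (u.abs() - 0.5 * kappa))
    weight = (tau.view(1, 1, -1, 1) - (u < 0).float()).abs()  # |tau_m - 1{u < 0}|
    return (weight * huber / kappa).mean()

def policy_loss(atoms: torch.Tensor, log_prob: torch.Tensor, alpha: float = 0.2):
    # No truncation here: the actor maximizes the mean over all N*M atoms,
    # regularized by the SAC entropy term.
    q_mean = atoms.mean(dim=(1, 2))
    return (alpha * log_prob.squeeze(-1) - q_mean).mean()
```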

5. Empirical Results and Comparative Performance

TQC was benchmarked on MuJoCo continuous control environments, achieving large improvements over prior algorithms such as Soft Actor-Critic (SAC) and TD3, including variants using a "min" operator across critics. Notably, in the challenging Humanoid environment, TQC produced approximately 25% greater average return than prior methods, which corresponds to roughly twice the running speed before agent failure under the per-timestep reward structure. Results indicate that TQC’s selective quantile truncation directly controls the optimism of Q-value targets and leads to more stable and performant policies, especially in high-variance, high-dimensional tasks.

6. Extensions and Subsequent Research

Subsequent research has adapted and extended TQC:

  • Adaptively Calibrated Critics (ACC): (Dorka et al., 2021) adjusts the truncation parameter online by comparing the critic's estimates to unbiased, high-variance on-policy returns. The calibration variable $\beta$ modulates the number of dropped atoms and is updated via

$$\beta \leftarrow \beta - \alpha\left(\langle Q(s,a) - R(s,a) \rangle / \text{normalization}\right),$$

automatically regulating the bias without manual tuning (a minimal sketch of this update appears after this list).

  • Aggressive Q-Learning with Ensembles (AQE): (Wu et al., 2021) replaces distributional quantiles with mean-ensembling over the $K$ lowest Q-values in a set of $N$ critics, achieving similar (and sometimes greater) bias control and sample efficiency, with reduced algorithmic complexity but without the explicit distributional representation of TQC.
  • Domain Extensions: TQC has been used in quantum control (for feedback preparation of high-fidelity cavity superpositions (Perret et al., 2023)), experimental optical control (fiber coupling with noisy actions (Richtmann et al., 24 May 2024)), UAV pursuit and interception under realistic flight dynamics (Giral et al., 9 Jul 2024), and 6-DOF autonomous underwater vehicle control with integrated power-awareness in the reward (Boré et al., 25 Feb 2025).
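
The ACC-style calibration step referenced above can be sketched as follows; the names, clipping range, and step size are hypothetical, not those of the original implementation.

```python
# A minimal sketch of the ACC-style calibration update referenced above. Names,
# the clipping range, and the step size are hypothetical, not those of the
# original implementation: q_estimates are critic Q-values for recent on-policy
# state-action pairs, mc_returns the corresponding observed returns.
import torch

def update_beta(beta: float, q_estimates: torch.Tensor, mc_returns: torch.Tensor,
                step_size: float = 0.1, beta_min: float = 0.0, beta_max: float = 1.0) -> float:
    # Positive bias (Q above observed returns) lowers beta, dropping more optimistic
    # atoms; negative bias raises it. Normalization keeps the update scale-free.
    bias = (q_estimates - mc_returns).mean()
    normalization = mc_returns.abs().mean().clamp(min=1e-6)
    beta = beta - step_size * (bias / normalization).item()
    return float(min(max(beta, beta_min), beta_max))
```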

7. Applications and Algorithmic Insights

Practically, TQC’s method of learning a quantile-based return distribution and truncating optimistic tails is found to confer:

  • Robustness to Noise and Nonlinearities: The conservative Q-value estimation by truncation mitigates instability from environmental stochasticity (as in fiber coupling tasks) and from high-variance back-action in quantum control feedback loops.
  • Efficiency in High-Dimensional and Continuous Spaces: Ensemble and truncation jointly permit sample-efficient learning in continuous domains, outperforming non-distributional, non-ensemble methods.
  • Flexibility for Reward and Constraint Integration: Application to AUVs demonstrates the ease of incorporating multi-objective constraints (tracking accuracy, control smoothness, energy consumption) in the reward structure.
  • Compatibility with On-line Summarization Techniques: Distribution compression techniques such as t-digests (Dunning et al., 2019) are directly relevant to storing and aggregating quantile summaries in large-scale distributed TQC implementations.

Summary Table: Algorithmic Components of TQC

  • Quantile regression critic. Function: models the full return distribution. Implementation role: informs target construction and uncertainty estimation.
  • Truncation of quantiles. Function: discards optimistic prediction outliers. Implementation role: controls optimism/overestimation bias in Bellman targets.
  • Ensemble of critics. Function: aggregates multiple independent critics. Implementation role: reduces variance; enables robust bias reduction via truncation.

TQC’s architectural choices and rigorous bias management have established it as a highly influential method in both academic research and practical applications for robust, sample-efficient reinforcement learning in high-dimensional, stochastic environments.
