
Truncated Quantile Critic (TQC) Overview

Updated 3 October 2025
  • TQC is a deep reinforcement learning algorithm designed to reduce overestimation bias by truncating overly optimistic critic predictions through an ensemble approach.
  • It employs distributional quantile regression to model the full return distribution, capturing aleatoric uncertainty in rewards and transitions.
  • Empirical results show TQC outperforms traditional actor-critic methods on high-dimensional continuous control tasks with robust and sample-efficient performance.

Truncated Quantile Critic (TQC) is a model-free deep reinforcement learning algorithm designed to address overestimation bias in value-based off-policy continuous control. It unifies three key mechanisms: a distributional representation of the value function via quantile regression, aggressive truncation of overly optimistic critic predictions, and the use of an ensemble of independent critics. TQC’s synergy of distributional learning and controlled quantile truncation yields highly robust policy improvement, outperforming prevailing actor-critic methods across standard high-dimensional benchmarks and enabling new applications beyond the original continuous control domain.

1. Theoretical Foundations: Distributional Critic and Quantile Regression

TQC employs a distributional perspective on the state–action value function: rather than learning the expected value $Q(s,a)$ directly, the critic parameterizes the full distribution of returns $Z^\pi(s,a)$ via quantile regression. Each critic network in the ensemble outputs $M$ quantile values ("atoms") for a given $(s,a)$ pair. This distributional parameterization is formalized as

$$Z_\psi(s,a) = \frac{1}{M} \sum_{m=1}^{M} \delta\!\left(\theta^{m}_{\psi}(s,a)\right),$$

where $\theta_\psi^m(s,a)$ denotes the $m$-th quantile prediction of the critic. The algorithm trains each critic to minimize the quantile Huber loss between predicted and target quantiles:

$$J_Z(\psi_n) = \mathbb{E}_{(s,a)} \left[ \frac{1}{kNM} \sum_{m=1}^{M} \sum_{i=1}^{kN} \rho_{\tau_m}^{H}\big(y_i(s,a) - \theta^m_{\psi_n}(s,a)\big) \right],$$

where $k$ and $N$ are the truncation and ensemble parameters (see below), and $\rho_\tau^H$ denotes the Huberized quantile regression loss.

This distributional formulation captures aleatoric uncertainty in rewards and transitions, and enables granular control over which parts of the value distribution inform policy updates.
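
To make the parameterization concrete, the following is a minimal sketch (not the reference implementation), assuming PyTorch: a single critic network mapping a state–action pair to $M$ atom locations, together with the fixed quantile fractions $\tau_m = (2m-1)/2M$ used by the quantile Huber loss. All names, layer sizes, and the default $M = 25$ are illustrative assumptions.

```python
# A minimal sketch of one quantile critic, assuming PyTorch; state_dim, action_dim,
# hidden sizes, and the default M = 25 atoms are illustrative assumptions, not the
# reference implementation.
import torch
import torch.nn as nn

class QuantileCritic(nn.Module):
    """Maps a (state, action) pair to M quantile atoms of Z_psi(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, n_atoms: int = 25, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_atoms),  # theta^1 .. theta^M: the atom locations
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Shape (batch, M); each row is the discrete approximation of Z_psi(s, a).
        return self.net(torch.cat([state, action], dim=-1))

def quantile_fractions(n_atoms: int) -> torch.Tensor:
    # Midpoint fractions tau_m = (2m - 1) / (2M), m = 1..M, used by the quantile Huber loss.
    return (torch.arange(n_atoms, dtype=torch.float32) + 0.5) / n_atoms
```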

2. Truncation Scheme: Reducing Overestimation Bias

The crux of TQC’s bias mitigation is the truncation of extreme critic predictions during the target update. At each Bellman backup, the algorithm pools the $N \times M$ quantiles ("atoms") of all critics for the next state–action pair $(s',a')$, producing the set

$$\mathcal{Z}(s',a') := \left\{\theta_{\psi_n}^m(s', a') \mid n = 1, \ldots, N;\; m = 1, \ldots, M \right\}.$$

After sorting all $NM$ atoms in ascending order, only the $kN$ smallest are used to define the target distribution:

$$y_i(s,a) = r(s,a) + \gamma \left[ z_{(i)}(s', a') - \alpha \log \pi_\phi(a' \mid s') \right], \quad i = 1, \ldots, kN,$$

$$Y(s,a) = \frac{1}{kN} \sum_{i=1}^{kN} \delta\big(y_i(s,a)\big).$$

By discarding a controlled fraction (approximately 8% in the original work) of the most optimistic quantile targets, TQC achieves precise regulation of value-estimation optimism, offering a continuum from no truncation (fully optimistic) to aggressive underestimation.
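
A minimal sketch of this target construction, assuming PyTorch and illustrative tensor shapes: `next_atoms` holds the $N \times M$ atoms predicted by the target critics at $(s', a')$, and `drop_per_critic` plays the role of $d = M - k$, so $kN$ atoms survive the truncation. The entropy term follows the SAC-style backup in the target definition above.

```python
# A minimal sketch of the truncated target construction, assuming PyTorch.
# Shapes and names are illustrative: next_atoms is (batch, N, M) from the N target
# critics at (s', a'); reward, not_done, and next_log_prob are (batch, 1);
# drop_per_critic plays the role of d = M - k, so kN atoms survive overall.
import torch

def truncated_targets(reward, not_done, next_atoms, next_log_prob,
                      gamma: float = 0.99, alpha: float = 0.2, drop_per_critic: int = 2):
    batch, n_critics, n_atoms = next_atoms.shape
    keep = (n_atoms - drop_per_critic) * n_critics  # kN atoms retained in total

    # Pool all N*M atoms, sort ascending, and discard the largest (most optimistic) ones.
    pooled, _ = torch.sort(next_atoms.reshape(batch, -1), dim=-1)
    truncated = pooled[:, :keep]

    # Distributional Bellman backup with the SAC-style entropy term, as in the
    # target definition above.
    target = reward + not_done * gamma * (truncated - alpha * next_log_prob)
    return target.detach()  # shape (batch, kN)
```

As a point of arithmetic, the roughly 8% figure quoted above is consistent with, for example, dropping 2 of 25 atoms per critic.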

3. Ensemble of Independent Critics

TQC leverages an ensemble of $N$ independently parameterized critic networks. This ensemble serves two purposes:

  • Variance Reduction and Robustness: By aggregating quantile predictions across different critics, variance in target estimates is reduced, with outlier predictions (e.g., from "deviant" critics) more readily identified and truncated.
  • Enhanced Bias Control: The truncation process applies to the pooled quantile atoms from all critics, not per-critic, offering a stronger regularization effect.

This design decouples control of estimation bias (via truncation) from the benefits of ensembling; performance gains are observed even in the single-critic case and are further enhanced by increasing $N$.
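
A minimal sketch of how the ensemble can be organized in code, reusing the hypothetical `QuantileCritic` from the earlier sketch: $N$ independently initialized critics whose atom outputs are stacked into the pooled tensor consumed by the truncation step.

```python
# A minimal sketch of the ensemble, reusing the hypothetical QuantileCritic above:
# N independently initialized critics whose atoms are stacked into the pooled
# tensor consumed by the truncation step.
import torch
import torch.nn as nn

class CriticEnsemble(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, n_critics: int = 5, n_atoms: int = 25):
        super().__init__()
        # Independent parameterization: each critic has its own randomly initialized weights.
        self.critics = nn.ModuleList(
            QuantileCritic(state_dim, action_dim, n_atoms) for _ in range(n_critics)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Shape (batch, N, M): the per-critic atom sets, pooled downstream for truncation.
        return torch.stack([critic(state, action) for critic in self.critics], dim=1)
```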

4. Mathematical Model and Loss Structure

The key mathematical structures in TQC are:

  • Return Distribution Approximation:

$$Z_{\psi_n}(s,a) = \frac{1}{M} \sum_{m=1}^{M} \delta\big(\theta_{\psi_n}^m(s,a)\big)$$

  • Target Construction via Truncation:

$$\mathcal{Z}(s', a') = \left\{\theta_{\psi_n}^m(s', a') : n = 1, \ldots, N,\; m = 1, \ldots, M \right\}, \quad \{z_{(i)}\}_{i=1}^{NM} = \operatorname{sort}\big(\mathcal{Z}(s',a')\big)$$

$$\{y_i(s,a)\}_{i=1}^{kN} = r(s,a) + \gamma\left[z_{(i)}(s',a') - \alpha \log \pi_\phi(a' \mid s')\right]$$

  • Critic Loss:

$$J_Z(\psi_n) = \mathbb{E}\left[ \frac{1}{kNM} \sum_{m=1}^{M} \sum_{i=1}^{kN} \rho_{\tau_m}^H\big(y_i(s,a) - \theta_{\psi_n}^m(s,a)\big) \right]$$

  • Policy Loss (no truncation):

$$J_\pi(\phi) = \mathbb{E} \left[ \alpha \log \pi_\phi(a \mid s) - \frac{1}{NM} \sum_{n=1}^{N} \sum_{m=1}^{M} \theta_{\psi_n}^m(s,a) \right]$$

This explicit separation of the critic and policy objectives, with truncation applied only in the value update, avoids "double truncation" and aligns with the algorithm's bias-control purpose.
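
Both objectives can be sketched as follows, again assuming PyTorch and the illustrative shapes from the earlier sketches; the asymmetric weighting $|\tau_m - \mathbf{1}\{u < 0\}|$ implements $\rho_{\tau_m}^H$, and the policy loss averages over all $NM$ atoms without truncation.

```python
# A minimal sketch of both objectives, assuming PyTorch and the illustrative shapes
# above: atoms is (batch, N, M) from the online critics, targets is (batch, kN) from
# the truncated backup, log_prob is (batch, 1) for actions sampled from the policy.
import torch

def quantile_huber_loss(atoms: torch.Tensor, targets: torch.Tensor, kappa: float = 1.0):
    # Averages the per-critic losses J_Z(psi_n) over all N critics at once.
    batch, n_critics, n_atoms = atoms.shape
    tau = (torch.arange(n_atoms, dtype=torch.float32, device=atoms.device) + 0.5) / n_atoms

    # Pairwise TD errors u = y_i - theta^m_n, shape (batch, N, M, kN).
    u = targets.unsqueeze(1).unsqueeze(2) - atoms.unsqueeze(-1)
    huber = torch.where(u.abs() <= kappa, 0.5 * u ** 2, kappa * (u.abs() - 0.5 * kappa))
    weight = (tau.view(1, 1, -1, 1) - (u < 0).float()).abs()  # |tau_m - 1{u < 0}|
    return (weight * huber / kappa).mean()

def policy_loss(atoms: torch.Tensor, log_prob: torch.Tensor, alpha: float = 0.2):
    # No truncation here: the actor maximizes the mean over all N*M atoms,
    # regularized by the SAC entropy term.
    q_mean = atoms.mean(dim=(1, 2))
    return (alpha * log_prob.squeeze(-1) - q_mean).mean()
```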

5. Empirical Results and Comparative Performance

TQC was benchmarked on MuJoCo continuous control environments, achieving large improvements over prior algorithms such as Soft Actor-Critic (SAC) and TD3, including variants using a "min" operator across critics. Notably, in the challenging Humanoid environment, TQC produced approximately 25% greater average return than prior methods, which corresponds to roughly twice the running speed before agent failure under the per-timestep reward structure. Results indicate that TQC’s selective quantile truncation directly controls the optimism of Q-value targets and leads to more stable and performant policies, especially in high-variance, high-dimensional tasks.

6. Extensions and Subsequent Research

Subsequent research has adapted and extended TQC:

  • Adaptively Calibrated Critics (ACC): (Dorka et al., 2021) adjusts the truncation parameter online by comparing the critic's estimates to unbiased, high-variance on-policy returns. The calibration variable $\beta$ modulates the number of dropped atoms and is updated via

$$\beta \leftarrow \beta - \alpha\left(\langle Q(s,a) - R(s,a) \rangle / \text{normalization}\right),$$

automatically regulating the bias without manual tuning (a minimal sketch of this update appears after this list).

  • Aggressive Q-Learning with Ensembles (AQE): (Wu et al., 2021) replaces distributional quantiles with mean-ensembling over the $K$ lowest Q-values in a set of $N$ critics, achieving similar (and sometimes greater) bias control and sample efficiency, with reduced algorithmic complexity but without the explicit distributional representation of TQC.
  • Domain Extensions: TQC has been used in quantum control (for feedback preparation of high-fidelity cavity superpositions (Perret et al., 2023)), experimental optical control (fiber coupling with noisy actions (Richtmann et al., 24 May 2024)), UAV pursuit and interception under realistic flight dynamics (Giral et al., 9 Jul 2024), and 6-DOF autonomous underwater vehicle control with integrated power-awareness in the reward (Boré et al., 25 Feb 2025).
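
The ACC-style calibration step referenced above can be sketched as follows; the names, clipping range, and step size are hypothetical, not those of the original implementation.

```python
# A minimal sketch of the ACC-style calibration update referenced above. Names,
# the clipping range, and the step size are hypothetical, not those of the
# original implementation: q_estimates are critic Q-values for recent on-policy
# state-action pairs, mc_returns the corresponding observed returns.
import torch

def update_beta(beta: float, q_estimates: torch.Tensor, mc_returns: torch.Tensor,
                step_size: float = 0.1, beta_min: float = 0.0, beta_max: float = 1.0) -> float:
    # Positive bias (Q above observed returns) lowers beta, dropping more optimistic
    # atoms; negative bias raises it. Normalization keeps the update scale-free.
    bias = (q_estimates - mc_returns).mean()
    normalization = mc_returns.abs().mean().clamp(min=1e-6)
    beta = beta - step_size * (bias / normalization).item()
    return float(min(max(beta, beta_min), beta_max))
```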

7. Applications and Algorithmic Insights

Practically, TQC’s method of learning a quantile-based return distribution and truncating optimistic tails is found to confer:

  • Robustness to Noise and Nonlinearities: The conservative Q-value estimation by truncation mitigates instability from environmental stochasticity (as in fiber coupling tasks) and from high-variance back-action in quantum control feedback loops.
  • Efficiency in High-Dimensional and Continuous Spaces: Ensemble and truncation jointly permit sample-efficient learning in continuous domains, outperforming non-distributional, non-ensemble methods.
  • Flexibility for Reward and Constraint Integration: Application to AUVs demonstrates the ease of incorporating multi-objective constraints (tracking accuracy, control smoothness, energy consumption) in the reward structure.
  • Compatibility with On-line Summarization Techniques: Distribution compression techniques such as t-digests (Dunning et al., 2019) are directly relevant to storing and aggregating quantile summaries in large-scale distributed TQC implementations.

Summary Table: Algorithmic Components of TQC

  • Quantile regression critic. Function: models the full return distribution. Implementation role: informs target construction and uncertainty estimation.
  • Truncation of quantiles. Function: discards optimistic prediction outliers. Implementation role: controls optimism/overestimation bias in Bellman targets.
  • Ensemble of critics. Function: aggregates multiple independent critics. Implementation role: reduces variance; enables robust bias reduction via truncation.

TQC’s architectural choices and rigorous bias management have established it as a highly influential method in both academic research and practical applications for robust, sample-efficient reinforcement learning in high-dimensional, stochastic environments.
