Decoupled Q-Chunking (DQC)

Updated 28 February 2026
  • Decoupled Q-Chunking (DQC) is a reinforcement learning algorithm that decouples policy and critic chunk lengths, enabling long-horizon value propagation and responsive, closed-loop control.
  • It leverages implicit maximization and quantile regression to mitigate bootstrapping bias and improve stability in offline, goal-conditioned environments.
  • Experimental evaluations show that DQC outperforms traditional methods with reduced variance and improved sample efficiency, while highlighting future opportunities for adaptive chunking.

Decoupled Q-Chunking (DQC) is a reinforcement learning algorithm designed to improve the efficiency of temporal-difference (TD) learning in offline, long-horizon goal-conditioned environments, particularly where standard self-bootstrapping methods introduce bootstrapping bias through cumulative errors in value targets. DQC extends chunked-critic approaches by decoupling policy and critic chunk lengths—allowing the value function to propagate over long action chunks while policies operate on short, reactive sequences. The central mechanism involves distilling a partial (prefix) critic by "optimistically" backing up values from the main chunked critic, leveraging implicit maximization and quantile regression to support closed-loop reactive policy execution and robust value estimation for challenging tasks (Li et al., 11 Dec 2025).

1. Core Definitions and Framework

DQC is formalized in the context of a Markov Decision Process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $T(s' \mid s, a)$, bounded reward $r(s, a) \in [0, 1]$, and discount factor $\gamma \in [0, 1)$. The algorithm operates on an offline dataset $\mathcal{D}$ comprising trajectories $(s_t, \ldots, s_{t+H}), (a_t, \ldots, a_{t+H-1}), (r_t, \ldots, r_{t+H-1})$.
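As a rough illustration of how such trajectories can be cut into training chunks, the sketch below slices one $n$-step window out of a single trajectory. The function name `extract_chunk` and the array layout (one row per time step) are assumptions made for this example, not part of the source.

```python
import numpy as np

def extract_chunk(states, actions, rewards, t, n):
    """Return (s_t, a_{t:t+n-1}, r_{t:t+n-1}, s_{t+n}) from one trajectory.

    Assumed layout (hypothetical): states has H+1 rows, actions and rewards have H.
    """
    horizon = len(actions)
    if len(states) != horizon + 1 or len(rewards) != horizon:
        raise ValueError("expected H+1 states, H actions, and H rewards")
    if t < 0 or n < 1 or t + n > horizon:
        raise ValueError("chunk must lie inside the trajectory")
    return states[t], actions[t:t + n], rewards[t:t + n], states[t + n]

# Toy trajectory with H = 6, 1-D states, and 2-D actions.
H = 6
states = np.arange(H + 1, dtype=np.float64).reshape(H + 1, 1)
actions = np.zeros((H, 2))
rewards = np.ones(H)
s_t, a_chunk, r_chunk, s_next = extract_chunk(states, actions, rewards, t=1, n=4)
```

The returned tuple contains exactly the quantities an $n$-step backup needs: the start state, the action chunk, the per-step rewards, and the state reached after the chunk.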

Chunked Critic $Q^{(n)}$: Estimates the value for an $n$-step open-loop action sequence:

$$Q^{(n)}_\phi(s_t, a_{t:t+n-1}) \approx \mathbb{E}\Big[\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'_{0:n-1}} Q^{(n)}_{\bar{\phi}}(s_{t+n}, a'_{0:n-1})\Big]$$

where $\bar{\phi}$ denotes a target network or, alternatively, an implicit value function.
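The right-hand side of this backup can be sketched numerically as a discounted chunk return plus a discounted bootstrap term. In the minimal example below, `chunk_target` is a hypothetical helper, and `bootstrap_value` stands in for whatever estimate of the next-state value is available (the max over the target critic, or an implicit value function).

```python
import numpy as np

def chunk_target(rewards, bootstrap_value, gamma):
    """n-step target: sum_k gamma^k * r_{t+k} + gamma^n * bootstrap_value."""
    rewards = np.asarray(rewards, dtype=np.float64)
    n = rewards.shape[-1]
    discounts = gamma ** np.arange(n)           # [1, gamma, ..., gamma^(n-1)]
    chunk_return = np.sum(discounts * rewards, axis=-1)
    return chunk_return + gamma ** n * bootstrap_value

# n = 3 chunk: 1 + 0.5 * 0 + 0.25 * 1 = 1.25, plus 0.125 * 2.0 = 0.25
target = chunk_target([1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

Because the bootstrap term is scaled by $\gamma^n$ rather than $\gamma$, any error in the bootstrapped estimate enters the target with a smaller weight as $n$ grows.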

Partial (Distilled) Critic $\tilde{Q}^{(k)}$: For chunk prefix length $k \leq n$, estimates the maximal value achievable by optimistically filling in the unobserved remainder of the chunk:

$$\tilde{Q}_\psi(s_t, a_{t:t+k-1}) \approx \max_{a_{t+k:t+n-1}} Q^{(n)}_\phi(s_t, [a_{t:t+k-1}, a_{t+k:t+n-1}])$$
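To make the optimistic completion concrete, the toy sketch below evaluates the maximization exactly for a single state with a small discrete action set, where the chunked critic is just an $n$-dimensional table. This tabular setting and the name `optimistic_prefix_values` are assumptions for illustration only; in continuous action spaces the explicit maximum is not enumerable like this.

```python
import numpy as np

def optimistic_prefix_values(q_chunk, k):
    """Max over the last (n - k) action axes of a tabular chunk critic.

    q_chunk[a_0, ..., a_{n-1}] holds Q^(n)(s, a_{0:n-1}) for one fixed state.
    """
    n = q_chunk.ndim
    if not 1 <= k <= n:
        raise ValueError("prefix length k must satisfy 1 <= k <= n")
    if k == n:
        return q_chunk.copy()                   # nothing left to fill in
    return q_chunk.max(axis=tuple(range(k, n)))

# Two actions, chunk length n = 3: entries 0..7 indexed by (a_0, a_1, a_2).
q_chunk = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
prefix_q_k1 = optimistic_prefix_values(q_chunk, k=1)   # best completion per a_0
prefix_q_k2 = optimistic_prefix_values(q_chunk, k=2)   # best a_2 per (a_0, a_1)
```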

Policy $\pi$: Generates action chunks $a_{t:t+k-1}$ of length $k \leq n$, but during execution, only the first action of each chunk is taken, implementing a closed-loop control regime.

Behavior Prior $\pi_\beta$: A flow-based action density fit to the offline dataset $\mathcal{D}$, used to regularize the policy in the offline setting.

2. Algorithmic Procedure and Losses

DQC involves three principal network classes (chunked critic $Q^{(n)}_\phi$, partial/distilled critic $\tilde{Q}_\psi$, and value-backup network $V_\xi$) with separate but coordinated training procedures.

  1. Chunked-Critic Bellman Backup: $Q^{(n)}_\phi$ is trained with the $n$-step TD target,

$$L_Q = \mathbb{E}\Big[\big(Q^{(n)}_\phi(s_t, a_{t:t+n-1}) - [R_{t:t+n-1} + \gamma^n V_\xi(s_{t+n})]\big)^2\Big]$$

where $R_{t:t+n-1} = \sum_{j=0}^{n-1} \gamma^j r_{t+j}$ denotes the cumulative discounted reward for the $n$-step chunk.

  2. Distillation of the Partial Critic: For each partial prefix $a_{t:t+k-1}$, an "optimistic" target is computed by maximizing over possible completions. Instead of explicit maximization, an asymmetric regression loss $f_{\mathrm{imp}}$ (expectile or quantile) encourages $\tilde{Q}_\psi$ to regress upwards toward the maximum of $Q^{(n)}_\phi$ over completions:

$$L_{\tilde{Q}} = \mathbb{E}\Big[f_{\mathrm{imp}}\big(Q^{(n)}_\phi(s_t, a_{t:t+n-1}) - \tilde{Q}_\psi(s_t, a_{t:t+k-1})\big)\Big]$$

  3. Value-Backup Network: $V_\xi$ is updated using an implicit quantile regression loss on $\tilde{Q}_\psi$:

$$L_V = \mathbb{E}\Big[f_{\mathrm{quantile}}\big(\tilde{Q}_\psi(s, a_{0:k-1}) - V_\xi(s)\big)\Big]$$
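The three losses above can be sketched on plain arrays as follows. This is a minimal illustration, assuming the standard expectile form (asymmetrically weighted squared error) for $f_{\mathrm{imp}}$ and the standard pinball form for $f_{\mathrm{quantile}}$. The function names and the batch layout are hypothetical, and a real implementation would also stop gradients through each regression target.

```python
import numpy as np

def expectile_loss(diff, kappa):
    """Asymmetric squared loss: weight kappa where diff > 0, else (1 - kappa)."""
    diff = np.asarray(diff, dtype=np.float64)
    weight = np.where(diff > 0, kappa, 1.0 - kappa)
    return weight * diff ** 2

def quantile_loss(diff, kappa):
    """Pinball loss: weight kappa where diff > 0, else (1 - kappa), on |diff|."""
    diff = np.asarray(diff, dtype=np.float64)
    weight = np.where(diff > 0, kappa, 1.0 - kappa)
    return weight * np.abs(diff)

def dqc_losses(q_chunk, q_prefix, v_state, td_target, kappa_d, kappa_b):
    """Batch means of L_Q, L_Qtilde, and L_V (targets treated as constants)."""
    loss_q = np.mean((q_chunk - td_target) ** 2)
    loss_prefix = np.mean(expectile_loss(q_chunk - q_prefix, kappa_d))
    loss_v = np.mean(quantile_loss(q_prefix - v_state, kappa_b))
    return loss_q, loss_prefix, loss_v

# Toy batch of two samples.
loss_q, loss_prefix, loss_v = dqc_losses(
    q_chunk=np.array([1.0, 2.0]),
    q_prefix=np.array([1.5, 1.5]),
    v_state=np.array([1.0, 2.0]),
    td_target=np.array([1.0, 1.0]),
    kappa_d=0.8,
    kappa_b=0.9,
)
```

With $\kappa > 0.5$, positive residuals (target above prediction) are penalized more heavily than negative ones, so the regressed quantity is pulled toward the upper range of its targets. This is how the explicit maximization is approximated.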

3. Policy Extraction and Optimization

Rather than direct parametric training of $\pi$, DQC relies on best-of-$N$ sampling from the behavior prior $\pi_\beta$:

  • At evaluation time, $N$ candidate chunk prefixes $a^{(i)}_{0:k-1} \sim \pi_\beta(\cdot \mid s)$ are sampled.
  • The candidate maximizing $\tilde{Q}_\psi(s, a^{(i)})$ is selected for execution.

This approach avoids adversarial exploitation of the value function by the policy, an issue encountered in offline RL with explicit policy objectives. In principle, entropy or other regularization can be incorporated if training a parametric $\pi$.
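The best-of-$N$ extraction step can be sketched as below. The callables `sample_prefixes` and `prefix_critic` are hypothetical stand-ins for the behavior prior $\pi_\beta$ and the distilled critic $\tilde{Q}_\psi$. The toy sampler and critic exist only to make the example runnable and do not reflect the source's models.

```python
import numpy as np

def best_of_n_prefix(state, sample_prefixes, prefix_critic, num_candidates, rng):
    """Sample N candidate prefixes and return the highest-scoring one."""
    candidates = sample_prefixes(state, num_candidates, rng)   # (N, k, action_dim)
    scores = np.asarray(prefix_critic(state, candidates))      # (N,)
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])

def toy_sampler(state, num, rng):
    # Stand-in for pi_beta: uniform prefixes with k = 5 and 2-D actions.
    return rng.uniform(-1.0, 1.0, size=(num, 5, 2))

def toy_critic(state, candidates):
    # Stand-in for the distilled critic: prefers small-magnitude actions.
    return -np.sum(candidates ** 2, axis=(1, 2))

rng = np.random.default_rng(0)
chunk, score = best_of_n_prefix(np.zeros(3), toy_sampler, toy_critic, 32, rng)
first_action = chunk[0]   # closed-loop control executes this, then re-plans
```

Each decision costs $N$ sampler draws and $N$ critic evaluations on length-$k$ prefixes.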

4. Experimental Evaluation

DQC was evaluated on six challenging long-horizon goal-conditioned environments drawn from OGBench, including cube-triple, cube-quadruple, cube-octuple (robot-arm cube reconfiguration), humanoidmaze-giant (4000-step humanoid navigation), and puzzle-4×5 / 4×6 button-flip tasks. Datasets comprised between 3 million and 1 billion transitions.

Baselines included:

  • OS (1-step TD / IQL), NS ($n$-step TD), QC (Q-chunking with $k=n$), DQC-naïve (QC policy, executing only the length-$k$ prefix)
  • SHARSA (previous SOTA hierarchical RL), IQL/HIQL (implicit Q), FBC/HFBC (flow behavior cloning)

Metrics were success rate averaged across 5 tasks × 50 seeds and aggregate OGBench scores.

Key findings:

  • DQC with $n=25$ and $k=5$ or $k=1$ consistently surpassed all baselines.
  • Ablations indicated that the distilled critic was essential (QC-NS vs DQC).
  • DQC reduced percentile variance, providing more stable performance (Li et al., 11 Dec 2025).

5. Practical Hyperparameters and Implementation

Key hyperparameters and practical guidance include:

  • Implicit losses: Expectile for distillation ($\kappa_d \approx 0.5$–$0.8$), quantile for value backup ($\kappa_b \approx 0.7$–$0.97$).
  • Best-of-$N$ sampling: $N \approx 32$ for candidate chunk evaluation.
  • Batch size: $4096$, critical for stable training in complex environments.
  • Chunk lengths: Critic chunk length $n \gg k$ (e.g., $n=25$ with $k=1$ or $k=5$).
  • Computation: Critic updates are $O(n)$, distillation adds one extra pass on chunk prefixes, and policy extraction cost is $O(Nk)$.

DQC is robust to most hyperparameter choices as long as loss asymmetry (optimism) is maintained ($\kappa \neq 0.5$); performance degrades if $\kappa_b = \kappa_d = 0.5$. Computational cost trades off against the best-of-$N$ sample count and the batch size.

6. Limitations and Research Directions

DQC currently uses fixed $(n, k)$ across all states, which may not be optimal in environments with variable horizon structure. Adaptive or state-dependent chunk lengths are suggested as a possible improvement.

Extensions include:

  • Parametric policy training against $\tilde{Q}_\psi$ for online or fine-tuning settings.
  • Incorporation of stochastic or model-based rollouts for filling in the unobserved portion of chunks.
  • Further investigation into optimizing computational efficiency, especially as chunk length grows.

A plausible implication is that DQC's decoupled chunking mechanism can be generalized to additional settings where the separation of value propagation and policy reactivity is beneficial for learning stability and sample efficiency.

7. Position in Offline RL Literature

DQC addresses challenges in previous chunked-critic approaches by eliminating the requirement for open-loop chunk policies—improving reactivity and mitigating sub-optimality for long action chunks. Its structure allows for efficient multi-step value propagation while preserving compatibility with reactive, closed-loop policies, distinguishing it from both standard multi-step TD and open-loop chunking algorithms. The empirical advances and ablation studies demonstrate the substantive impact of the distilled partial critic and decoupling mechanism on both sample efficiency and reliability in long-horizon offline RL (Li et al., 11 Dec 2025).

References (1)
1.
