Decoupled Q-Chunking (DQC)
- Decoupled Q-Chunking (DQC) is a reinforcement learning algorithm that decouples policy and critic chunk lengths, enabling long-horizon value propagation and responsive, closed-loop control.
- It leverages implicit maximization and quantile regression to mitigate bootstrapping bias and improve stability in offline, goal-conditioned environments.
- Experimental evaluations show that DQC outperforms traditional methods with reduced variance and improved sample efficiency, while highlighting future opportunities for adaptive chunking.
Decoupled Q-Chunking (DQC) is a reinforcement learning algorithm designed to improve the efficiency of temporal-difference (TD) learning in offline, long-horizon goal-conditioned environments, particularly where standard self-bootstrapping methods introduce bootstrapping bias through cumulative errors in value targets. DQC extends chunked-critic approaches by decoupling policy and critic chunk lengths—allowing the value function to propagate over long action chunks while policies operate on short, reactive sequences. The central mechanism involves distilling a partial (prefix) critic by "optimistically" backing up values from the main chunked critic, leveraging implicit maximization and quantile regression to support closed-loop reactive policy execution and robust value estimation for challenging tasks (Li et al., 11 Dec 2025).
1. Core Definitions and Framework
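DQC is formalized in the context of a Markov Decision Process (MDP) with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P(s' \mid s, a)$, bounded reward $r(s, a)$, and discount factor $\gamma$. The algorithm operates on an offline dataset $\mathcal{D}$ comprising trajectories $(s_0, a_0, s_1, a_1, \ldots)$.
Chunked Critic $Q(s_t, a_{t:t+h})$: Estimates the value for an $h$-step open-loop action sequence $a_{t:t+h} = (a_t, \ldots, a_{t+h-1})$:
$$Q(s_t, a_{t:t+h}) \approx \sum_{k=0}^{h-1} \gamma^k r_{t+k} + \gamma^h V(s_{t+h}),$$
where $V$ denotes a target network or, alternatively, an implicit value function.
Partial (Distilled) Critic $Q^P(s_t, a_{t:t+h_a})$: For chunk prefix length $h_a < h$, estimates the maximal value achievable by optimistically filling in the unobserved remainder of the chunk:
$$Q^P(s_t, a_{t:t+h_a}) \approx \max_{a_{t+h_a:t+h}} Q(s_t, a_{t:t+h}).$$
Policy $\pi(a_{t:t+h_a} \mid s_t)$: Generates action chunks of length $h_a$, but during execution, only the first action of each chunk is taken, implementing a closed-loop control regime.
Behavior Prior $\pi_\beta$: A flow-based action density fit to the offline dataset $\mathcal{D}$, used to regularize the policy in the offline setting.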
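As a minimal sketch of the two critic definitions above, the snippet below computes a multi-step chunk target and an explicit prefix maximum on a toy discrete problem. The function names `chunk_target` and `optimistic_prefix_value`, the tabular `q_table`, and all numbers are hypothetical and chosen for illustration; DQC itself uses learned networks and an implicit rather than explicit maximization.

```python
from itertools import product

def chunk_target(rewards, bootstrap_value, gamma):
    """Discounted reward over the chunk plus the discounted bootstrap value."""
    h = len(rewards)
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted + gamma ** h * bootstrap_value

def optimistic_prefix_value(q_table, prefix, actions, h):
    """Explicit max of a tabular chunked critic over all completions of a prefix."""
    remainder = h - len(prefix)
    return max(q_table[prefix + tail] for tail in product(actions, repeat=remainder))

# Toy chunked critic for a single state: 2 actions, critic chunk length 3.
actions = (0, 1)
h = 3
q_table = {chunk: float(sum(chunk)) for chunk in product(actions, repeat=h)}

target = chunk_target(rewards=[1.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
prefix_value = optimistic_prefix_value(q_table, prefix=(1,), actions=actions, h=h)
print(target, prefix_value)
```

The explicit maximum enumerates every completion, which is only feasible for tiny discrete action sets; this is one way to see why an implicit maximization is attractive for continuous or long chunks.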
2. Algorithmic Procedure and Losses
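DQC involves three principal network classes (chunked critic $Q$, partial/distilled critic $Q^P$, and value-backup network $V$) with separate but coordinated training procedures.
- Chunked-Critic Bellman Backup: $Q$ is trained with the $h$-step TD target,
$$L(Q) = \mathbb{E}_{\mathcal{D}}\Big[\big(Q(s_t, a_{t:t+h}) - R_{t:t+h} - \gamma^h V(s_{t+h})\big)^2\Big],$$
where $R_{t:t+h} = \sum_{k=0}^{h-1} \gamma^k r_{t+k}$ denotes the cumulative discounted reward for the $h$-step chunk.
- Distillation of the Partial Critic: For each partial prefix $a_{t:t+h_a}$, an “optimistic” target is computed by maximizing over possible completions. Instead of explicit maximization, an asymmetric squared error loss (expectile or quantile) encourages $Q^P$ to regress upwards toward the maximum of $Q$:
$$L(Q^P) = \mathbb{E}_{\mathcal{D}}\Big[\ell_2^{\kappa_d}\big(Q(s_t, a_{t:t+h}) - Q^P(s_t, a_{t:t+h_a})\big)\Big], \qquad \ell_2^{\kappa}(u) = \big|\kappa - \mathbb{1}\{u < 0\}\big|\, u^2.$$
- Value-Backup Network: $V$ is updated using an implicit quantile regression loss on $Q^P$:
$$L(V) = \mathbb{E}_{\mathcal{D}}\Big[\ell_1^{\kappa_b}\big(Q^P(s_t, a_{t:t+h_a}) - V(s_t)\big)\Big], \qquad \ell_1^{\kappa}(u) = \big|\kappa - \mathbb{1}\{u < 0\}\big|\, |u|.$$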
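The three training signals listed above can be sketched on a single transition with scalar stand-ins for the network outputs. This is an illustrative sketch, not the authors' implementation: the dictionary keys, the parameter names `kappa_d` and `kappa_b`, and the choice to regress the value-backup network on the partial critic are assumptions, and a real implementation would operate on batched tensors with stop-gradients on the targets.

```python
def asymmetric_weight(diff, kappa):
    """Weight kappa on non-negative residuals and 1 - kappa on negative ones."""
    return kappa if diff >= 0 else 1.0 - kappa

def expectile_loss(diff, kappa):
    """Asymmetric squared error; kappa > 0.5 pulls the prediction upward."""
    return asymmetric_weight(diff, kappa) * diff ** 2

def quantile_loss(diff, kappa):
    """Asymmetric absolute error (pinball loss)."""
    return asymmetric_weight(diff, kappa) * abs(diff)

def dqc_losses(sample, gamma, h, kappa_d, kappa_b):
    """Return (critic, distillation, value-backup) losses for one sample."""
    # Multi-step TD target: chunk return plus discounted bootstrap value.
    td_target = sample["chunk_return"] + gamma ** h * sample["v_next"]
    critic_loss = (sample["q_chunk"] - td_target) ** 2
    # Partial critic regresses toward the chunked critic (assumed target).
    distill_loss = expectile_loss(sample["q_chunk"] - sample["q_partial"], kappa_d)
    # Value network regresses toward the partial critic (assumed target).
    backup_loss = quantile_loss(sample["q_partial"] - sample["v"], kappa_b)
    return critic_loss, distill_loss, backup_loss

# Hypothetical scalar predictions for one transition.
sample = {"chunk_return": 0.5, "v_next": 1.0, "q_chunk": 1.0, "q_partial": 0.5, "v": 2.0}
losses = dqc_losses(sample, gamma=0.5, h=2, kappa_d=0.75, kappa_b=0.75)
print(losses)
```

Sharing one weighting helper between the two losses makes the only difference explicit: the expectile loss squares the residual, while the quantile loss takes its absolute value.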
3. Policy Extraction and Optimization
Rather than direct parametric training of $\pi$, DQC relies on best-of-$N$ sampling from the behavior prior $\pi_\beta$:
- At evaluation time, $N$ candidate chunk prefixes are sampled.
- The candidate maximizing the partial critic $Q^P$ is selected for execution.
This approach avoids adversarial exploitation of the value function by the policy, an issue encountered in offline RL with explicit policy objectives. In principle, entropy or other regularization can be incorporated if training a parametric $\pi$.
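The selection rule described above can be sketched as follows. The sampler `toy_prior`, the scorer `toy_partial_critic`, the prefix length of 2, and the candidate count of 32 are hypothetical stand-ins for the learned behavior prior and partial critic; only the sample-then-argmax structure reflects the text.

```python
import random

def best_of_n(sample_prefix, partial_critic, state, n, rng):
    """Draw n candidate prefixes from the prior and keep the highest-scoring one."""
    candidates = [sample_prefix(state, rng) for _ in range(n)]
    return max(candidates, key=lambda prefix: partial_critic(state, prefix))

def toy_prior(state, rng):
    """Hypothetical behavior prior: a uniform random prefix of two scalar actions."""
    return tuple(rng.uniform(-1.0, 1.0) for _ in range(2))

def toy_partial_critic(state, prefix):
    """Hypothetical critic: prefers actions close to the scalar state."""
    return -sum((a - state) ** 2 for a in prefix)

rng = random.Random(0)
prefix = best_of_n(toy_prior, toy_partial_critic, state=0.3, n=32, rng=rng)
action = prefix[0]  # closed-loop execution: act, observe, then sample again
print(prefix, action)
```

Because candidates come only from the behavior prior, the selected prefix stays within the support of the data, which is the regularizing effect the prior is meant to provide.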
4. Experimental Evaluation
DQC was evaluated on six challenging long-horizon goal-conditioned environments drawn from OGBench, including cube-triple, cube-quadruple, cube-octuple (robot-arm cube reconfiguration), humanoidmaze-giant (4000-step humanoid navigation), and puzzle-4×5 / 4×6 button-flip tasks. Datasets comprised between 3 million and 1 billion transitions.
Baselines included:
- OS (1-step TD / IQL), NS (n-step TD), QC (Q-chunking with ), DQC-naïve (QC policy, execute only prefix)
- SHARSA (previous SOTA hierarchical RL), IQL/HIQL (implicit Q), FBC/HFBC (flow behavior cloning)
Metrics were success rate averaged across 5 tasks × 50 seeds and aggregate OGBench scores.
Key findings:
- DQC with or consistently surpassed all baselines.
- Ablations indicated that the distilled critic was essential (QC-NS vs DQC).
- DQC reduced percentile variance, providing more stable performance (Li et al., 11 Dec 2025).
5. Practical Hyperparameters and Implementation
Key hyperparameters and practical guidance include:
- Implicit losses: Expectile for distillation (–$0.8$), quantile for value backup (–$0.97$).
- Best-of-$N$ sampling: used for candidate chunk evaluation.
- Batch size: $4096$ critical for stable training in complex environments.
- Chunk lengths: Critic chunk length (e.g., or ).
- Computation: Critic updates are , distillation adds one extra pass on chunk prefixes, policy extraction cost is .
DQC is robust to most hyperparameter choices as long as loss asymmetry (optimism) is maintained; performance degrades when that asymmetry is lost. A trade-off exists between computational cost and both the best-of-$N$ sample count and the batch size.
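To illustrate why the asymmetry matters, the sketch below fits a single scalar to a set of hypothetical completion values under the asymmetric squared loss. The function name `expectile`, the parameter name `kappa`, and the sample values are assumptions for illustration. With `kappa = 0.5` the loss is symmetric and the fit is the plain mean; larger values move the fit toward the maximum, which is the optimistic behavior the distillation step relies on.

```python
def expectile(values, kappa, iters=100):
    """Scalar minimizing the asymmetric squared loss, via reweighted means."""
    q = sum(values) / len(values)
    for _ in range(iters):
        # Values at or above the current estimate get weight kappa, others 1 - kappa.
        weights = [kappa if v >= q else 1.0 - kappa for v in values]
        q = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return q

# Hypothetical chunked-critic values for different completions of one prefix.
completions = [0.0, 1.0, 2.0, 10.0]
for kappa in (0.5, 0.8, 0.99):
    print(kappa, round(expectile(completions, kappa), 3))
```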
6. Limitations and Research Directions
DQC currently uses fixed chunk lengths across all states, which may not be optimal in environments with variable horizon structure. Adaptive or state-dependent chunk lengths are suggested as a possible improvement.
Extensions include:
- Parametric policy training against the distilled critic for online or fine-tuning settings.
- Incorporation of stochastic or model-based rollouts for filling in the unobserved portion of chunks.
- Further investigation into optimizing computational efficiency, especially as chunk length grows.
A plausible implication is that DQC's decoupled chunking mechanism can be generalized to additional settings where the separation of value propagation and policy reactivity is beneficial for learning stability and sample efficiency.
7. Position in Offline RL Literature
DQC addresses challenges in previous chunked-critic approaches by eliminating the requirement for open-loop chunk policies—improving reactivity and mitigating sub-optimality for long action chunks. Its structure allows for efficient multi-step value propagation while preserving compatibility with reactive, closed-loop policies, distinguishing it from both standard multi-step TD and open-loop chunking algorithms. The empirical advances and ablation studies demonstrate the substantive impact of the distilled partial critic and decoupling mechanism on both sample efficiency and reliability in long-horizon offline RL (Li et al., 11 Dec 2025).