Decoupled Q-chunking (DQC) in Offline RL

Updated 1 April 2026
  • Decoupled Q-chunking (DQC) is a reinforcement learning method that separates the critic's long-horizon value estimation from the policy's short-horizon, closed-loop action selection.
  • It addresses bootstrapping bias and open-loop execution issues by using a multi-step backup for value propagation and a partial-chunk critic for reactive, short-step decisions.
  • Empirical evaluations on OGBench tasks show DQC reaching higher success rates than one-step TD and full Q-chunking on every tabulated task, and higher than n-step returns on four of the six.

Decoupled Q-chunking (DQC) is a reinforcement learning (RL) algorithm designed to achieve robust and reactive policy learning in long-horizon, goal-conditioned offline environments by decoupling the chunk length of value estimation from that of policy execution. DQC advances over standard temporal-difference (TD) and chunked critic (Q-chunking) methods by combining large-horizon, multi-step value propagation with a closed-loop, short-horizon policy, thereby addressing core limitations related to bootstrapping bias, open-loop execution, and policy reactivity (Li et al., 11 Dec 2025).

1. Background and Motivation

TD methods compute value estimates by recursively bootstrapping from their own outputs, but this leads to bootstrapping bias: errors in the value target accumulate over multiple steps, particularly in long-horizon tasks. Multi-step approaches (e.g., n-step returns) partially alleviate this bias but induce high off-policy error. Chunked critics (Q-chunking) learn to estimate the value of entire $H$-step action sequences, enabling more rapid value propagation without introducing off-policy bias:

$$Q_c(s, a_{0:H-1}) \approx \mathbb{E} \left[\sum_{i=0}^{H-1} \gamma^i r(s_i, a_i) + \gamma^H \max_{a'} Q_c(s_H, a'_{0:H-1}) \right]$$
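As a small worked example, the chunked target above can be evaluated for one sampled chunk. This is a minimal sketch: the function name `chunked_td_target` and the argument `bootstrap_value` (standing in for the $\max_{a'} Q_c(s_H, a'_{0:H-1})$ term) are illustrative assumptions, not names from the paper.

```python
from typing import Sequence


def chunked_td_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Return sum_i gamma^i * r_i + gamma^H * bootstrap_value, with H = len(rewards)."""
    horizon = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** horizon) * bootstrap_value


# Chunk of H = 3 rewards, bootstrapping from a next-chunk value of 10.
target = chunked_td_target([1.0, 0.0, 2.0], bootstrap_value=10.0, gamma=0.5)
print(target)  # 1 + 0 + 0.25 * 2 + 0.125 * 10 = 2.75
```

A single backup of this form carries reward information across $H$ environment steps, which is the sense in which chunking speeds up value propagation.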

Extracting policies from such critics requires open-loop output of the complete $H$-step chunk, which is both difficult to learn and ill-suited for environments needing high reactivity or where chunk lengths grow large.

The core innovation in DQC is to decouple the chunk length used for critic value propagation ($H$) from that used for the policy to output action sequences ($k \ll H$), and distill the information from the long-horizon chunked critic into a partial-chunk critic, thereby enabling closed-loop short-horizon action selection (Li et al., 11 Dec 2025).
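The reactivity argument can be illustrated with a toy rollout. The sketch below is not the DQC policy: `greedy_chunk`, `step`, the goal position, and the one-off disturbance are all hypothetical, chosen only to contrast executing one long chunk open-loop with re-querying a short-chunk policy every $k$ steps.

```python
from typing import Callable, List

GOAL = 10  # hypothetical goal position on a 1-D line


def greedy_chunk(state: int, n: int) -> List[int]:
    """Hypothetical policy: plan n unit moves toward GOAL, assuming no disturbance."""
    actions, pos = [], state
    for _ in range(n):
        a = (GOAL > pos) - (GOAL < pos)  # +1, 0, or -1
        actions.append(a)
        pos += a
    return actions


def step(state: int, action: int, t: int) -> int:
    """Toy dynamics with a one-off external push of -3 at time step 2."""
    return state + action + (-3 if t == 2 else 0)


def rollout(policy: Callable[[int, int], List[int]], state: int,
            chunk_len: int, total_steps: int) -> int:
    """Re-query the policy every chunk_len steps; each chunk runs without feedback."""
    t = 0
    while t < total_steps:
        for action in policy(state, chunk_len)[: total_steps - t]:
            state = step(state, action, t)
            t += 1
    return state


open_loop_final = rollout(greedy_chunk, 0, chunk_len=15, total_steps=15)
closed_loop_final = rollout(greedy_chunk, 0, chunk_len=3, total_steps=15)
print(open_loop_final, closed_loop_final)  # prints "7 10"
```

With a single 15-step chunk the plan never sees the push and stops short of the goal, while the 3-step variant observes the displaced state at its next query and recovers.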

2. Formal Framework and Notation

DQC is built on the following formal components:

  • Chunked Critic: $Q_c(s, a_{0:H-1})$, estimating the expected return for $H$-step action sequences.
  • Multi-step Value Backup:

$$Q_c(s, a_{0:H-1}) \approx \mathbb{E}_{s_{1:H}, a_{H:2H-1} \sim P_D} \left[ \sum_{i=0}^{H-1} \gamma^i r(s_i, a_i) + \gamma^H \max_{a_{H:2H-1}} Q_c(s_H, a_{H:2H-1}) \right]$$

  • Partial-chunk Critic: $Q_p(s, a_{0:k-1})$ for $k < H$, approximating the optimistic return when a partial chunk is optimally completed:

$$Q_p(s, a_{0:k-1}) \approx \max_{a_{k:H-1}} Q_c(s, a_{0:H-1})$$

  • Policy: a short-horizon policy that outputs $k$ actions per step, with $k \ll H$.

This structural decoupling allows the policy to operate reactively while still benefitting from accelerated multi-step value propagation through the chunked critic.
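To make "a partial chunk optimally completed" concrete, the sketch below computes that quantity by brute force for a tiny discrete action set. Everything here is an illustrative assumption: `toy_q_chunk` is a made-up chunked critic, and exhaustive enumeration is used only because the example is small. DQC itself avoids this enumeration by learning the partial-chunk critic with an implicit-max loss.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Chunk = Tuple[int, ...]


def partial_chunk_value(q_chunk: Callable[[int, Chunk], float], state: int,
                        prefix: Chunk, actions: Sequence[int], horizon: int) -> float:
    """Max of q_chunk(state, prefix + completion) over all completions of length H - k."""
    remaining = horizon - len(prefix)
    return max(q_chunk(state, prefix + tail) for tail in product(actions, repeat=remaining))


def toy_q_chunk(state: int, chunk: Chunk) -> float:
    """Hypothetical chunked critic: best when the chunk moves the state to position 2."""
    return -abs(state + sum(chunk) - 2)


ACTIONS = (-1, 0, 1)
q_p_right = partial_chunk_value(toy_q_chunk, 0, (1,), ACTIONS, horizon=3)
q_p_left = partial_chunk_value(toy_q_chunk, 0, (-1,), ACTIONS, horizon=3)
print(q_p_right, q_p_left)  # prints "0 -1"
```

The enumeration visits $|A|^{H-k}$ completions, which is why a learned approximation is needed once the action space is continuous or $H$ is large.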

3. Implicit Maximization and Loss Functions

DQC trains $Q_p$ without explicit maximization over the remaining actions $a_{k:H-1}$, using an optimism-based implicit-max loss (e.g., expectile or quantile regression):

[…]

At its optimum, $Q_p$ approaches the supremum over completions of the partial chunk. Three loss functions define the DQC optimization:

  • Chunked Critic Fitting:

[…]

[…]

  • Policy Extraction (Actor Step):

[…]

Optionally, a value network can be trained to further stabilize training by approximating […] with an additional expectile or quantile loss.
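The idea behind an implicit-max loss can be sketched with plain expectile regression on a handful of scalar targets. This is a generic illustration, not the paper's loss: the sign convention (`diff = target - prediction`), the name `tau` for the expectile level, and the scalar gradient-descent fit are all assumptions made for the example.

```python
from typing import Sequence


def expectile_loss(diff: float, tau: float) -> float:
    """Asymmetric squared loss; diff = target - prediction. Under-predictions cost tau."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(targets: Sequence[float], tau: float, steps: int = 1000, lr: float = 0.1) -> float:
    """Fit one scalar prediction to the targets by gradient descent on the mean loss."""
    m = sum(targets) / len(targets)
    for _ in range(steps):
        grad = sum(-2.0 * (tau if t > m else 1.0 - tau) * (t - m) for t in targets) / len(targets)
        m -= lr * grad
    return m


targets = [0.0, 1.0, 4.0]
neutral = fit_expectile(targets, tau=0.5)      # stays at the mean, 5/3
optimistic = fit_expectile(targets, tau=0.99)  # lands between 3.9 and 4.0, near the max
print(neutral, optimistic)
```

As `tau` approaches 1 the fitted value approaches the largest target, so regressing a partial-chunk critic onto chunked-critic values with a high expectile behaves like a soft maximum over the completions seen in the data, without enumerating them.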

4. Algorithmic Workflow and Implementation Details

The algorithm relies on a four-module architecture involving critic, partial-critic, value, and behavior prior networks, all instantiated as 4-layer MLPs of 1024 units with ReLU activations. The workflow per gradient step is:

  1. Update the chunked critic $Q_c$ by minimizing the multi-step Bellman error using TD backup with […].
  2. Distill $Q_c$ into the partial-chunk critic $Q_p$ by minimizing the implicit-max loss over sampled batch trajectories.
  3. Optionally update the value network with a similar implicit-max loss on top of […].
  4. Extract policy actions via best-of-$N$ sampling using a flow-matched behavior prior (test-time […]).
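The final step, best-of-N action selection, can be sketched as follows. The sampler and the scoring function are toy stand-ins: `uniform_prior` replaces the flow-matched behavior prior and `toy_partial_critic` replaces the learned partial-chunk critic, so only the selection logic reflects the workflow above.

```python
import random
from typing import Callable, Tuple

Chunk = Tuple[int, ...]


def best_of_n(sample_chunk: Callable, score: Callable[[int, Chunk], float],
              state: int, n: int, rng) -> Chunk:
    """Draw n candidate chunks from the prior and keep the one the critic scores highest."""
    candidates = [sample_chunk(state, rng) for _ in range(n)]
    return max(candidates, key=lambda chunk: score(state, chunk))


def uniform_prior(state: int, rng: random.Random, k: int = 2) -> Chunk:
    """Hypothetical behavior prior: k actions drawn uniformly from {-1, 0, 1}."""
    return tuple(rng.choice((-1, 0, 1)) for _ in range(k))


def toy_partial_critic(state: int, chunk: Chunk) -> float:
    """Hypothetical partial-chunk critic: prefers chunks that reach position 2."""
    return -abs(state + sum(chunk) - 2)


chosen = best_of_n(uniform_prior, toy_partial_critic, state=0, n=32, rng=random.Random(0))
print(chosen, toy_partial_critic(0, chosen))
```

Because candidates come only from the behavior prior, the selected chunk stays within the data distribution, and a larger `n` trades compute for a better chance of sampling a high-value chunk.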

Hyperparameters include […] for cube/puzzle, […] for humanoid; […] selected from […]; expectile regression for distillation ([…]), and quantile regression for value backup ([…]). The optimizer is Adam with learning rate […] and batch size […]. Flow policies for priors are trained using flow-matching.
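For a sense of scale, the network size stated in this section can be turned into a parameter count. The sketch assumes that "4-layer" means four hidden layers of width 1024 followed by a linear output layer, and the input width of 64 is a made-up placeholder; neither detail is given in the text.

```python
def mlp_param_count(in_dim: int, out_dim: int, hidden: int = 1024, hidden_layers: int = 4) -> int:
    """Count weights and biases of a fully connected MLP with equal-width hidden layers."""
    dims = [in_dim] + [hidden] * hidden_layers + [out_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


# Hypothetical critic: 64 input features (state plus action chunk), scalar output.
critic_params = mlp_param_count(in_dim=64, out_dim=1)
print(critic_params)  # 3216385
```

Under these assumptions each of the four modules has roughly 3.2 million parameters, dominated by the three 1024-by-1024 hidden-to-hidden layers.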

5. Empirical Evaluation and Results

DQC is empirically evaluated on the OGBench suite (cube-triple, cube-quadruple, cube-octuple, humanoid-giant, puzzle-4×5, puzzle-4×6) and compared against a comprehensive set of baselines: one-step TD (OS), n-step returns (NS), full Q-chunking (QC), naive decoupling (QC-naïve), QC+NS, and offline GCRL methods (FBC, HFBC, IQL, HIQL, SHARSA). Success rates are reported as mean ± 95% CI over 10 seeds. Among the methods tabulated here, DQC has the highest or tied-highest mean success rate on four of the six tasks, while n-step returns score higher on humanoid-giant and puzzle-4×6:

Task             DQC    QC     NS     OS     SHARSA
cube-triple      98%    20%    93%    47%    83%
cube-quadruple   92%    35%    27%    0%     64%
cube-octuple     34%    0%     9%     0%     34%
humanoid-giant   92%    48%    95%    0%     19%
puzzle-4×5       96%    20%    93%    19%    1%
puzzle-4×6       83%    28%    91%    19%    64%
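The table can be summarized with a short script. The unweighted mean across the six tasks is an aggregate chosen here for illustration and is not a figure reported in the source; the per-task numbers are copied from the table.

```python
SUCCESS = {
    "cube-triple":    {"DQC": 98, "QC": 20, "NS": 93, "OS": 47, "SHARSA": 83},
    "cube-quadruple": {"DQC": 92, "QC": 35, "NS": 27, "OS": 0,  "SHARSA": 64},
    "cube-octuple":   {"DQC": 34, "QC": 0,  "NS": 9,  "OS": 0,  "SHARSA": 34},
    "humanoid-giant": {"DQC": 92, "QC": 48, "NS": 95, "OS": 0,  "SHARSA": 19},
    "puzzle-4×5":     {"DQC": 96, "QC": 20, "NS": 93, "OS": 19, "SHARSA": 1},
    "puzzle-4×6":     {"DQC": 83, "QC": 28, "NS": 91, "OS": 19, "SHARSA": 64},
}
METHODS = ["DQC", "QC", "NS", "OS", "SHARSA"]

# Unweighted mean success rate (%) across the six tasks, per method.
mean_success = {m: sum(row[m] for row in SUCCESS.values()) / len(SUCCESS) for m in METHODS}

# Method(s) with the highest success rate on each task; ties are kept.
best_per_task = {
    task: [m for m in METHODS if row[m] == max(row.values())]
    for task, row in SUCCESS.items()
}

print(mean_success["DQC"], mean_success["NS"])  # prints "82.5 68.0"
print(best_per_task["humanoid-giant"], best_per_task["cube-octuple"])
```

On these numbers DQC has the highest task-averaged success rate of the five listed methods, with NS ahead on humanoid-giant and puzzle-4×6 and SHARSA tied on cube-octuple.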

Ablation studies demonstrate the necessity of implicit-max distillation: non-distilled partial critics underperform. Success is sensitive to batch size (large batches are crucial), to $N$ in best-of-$N$ sampling ([…] necessary), and to the optimism level in the loss ([…], […] optimal). Absence of optimism ([…]) yields failure.

6. Theoretical Properties and Guarantees

DQC analysis is grounded in the concept of open-loop consistency (OLC), which bounds the value estimation bias of the chunked critic:

[…]

Strong OLC (Definition 2) implies a sub-optimality gap of chunked Q-learning bounded by […] (Theorem 4), while closed-loop execution yields an additional factor-[…] overhead (Theorem 8). Under bounded optimality variability (Definition 5), the closed-loop bound improves to […] (Theorem 9).

A comparison to n-step returns demonstrates that, for […]-sub-optimal data, the Q-chunking policy provably outperforms the n-step policy when […] (Theorem 6). DQC does not require explicit […]-step closed-loop policy execution, resulting in improved stability and reactivity for long-horizon offline RL.

7. Practical Implications and Interpretations

DQC enables learning policies that are both sample efficient and highly reactive in challenging long-horizon tasks. By separating the chunk length for value updates from that for policy execution, DQC sidesteps the often prohibitive modeling and sub-optimality issues associated with open-loop sequence prediction. In summary, DQC combines the rapid value propagation of multi-step Q-chunking with closed-loop, short-chunk policies, delivering stable and robust performance while removing critical barriers to scaling chunked critics in practical long-horizon RL applications (Li et al., 11 Dec 2025).

References (1)
1. Li et al., 11 Dec 2025.