Distilled Partial Chunk Critic in Offline RL

Updated 28 February 2026

The Distilled Partial Chunk Critic is a novel value estimation technique that decouples long-horizon TD backup from reactive, closed-loop policy execution.
It employs optimistic expectile regression to distill maximal value predictions from a k-step chunked critic for shorter action blocks, ensuring stable learning.
Empirical results on long-horizon tasks demonstrate that the DQC framework outperforms traditional methods by balancing efficient credit assignment with flexible policy reactivity.

The Distilled Partial Chunk Critic is a value estimation approach in offline reinforcement learning that forms the core of the Decoupled Q-Chunking (DQC) framework. It enables multi-step value propagation by learning a critic over long open-loop action chunks, then distilling a partial-chunk critic that enables closed-loop, reactive policy extraction over much shorter action blocks. This design allows agents to benefit from efficient long-horizon temporal-difference (TD) backup without incurring the optimization and expressivity challenges associated with open-loop policies for long action sequences (Li et al., 11 Dec 2025).

1. Formal Definitions and Notation

The chunked critic framework defines two separate chunk lengths: the critic chunk length $k$ and the policy chunk length $l < k$. An action chunk of length $k$ is written $a_{1:k} = (a_1,\dots,a_k)$. The $k$-step chunked critic, $Q^C(s,a_{1:k})$, estimates the expected discounted return from executing $a_{1:k}$ open-loop, then continuing optimally in future $k$-chunks:

$Q^{C}(s_{t},a_{t:t+k-1}) \approx \mathbb{E}\left[\sum_{i=0}^{k-1}\gamma^i r(s_{t+i},a_{t+i}) + \gamma^k \max_{a'_{t+k:t+2k-1}} Q^{C}(s_{t+k},a'_{t+k:t+2k-1})\right].$

The Distilled Partial Chunk Critic, denoted $Q^P(s,a_{1:l})$, approximates the maximal value achievable by optimally completing a given partial chunk:

$Q^P(s,a_{1:l}) \approx \max_{a_{l+1:k}} Q^C(s,[a_{1:l},a_{l+1:k}]).$

Only the first $l$ actions are produced and executed per decision step (closed-loop policy), and the process iterates.

2. Distilled Partial-Chunk Backup and Optimistic Regression

The key insight is to perform an "optimistic" backup: for any fixed partial action chunk $a_{1:l}$, the value assigned is the maximal $Q^C$ obtained by optimally completing the remainder of the chunk:

$Q^P(s,a_{1:l}) \leftarrow \max_{a_{l+1:k}} Q^C(s,[a_{1:l},a_{l+1:k}]).$

Direct maximization over completions is intractable, so DQC adopts a practical approximation known as "max-plus distillation": an implicit-max regression via an expectile loss with parameter $\kappa > 0.5$, which biases the estimator upwards toward the maximizing completion. The distillation objective is

$L_{\text{distill}}(\psi) = \mathbb{E}_{(s,a_{1:k})\sim D}\left[f^{\kappa}_{\text{imp}}(Q^C_{\bar{\phi}}(s,a_{1:k}) - Q^P_{\psi}(s,a_{1:l}))\right],$

where $f^{\kappa}_{\text{imp}}$ denotes the expectile squared-error loss, $f^{\kappa}_{\text{imp}}(u) = |\kappa - \mathbb{1}(u < 0)|\,u^2$. Because positive residuals (teacher above student) are weighted by $\kappa$ and negative residuals by $1-\kappa$, choosing $\kappa > 0.5$ drives $Q^P$ toward the maximal value obtainable from $Q^C$ among the completions seen in the data.
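The upward bias of the expectile loss can be checked numerically. The following is a minimal sketch in plain Python, not the DQC implementation: the function names, the fixed-point solver, and the four teacher values are illustrative assumptions. It computes the constant that minimizes the mean expectile loss over a handful of hypothetical $Q^C$ values sharing the same partial chunk.

```python
def expectile_loss(u, kappa):
    """Asymmetric squared error: weight kappa on u >= 0, weight 1 - kappa on u < 0."""
    weight = kappa if u >= 0 else 1.0 - kappa
    return weight * u * u


def expectile(values, kappa, iters=200):
    """Constant m minimizing the mean of expectile_loss(v - m, kappa).

    Solved by fixed-point iteration on the first-order condition
    sum_i w_i * (v_i - m) = 0, with w_i = kappa if v_i > m else 1 - kappa.
    """
    m = sum(values) / len(values)
    for _ in range(iters):
        weights = [kappa if v > m else 1.0 - kappa for v in values]
        m = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return m


# Hypothetical teacher values Q^C(s, [a_{1:l}, a_{l+1:k}]) for four
# different completions a_{l+1:k} of the same partial chunk a_{1:l}.
teacher_values = [0.0, 1.0, 2.0, 3.0]

mean_estimate = expectile(teacher_values, kappa=0.5)        # ordinary mean
optimistic_estimate = expectile(teacher_values, kappa=0.9)  # pulled toward the max
```

With $\kappa = 0.5$ the minimizer is the plain mean; raising $\kappa$ moves it toward the largest teacher value without ever exceeding it, which is the sense in which the regression acts as an implicit max.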

3. Loss Functions, Optimization, and Policy Extraction

The DQC architecture involves three loss functions:

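Chunked Critic TD Loss: For $Q^C$, standard $k$-step TD loss is applied:

$L_{\rm QC}(\phi) = \mathbb{E}\left[\left(Q^C_{\phi}(s,a_{1:k}) - (R_{0:k-1}+\gamma^{k} V_{\xi}(s_k))\right)^2\right],$

where $R_{0:k-1} = \sum_{i=0}^{k-1}\gamma^i r_i$ is the discounted reward sum over the chunk and $s_k$ is the state reached after executing it.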
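As an illustration of the regression target inside this loss, the sketch below computes $R_{0:k-1} + \gamma^k V(s_k)$ for a single chunk. The function name and the toy numbers are assumptions made for the example; in practice the bootstrap value would come from the learned value head.

```python
def k_step_td_target(rewards, gamma, bootstrap_value):
    """Discounted sum of the chunk's k rewards plus gamma^k times the bootstrap value."""
    k = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** k) * bootstrap_value


def squared_td_error(q_value, target):
    """Per-sample squared error between the critic prediction and the TD target."""
    return (q_value - target) ** 2


# Toy chunk with k = 3, gamma = 0.5 and V(s_k) = 4:
# 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25
target = k_step_td_target([1.0, 1.0, 1.0], gamma=0.5, bootstrap_value=4.0)
```

A single backup moves value information across $k$ environment steps, and the bootstrap term is discounted only by $\gamma^k$ (roughly 0.975 for $\gamma = 0.999$ and $k = 25$).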

Partial Critic Distillation Loss: For $Q^P$ , expectile regression towards the "teacher" $Q^C$ on demonstration chunks:

$L_{\rm distill}(\psi) = \mathbb{E}\left[f^{\kappa_d}_{\text{expectile}}(Q^C_{\phi}(s,a_{1:k}) - Q^P_{\psi}(s,a_{1:l}))\right].$

Value Head Loss: For $V_\xi$, quantile regression at a high quantile level $\kappa_b$ drives $V_\xi(s)$ toward the maximal $Q^P$ over candidate chunks:

$L_{V}(\xi) = \mathbb{E}\left[f^{\kappa_b}_{\text{quantile}}(\bar{Q}^P_{\psi}(s,a_{1:l}) - V_{\xi}(s))\right].$
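To see why a high quantile level approximates a maximum, the sketch below uses the standard quantile (pinball) loss and assumes that $f^{\kappa_b}_{\text{quantile}}$ takes this form. The helper names and the ten toy values are hypothetical, and the search is restricted to the sample values themselves for simplicity.

```python
def quantile_loss(u, kappa):
    """Pinball loss: weight kappa on u >= 0, weight 1 - kappa on u < 0, linear in |u|."""
    weight = kappa if u >= 0 else 1.0 - kappa
    return weight * abs(u)


def best_constant(values, kappa):
    """Return the element of `values` with the smallest mean pinball loss as a constant prediction."""
    def mean_loss(v):
        return sum(quantile_loss(x - v, kappa) for x in values) / len(values)
    return min(values, key=mean_loss)


# Hypothetical Q^P(s, a_{1:l}) values for ten candidate chunks at one state.
q_partial_values = [float(i) for i in range(1, 11)]

# A high quantile level selects a value close to the top of the range.
v_high = best_constant(q_partial_values, kappa=0.85)
```

For these values the minimizer is 9.0, well above the median; pushing the level further toward 1 (the source reports $\kappa_b \approx 0.97$) moves the estimate closer to the maximum.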

The policy is not explicitly learned. Instead, at test time, a set of $N$ action chunks $\{a_{1:l}^i\}_{i=1}^N$ is sampled from a flow-based behavior prior $\pi_\beta(\cdot|s)$, and the chunk maximizing $Q^P$ is selected:

$a^{*} = \arg\max_{i} Q^P_{\psi}(s,a_{1:l}^i).$
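Best-of-$N$ selection can be sketched in a few lines. The sampler and critic below are toy stand-ins (hypothetical `toy_prior` and `toy_q_partial`), not the flow-based prior or the learned $Q^P$; only the selection logic mirrors the description above.

```python
import random


def best_of_n(state, sample_chunk, q_partial, n):
    """Draw n candidate l-step chunks from the prior and keep the one with the highest Q^P."""
    candidates = [sample_chunk(state) for _ in range(n)]
    return max(candidates, key=lambda chunk: q_partial(state, chunk))


# Toy stand-ins: a uniform "prior" over 5-step scalar action chunks and a
# critic that prefers actions close to 0.5. Both are illustrative only.
rng = random.Random(0)


def toy_prior(state):
    return [rng.uniform(-1.0, 1.0) for _ in range(5)]


def toy_q_partial(state, chunk):
    return -sum((a - 0.5) ** 2 for a in chunk)


chosen_chunk = best_of_n(state=0.0, sample_chunk=toy_prior, q_partial=toy_q_partial, n=32)
```

Because candidates come from the behavior prior, the selected chunk stays within the data distribution while still being steered by the critic.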

4. Algorithmic Workflow and Pseudocode

The full DQC algorithm for offline batch RL operates as follows:

For each gradient step:
- Sample trajectory chunks $(s_t,\dots,s_{t+k},\ a_t,\dots,a_{t+k-1},\ r_t,\dots,r_{t+k-1})$ from the offline dataset $D$.
- Update $Q^C$ parameters via $k$-step TD loss.
- Update $Q^P$ via expectile distillation from $Q^C$.
- Update $V_\xi$ via quantile regression towards $Q^P$.
At test time, for each state $s$:
- Draw $N$ candidate $l$-step chunks from $\pi_\beta$.
- Select the chunk with maximal $Q^P$.

The decoupling of $k$ (backup length) and $l$ (policy horizon) permits the agent to plan long-term with $Q^C$ but act flexibly at short horizons using $Q^P$ .
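The data side of this workflow can be illustrated with a small slicing helper. This is a sketch under assumed conventions: a trajectory is stored as plain lists with one more state than actions, and `slice_chunk` is a hypothetical name. It shows how one sampled $k$-step chunk feeds both critics, with $Q^C$ consuming all $k$ actions and $Q^P$ only the first $l$.

```python
def slice_chunk(states, actions, rewards, t, k):
    """Return (s_t..s_{t+k}, a_t..a_{t+k-1}, r_t..r_{t+k-1}) from one trajectory.

    Assumes len(states) == len(actions) + 1 == len(rewards) + 1.
    """
    if t < 0 or t + k > len(actions):
        raise ValueError("chunk does not fit inside the trajectory")
    return states[t:t + k + 1], actions[t:t + k], rewards[t:t + k]


# Toy trajectory with 6 transitions.
states = [0, 1, 2, 3, 4, 5, 6]
actions = ["a0", "a1", "a2", "a3", "a4", "a5"]
rewards = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]

k, l = 3, 2
chunk_states, chunk_actions, chunk_rewards = slice_chunk(states, actions, rewards, t=1, k=k)

critic_input = (chunk_states[0], chunk_actions)       # (s_t, a_{t:t+k-1}) for Q^C
partial_input = (chunk_states[0], chunk_actions[:l])  # (s_t, a_{t:t+l-1}) for Q^P
bootstrap_state = chunk_states[-1]                    # s_{t+k} for the value head
```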

5. Implementation Details and Architectural Choices

Key instantiations in DQC are as follows:

Network architecture: 4-layer MLP, 1024 units/layer, with ReLU nonlinearities.
Critic ensembles: Size $K=2$ ; minimum ensembling for cube tasks, mean aggregation for maze/puzzle settings.
Chunk sizes: $k=25$ (critic), $l \in \{1,5\}$ (policy, tuned per task).
Loss and optimism: Expectile with $\kappa_d \approx 0.8$ for distillation, quantile with $\kappa_b \approx 0.97$ for value head.
Sampling: Best-of-N selection with $N=32$ candidates from the flow-based prior ( $\pi_\beta$ with 10 coupling steps).
Optimization: Adam optimizer, learning rate $3 \times 10^{-4}$ , batch size $4096$ for stability, target network update $\lambda = 5 \times 10^{-3}$ , $\gamma=0.999$ .
Engineering practices: Large batches ensure expectile regression stability and mild upward bias in the critic; the action-flow prior regularizes the chunk sampling distribution.
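For reference, the reported settings can be gathered into one configuration object. The sketch below is a hypothetical container: the class and field names are assumptions, the default values are the ones listed above, and `policy_chunk_length` defaults to 5 although the source tunes it per task over $\{1, 5\}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQCConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    critic_chunk_length: int = 25      # k
    policy_chunk_length: int = 5       # l, tuned per task in {1, 5}
    distill_expectile: float = 0.8     # kappa_d (approximate)
    value_quantile: float = 0.97       # kappa_b (approximate)
    num_candidates: int = 32           # N for best-of-N selection
    ensemble_size: int = 2             # K critics
    mlp_layers: int = 4
    mlp_units: int = 1024
    learning_rate: float = 3e-4
    batch_size: int = 4096
    target_update_rate: float = 5e-3   # lambda
    discount: float = 0.999            # gamma

    def __post_init__(self):
        # The policy chunk must be strictly shorter than the critic chunk (l < k).
        if not 0 < self.policy_chunk_length < self.critic_chunk_length:
            raise ValueError("require 0 < l < k")
        # Optimistic regression needs levels above 0.5 and below 1.
        for level in (self.distill_expectile, self.value_quantile):
            if not 0.5 < level < 1.0:
                raise ValueError("optimism levels must lie in (0.5, 1)")


default_config = DQCConfig()
reactive_config = DQCConfig(policy_chunk_length=1)
```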

6. Empirical Results and Comparative Analysis

DQC and the Distilled Partial Chunk Critic were evaluated on six OGBench long-horizon tasks: cube-triple, cube-quadruple, cube-octuple, humanoidmaze-giant, puzzle-4×5, and puzzle-4×6. Performance comparisons demonstrated that DQC outperformed:

Open-loop Q-chunking (QC) using full $k$ -step blocks.
Naïve partial chunking (QC-NS), which simply executes $l$ steps of the QC policy.
Standard $n$ -step return TD (NS) and 1-step TD (OS).
Prior state-of-the-art offline RL approaches including SHARSA, IQL, HIQL, FBC, and HFBC.

Ablation studies indicated that excluding the distilled critic (the QC-NS variant) sharply reduced performance, particularly with $l>1$. Varying the policy chunk length $l$ showed that $l=1$ and $l=5$ were both effective, whereas large $l$ reintroduced the optimization difficulty of open-loop policies. The method was robust to the choice of implicit loss (expectile vs. quantile) as long as mild optimism ($\kappa > 0.5$) was maintained. Large batches ($\geq 4096$) were essential for convergence and stability on challenging tasks. For best-of-$N$ selection, $N=32$ was typical, and increasing $N$ to $128$ provided little further gain.

7. Significance and Practical Implications

The Distilled Partial Chunk Critic decouples the benefit of long-horizon, multi-step TD backup (efficient credit assignment) from closed-loop, reactive policy execution. By learning a partial chunk critic through optimistic regression from a $k$-step chunked critic, and extracting policies via best-of-$N$ sampling, DQC sidesteps the memorization and optimization bottlenecks inherent in open-loop chunk policies for long blocks. A plausible implication is improved scalability to more complex, sparse-reward, or long-horizon tasks without incurring prohibitive policy search or representational complexity (Li et al., 11 Dec 2025). The framework is particularly relevant to offline RL settings where long-term planning is essential but policy flexibility and reactivity cannot be compromised.

References (1)

Decoupled Q-Chunking (2025)
