Distilled Partial Chunk Critic in Offline RL
- The Distilled Partial Chunk Critic is a novel value estimation technique that decouples long-horizon TD backup from reactive, closed-loop policy execution.
- It uses optimistic expectile regression to distill, from a k-step chunked critic, the maximal value attainable for shorter action blocks.
- On six long-horizon OGBench tasks, the Decoupled Q-Chunking (DQC) framework outperforms open-loop Q-chunking, multi-step and 1-step TD baselines, and prior offline RL methods by balancing efficient credit assignment with reactive policy execution.
The Distilled Partial Chunk Critic is a value estimation approach in offline reinforcement learning that forms the core of the Decoupled Q-Chunking (DQC) framework. It achieves multi-step value propagation by learning a critic over long open-loop action chunks, then distilling a partial-chunk critic that supports closed-loop, reactive policy extraction over much shorter action blocks. This design lets agents benefit from efficient long-horizon temporal-difference (TD) backup without incurring the optimization and expressivity challenges of open-loop policies over long action sequences (Li et al., 11 Dec 2025).
1. Formal Definitions and Notation
The chunked critic framework defines two separate chunk lengths: the critic chunk length k and a shorter policy chunk length. An action chunk of a given length is the sequence of consecutive actions starting at the current time step. The k-step chunked critic estimates the expected discounted return from executing a full k-step chunk open-loop and then continuing optimally over subsequent k-step chunks. The Distilled Partial Chunk Critic approximates the maximal value achievable by optimally completing a given partial chunk, that is, the first policy-chunk-length actions of a k-step chunk. At each decision step only these first actions are produced and executed (a closed-loop policy), and the process then repeats from the resulting state.
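The two critics described above can be written out as a sketch. The notation is assumed for illustration rather than taken from the source: $k$ is the critic chunk length, $h \le k$ the policy chunk length, $\gamma$ the discount factor, $r_{t+i}$ the reward at step $t+i$, $a_{t:t+k} = (a_t, \dots, a_{t+k-1})$ an action chunk, and $Q$ and $Q^{P}$ the chunked and partial critics.

```latex
\begin{aligned}
Q(s_t, a_{t:t+k}) &\approx \mathbb{E}\left[ \sum_{i=0}^{k-1} \gamma^{i} r_{t+i}
  + \gamma^{k} \max_{a'_{t+k:t+2k}} Q\left(s_{t+k}, a'_{t+k:t+2k}\right) \right], \\
Q^{P}(s_t, a_{t:t+h}) &\approx \max_{a_{t+h:t+k}} Q\left(s_t, \left[a_{t:t+h},\, a_{t+h:t+k}\right]\right).
\end{aligned}
```

Under this reading, the first line is the open-loop k-step backup and the second line is the quantity the distilled critic is trained to approximate.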
2. Distilled Partial-Chunk Backup and Optimistic Regression
The key insight is to perform an "optimistic" backup: for any fixed partial action chunk, the value assigned is the maximal chunked-critic value obtained by optimally completing the remainder of the chunk. Direct maximization over completions is intractable, so DQC adopts a practical approximation known as "max-plus distillation," which implements implicit-max regression via an expectile loss whose parameter is set above 0.5 to bias the estimator upward, toward the maximizing completion. Concretely, the partial critic is regressed onto the chunked critic's value for the full chunk observed in the data under the expectile (asymmetric squared-error) loss, which drives it to match the maximal value obtainable from the given state and partial chunk.
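To make the effect of the expectile parameter concrete, here is a minimal pure-Python sketch. It is not the DQC implementation: it fits a single scalar (standing in for the partial critic's output at one fixed state and partial chunk) to a handful of hypothetical teacher values, one per completion. The names `expectile_loss`, `fit_expectile`, and `completion_values`, and the numbers used, are assumptions made for illustration.

```python
def expectile_loss(pred, target, tau):
    """Asymmetric squared error: weight tau if target > pred, else 1 - tau."""
    diff = target - pred
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff * diff


def fit_expectile(targets, tau, lr=0.05, steps=5000):
    """Fit one scalar to `targets` by gradient descent on the mean expectile loss."""
    pred = sum(targets) / len(targets)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for target in targets:
            diff = target - pred
            weight = tau if diff > 0 else 1.0 - tau
            grad += -2.0 * weight * diff  # d/dpred of weight * diff**2
        pred -= lr * grad / len(targets)
    return pred


# Hypothetical teacher values Q(s, [prefix, completion]) for one fixed prefix
# and five different completions.
completion_values = [0.1, 0.4, 0.5, 0.7, 1.0]

mean_fit = fit_expectile(completion_values, tau=0.5)         # symmetric loss
optimistic_fit = fit_expectile(completion_values, tau=0.9)   # mild optimism
near_max_fit = fit_expectile(completion_values, tau=0.99)    # strong optimism
```

With a symmetric loss (tau = 0.5) the fit stays at the mean of the values, 0.54; raising tau moves it toward the maximum of 1.0 without ever enumerating completions, which is the sense in which the regression acts as an implicit max.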
3. Loss Functions, Optimization, and Policy Extraction
The DQC architecture involves three loss functions:
- Chunked Critic TD Loss: the chunked critic is trained with a standard k-step TD loss.
- Partial Critic Distillation Loss: the partial critic is trained by expectile regression toward the "teacher" chunked critic on demonstration chunks.
- Value Head Loss: the value head is trained by quantile regression (with a high quantile) to match the maximal critic value over candidate chunks.
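The TD and quantile pieces of these losses can be sketched on scalars as follows. This is an illustrative sketch, not the source's code: the function names are hypothetical, and `bootstrap_value` stands for whatever next-state estimate the critic bootstraps from (assumed here to be supplied by the caller, for example from the value head).

```python
def k_step_td_target(rewards, bootstrap_value, gamma):
    """Discounted sum of the k rewards in a chunk plus a bootstrap k steps ahead."""
    k = len(rewards)
    discounted = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted + (gamma ** k) * bootstrap_value


def td_loss(q_pred, rewards, bootstrap_value, gamma):
    """Squared error between the chunked critic's prediction and the k-step target."""
    target = k_step_td_target(rewards, bootstrap_value, gamma)
    return (q_pred - target) ** 2


def pinball_loss(pred, target, q):
    """Quantile (pinball) loss; q close to 1 penalizes under-prediction more."""
    diff = target - pred
    return q * diff if diff > 0 else (q - 1.0) * diff


# A 4-step chunk with a single reward at the last step (hypothetical numbers).
example_target = k_step_td_target([0.0, 0.0, 0.0, 1.0], bootstrap_value=2.0, gamma=0.5)
```

The single bootstrap term discounted by `gamma ** k` is what lets value information travel k steps per backup instead of one.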
The policy is not explicitly learned; instead, at test time, a set of candidate action chunks is sampled from a flow-based behavior prior, and the chunk that maximizes the distilled partial critic is selected.
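Best-of-N selection of this kind can be sketched in a few lines. The sampler and critic below are toy stand-ins, not the flow-based prior or a trained critic; `best_of_n`, `toy_prior`, and `toy_partial_critic` are hypothetical names introduced for illustration.

```python
import random


def best_of_n(state, sample_chunk, partial_critic, n):
    """Sample n candidate chunks and return the one the critic scores highest."""
    candidates = [sample_chunk(state) for _ in range(n)]
    scores = [partial_critic(state, chunk) for chunk in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index]


rng = random.Random(0)


def toy_prior(state, chunk_len=2):
    """Toy behavior prior: each action is uniform in [-1, 1]."""
    return tuple(rng.uniform(-1.0, 1.0) for _ in range(chunk_len))


def toy_partial_critic(state, chunk):
    """Toy critic: prefers chunks whose actions move a 1-D state toward 0."""
    return -abs(state + sum(chunk))


chosen = best_of_n(0.5, toy_prior, toy_partial_critic, n=32)
```

Because candidates come from the behavior prior, the selected chunk stays within the data distribution, and the critic only has to rank short chunks rather than optimize over them.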
4. Algorithmic Workflow and Pseudocode
The full DQC algorithm for offline batch RL operates as follows:
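- For each gradient step:
  - Sample trajectory chunks from the offline dataset.
  - Update the chunked critic via the k-step TD loss.
  - Update the partial critic via expectile distillation from the chunked critic.
  - Update the value head via quantile regression toward the critic's values.
- At test time, for each state:
  - Draw candidate policy-length chunks from the flow-based behavior prior.
  - Accept the chunk with the maximal partial-critic value.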
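Decoupling the critic chunk length (backup length) from the policy chunk length (policy horizon) lets the agent propagate value over long horizons with the chunked critic while acting flexibly at short horizons using the distilled partial critic.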
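The test-time half of this workflow is a receding-horizon loop: select a short chunk, execute it, observe the new state, and select again. The sketch below illustrates only that control flow on a toy 1-D environment; `run_closed_loop`, `toy_env_step`, and `toy_select_chunk` are hypothetical names, and the chunk selector is a simple hand-written rule standing in for best-of-N selection under the partial critic.

```python
def run_closed_loop(env_step, select_chunk, initial_state, total_steps, policy_chunk_len):
    """Re-select a short chunk every `policy_chunk_len` steps and execute it."""
    state = initial_state
    trajectory = [state]
    replans = 0
    t = 0
    while t < total_steps:
        chunk = select_chunk(state, policy_chunk_len)  # decision uses the latest state
        replans += 1
        for action in chunk:
            if t >= total_steps:
                break  # episode ended mid-chunk
            state = env_step(state, action)
            trajectory.append(state)
            t += 1
    return trajectory, replans


def toy_env_step(state, action):
    """Toy 1-D integrator: the action is added to the state."""
    return state + action


def toy_select_chunk(state, length):
    """Stand-in selector: spread a bounded move toward 0 over the chunk."""
    step = max(-1.0, min(1.0, -state / length))
    return [step] * length


trajectory, replans = run_closed_loop(
    toy_env_step, toy_select_chunk, initial_state=3.0, total_steps=12, policy_chunk_len=4
)
```

A shorter `policy_chunk_len` means more frequent re-planning from fresh observations, while the length of the critic's backup is unaffected by this choice.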
5. Implementation Details and Architectural Choices
Key instantiations in DQC are as follows:
- Network architecture: 4-layer MLP, 1024 units/layer, with ReLU nonlinearities.
- Critic ensembles: minimum aggregation over ensemble members for cube tasks, mean aggregation for maze and puzzle settings.
- Chunk sizes: a longer critic chunk length and a shorter policy chunk length, the latter tuned per task.
- Loss and optimism: expectile loss for distillation and quantile loss for the value head, each with an optimistic (above-median) parameter.
- Sampling: best-of-N selection over candidates from the flow-based prior (with 10 coupling steps).
- Optimization: Adam optimizer with a batch size of 4096 for stability; a target network is used.
- Engineering practices: Large batches ensure expectile regression stability and mild upward bias in the critic; the action-flow prior regularizes the chunk sampling distribution.
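The two ensemble aggregation modes mentioned above differ only in how member predictions are combined. A minimal sketch, with a hypothetical function name and made-up member values:

```python
def aggregate_ensemble(q_values, mode):
    """Combine per-member critic predictions into one value.

    "min" is the pessimistic choice; "mean" averages the members.
    """
    if mode == "min":
        return min(q_values)
    if mode == "mean":
        return sum(q_values) / len(q_values)
    raise ValueError(f"unknown aggregation mode: {mode!r}")


member_predictions = [1.0, 2.0, 3.0]  # hypothetical outputs of three critics
pessimistic = aggregate_ensemble(member_predictions, "min")
averaged = aggregate_ensemble(member_predictions, "mean")
```

Taking the minimum guards against over-estimation at the cost of extra pessimism, which is one plausible reason the choice is made per task family.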
6. Empirical Results and Comparative Analysis
DQC and the Distilled Partial Chunk Critic were evaluated on six OGBench long-horizon tasks: cube-triple, cube-quadruple, cube-octuple, humanoidmaze-giant, puzzle-4×5, and puzzle-4×6. Performance comparisons demonstrated that DQC outperformed:
- Open-loop Q-chunking (QC) using full k-step blocks.
- Naïve partial chunking (QC-NS), which simply executes only the first policy-chunk-length steps of the QC policy's chunk.
- Standard n-step return TD (NS) and 1-step TD (OS).
- Prior state-of-the-art offline RL approaches including SHARSA, IQL, HIQL, FBC, and HFBC.
Ablation studies indicated that excluding the distilled critic (the QC-NS variant) sharply reduced performance. Varying the policy chunk length showed that short policy chunks were effective, whereas long policy chunks reintroduced the optimization difficulty of open-loop policies. The method was robust to the choice of implicit loss (expectile vs. quantile) as long as mild optimism was maintained. Large batches were essential for convergence and stability on challenging tasks, and increasing the number of best-of-N candidates to 128 provided little further gain.
7. Significance and Practical Implications
The Distilled Partial Chunk Critic enables a principled decoupling between long-horizon, multi-step TD backup (efficient credit assignment) and closed-loop, reactive policy execution. By learning a partial chunk critic through optimistic max regression from a k-step chunked critic, and extracting policies via best-of-N sampling, DQC sidesteps the memorization and optimization bottlenecks inherent in open-loop chunk policies for long blocks. A plausible implication is improved scalability to more complex, sparse-reward, or long-horizon tasks without incurring prohibitive policy search or representational complexity (Li et al., 11 Dec 2025). The framework is most relevant to offline RL settings where long-term planning is essential but policy flexibility and reactivity cannot be compromised.