Distilled Partial Chunk Critic in Offline RL

Updated 28 February 2026
  • The Distilled Partial Chunk Critic is a novel value estimation technique that decouples long-horizon TD backup from reactive, closed-loop policy execution.
  • It employs optimistic expectile regression to distill maximal value predictions from a k-step chunked critic for shorter action blocks, ensuring stable learning.
  • Empirical results on long-horizon tasks demonstrate that the DQC framework outperforms traditional methods by balancing efficient credit assignment with flexible policy reactivity.

The Distilled Partial Chunk Critic is a value estimation approach in offline reinforcement learning that forms the core of the Decoupled Q-Chunking (DQC) framework. It enables multi-step value propagation by learning a critic over long open-loop action chunks, then distilling a partial-chunk critic that enables closed-loop, reactive policy extraction over much shorter action blocks. This design allows agents to benefit from efficient long-horizon temporal-difference (TD) backup without incurring the optimization and expressivity challenges associated with open-loop policies for long action sequences (Li et al., 11 Dec 2025).

1. Formal Definitions and Notation

The chunked critic framework defines two separate chunk lengths: the critic chunk length $k$ and the policy chunk length $l < k$. An action chunk of length $k$ is written $a_{1:k} = (a_1,\dots,a_k)$. The $k$-step chunked critic, $Q^C(s,a_{1:k})$, estimates the expected discounted return from executing $a_{1:k}$ open-loop, then continuing optimally in future $k$-chunks:

$$Q^{C}(s_{t},a_{t:t+k-1}) \approx \mathbb{E}\left[\sum_{i=0}^{k-1}\gamma^i r(s_{t+i},a_{t+i}) + \gamma^k \max_{a'_{t+k:t+2k-1}} Q^{C}(s_{t+k},a'_{t+k:t+2k-1})\right].$$

The Distilled Partial Chunk Critic, denoted $Q^P(s,a_{1:l})$, approximates the maximal value achievable by optimally completing a given partial chunk:

$$Q^P(s,a_{1:l}) \approx \max_{a_{l+1:k}} Q^C(s,[a_{1:l},a_{l+1:k}]).$$

Only the first $l$ actions are produced and executed per decision step (closed-loop policy), and the process iterates.
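To make the partial-chunk definition concrete, here is a minimal sketch for a single state with a small discrete action set, where the maximization over completions can be done exactly. The tabular array and the helper name `partial_chunk_max` are assumptions for illustration only; DQC itself targets continuous actions, where this exact maximum is not available.

```python
import numpy as np

def partial_chunk_max(q_chunk: np.ndarray, l: int) -> np.ndarray:
    """Exact Q^P for one state from a tabular Q^C.

    q_chunk has one axis per chunk step, so q_chunk[a_1, ..., a_k] = Q^C(s, a_{1:k}).
    The result is indexed by the first l actions only.
    """
    k = q_chunk.ndim
    if not 1 <= l < k:
        raise ValueError("need 1 <= l < k")
    # Maximize over the completion a_{l+1:k}, i.e. the trailing k - l axes.
    return q_chunk.max(axis=tuple(range(l, k)))

# Toy critic: k = 3 steps, 2 actions per step, Q^C = 4*a_1 + 2*a_2 + a_3.
q_c = np.arange(8, dtype=float).reshape(2, 2, 2)
q_p = partial_chunk_max(q_c, l=1)
print(q_p)  # best completion value for each choice of the first action
```

With $l=1$ the partial critic keeps one entry per first action, each holding the value of the best way to finish the chunk.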

2. Distilled Partial-Chunk Backup and Optimistic Regression

The key insight is to perform an "optimistic" backup: for any fixed partial action chunk $a_{1:l}$, the value assigned is the maximal $Q^C$ obtained by optimally completing the remainder of the chunk:

$$Q^P(s,a_{1:l}) \leftarrow \max_{a_{l+1:k}} Q^C(s,[a_{1:l},a_{l+1:k}]).$$

Direct maximization is intractable; DQC adopts a practical approximation known as "max-plus distillation," implementing implicit-max regression via expectile loss with parameter $\kappa > 0.5$ to bias the estimator upwards toward the maximizing completion. This expectile regression operates as follows:

$$L_{\text{distill}}(\psi) = \mathbb{E}_{(s,a_{1:k})\sim D}\left[f^{\kappa}_{\text{imp}}\left(Q^C_{\bar{\phi}}(s,a_{1:k}) - Q^P_{\psi}(s,a_{1:l})\right)\right],$$

where $f^{\kappa}_{\text{imp}}$ denotes the expectile squared-error loss, driving $Q^P$ toward the maximal value attainable under $Q^C$.
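The upward bias from $\kappa > 0.5$ can be seen on a handful of numbers. The sketch below assumes the standard asymmetric squared-error form of the expectile loss, $|\kappa - \mathbb{1}(u<0)|\,u^2$, which the source does not spell out; `fit_expectile` is a hypothetical helper that finds the scalar minimizing the mean loss over a fixed set of teacher values.

```python
import numpy as np

def expectile_loss(diff: np.ndarray, kappa: float) -> np.ndarray:
    """Asymmetric squared error: weight kappa when target >= prediction, else 1 - kappa."""
    weight = np.where(diff < 0, 1.0 - kappa, kappa)
    return weight * diff ** 2

def fit_expectile(targets, kappa: float, iters: int = 200) -> float:
    """Scalar q minimizing mean expectile_loss(targets - q, kappa).

    Uses the first-order condition sum_i w_i (t_i - q) = 0, iterated as a
    reweighted mean.
    """
    targets = np.asarray(targets, dtype=float)
    q = float(targets.mean())
    for _ in range(iters):
        w = np.where(targets - q < 0, 1.0 - kappa, kappa)
        q = float(np.sum(w * targets) / np.sum(w))
    return q

# Stand-ins for Q^C values of several completions of the same partial chunk.
teacher_values = [0.0, 1.0, 2.0, 10.0]
for kappa in (0.5, 0.9, 0.99):
    print(kappa, fit_expectile(teacher_values, kappa))
```

At $\kappa = 0.5$ the fit is the plain mean; as $\kappa$ approaches 1 it moves toward the largest teacher value, which is the sense in which the regression acts as a soft maximum.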

3. Loss Functions, Optimization, and Policy Extraction

The DQC architecture involves three loss functions:

  • Chunked Critic TD Loss: For $Q^C$, standard $k$-step TD loss is applied:

$$L_{\rm QC}(\phi) = \mathbb{E}\left[\left(Q_{\phi}(s,a_{1:k}) - (R_{0:k-1}+\gamma^{k} V(s_k))\right)^2\right].$$

  • Partial Critic Distillation Loss: For $Q^P$, expectile regression towards the "teacher" $Q^C$ on demonstration chunks:

$$L_{\rm distill}(\psi) = \mathbb{E}\left[f^{\kappa_d}_{\text{expectile}}\left(Q^C_{\phi}(s,a_{1:k}) - Q^P_{\psi}(s,a_{1:l})\right)\right].$$

  • Value Head Loss: For $V_\xi$, quantile regression (with high quantile) matches the maximal $Q^P$ over candidate chunks:

$$L_{V}(\xi) = \mathbb{E}\left[f^{\kappa_b}_{\text{quantile}}\left(\bar{Q}^P_{\psi}(s,a_{1:l}) - V_{\xi}(s)\right)\right].$$
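Two ingredients of these losses are easy to check numerically: the $k$-step target $R_{0:k-1} + \gamma^k V(s_k)$ and the quantile loss. The sketch below is illustrative only. It assumes $R_{0:k-1}$ is the discounted reward sum within the chunk, that no episode terminates inside a chunk, and that the quantile loss is the standard pinball loss; the function names are hypothetical.

```python
import numpy as np

def k_step_target(rewards, v_next, gamma: float) -> np.ndarray:
    """R_{0:k-1} + gamma^k * V(s_k) for a batch of chunks.

    rewards: shape (batch, k); v_next: shape (batch,), the bootstrap value at s_k.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.shape[-1]
    discounts = gamma ** np.arange(k)          # 1, gamma, ..., gamma^(k-1)
    return rewards @ discounts + gamma ** k * np.asarray(v_next, dtype=float)

def quantile_loss(diff: np.ndarray, kappa: float) -> np.ndarray:
    """Pinball loss: kappa * u for u >= 0 and (kappa - 1) * u for u < 0."""
    return np.where(diff < 0, (kappa - 1.0) * diff, kappa * diff)

# One chunk with k = 3, unit rewards, bootstrap value 10, gamma = 0.5.
print(k_step_target([[1.0, 1.0, 1.0]], [10.0], gamma=0.5))
# Under-prediction (u > 0) is penalized far more than over-prediction at high kappa.
print(quantile_loss(np.array([-1.0, 2.0]), kappa=0.97))
```

A high quantile level such as 0.97 makes under-estimating the target much more costly than over-estimating it, so $V_\xi(s)$ settles near the top of the distribution of $Q^P$ values seen at that state.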

The policy is not explicitly learned; instead, at test time, a set of $N$ action chunks $\{a_{1:l}^i\}_{i=1}^N$ is sampled from a flow-based behavior prior $\pi_\beta(\cdot|s)$, and the chunk maximizing $Q^P$ is selected:

$$a^{*} = \arg\max_{i} Q^P_{\psi}(s,a_{1:l}^i).$$
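Best-of-$N$ extraction amounts to a few lines once a sampler and a critic are available. In the sketch below, `sample_prior` and `q_partial` are hypothetical callables standing in for the flow-based prior $\pi_\beta$ and the learned $Q^P_\psi$; the toy versions are placeholders and not the networks used in DQC.

```python
import numpy as np

def best_of_n(state, sample_prior, q_partial, n: int, rng) -> np.ndarray:
    """Sample n candidate l-step chunks and return the one with the highest Q^P."""
    candidates = sample_prior(state, n, rng)   # shape (n, l, action_dim)
    scores = q_partial(state, candidates)      # shape (n,)
    return candidates[int(np.argmax(scores))]

# Toy stand-ins (assumptions): a Gaussian "prior" and a critic that
# prefers small-magnitude actions.
def toy_prior(state, n, rng):
    return rng.normal(size=(n, 5, 2))          # l = 5, action_dim = 2

def toy_q(state, chunks):
    return -np.sum(chunks ** 2, axis=(1, 2))

rng = np.random.default_rng(0)
chunk = best_of_n(np.zeros(3), toy_prior, toy_q, n=32, rng=rng)
print(chunk.shape)
```

Because candidates come from the behavior prior, the selected chunk stays within the support of the data, while the critic decides which of those candidates to execute.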

4. Algorithmic Workflow and Pseudocode

The full DQC algorithm for offline batch RL operates as follows:

  1. For each gradient step:
    • Sample trajectory chunks $(s_t,\dots,s_{t+k},\ a_t,\dots,a_{t+k-1},\ r_t,\dots,r_{t+k-1})$ from the offline dataset $D$.
    • Update $Q^C$ parameters via $k$-step TD loss.
    • Update $Q^P$ via expectile distillation from $Q^C$.
    • Update $V_\xi$ via quantile regression towards $Q^P$.
  2. At test time, for each state $s$:
    • Draw $N$ candidate $l$-step chunks from $\pi_\beta$.
    • Accept the chunk with maximal $Q^P$.

The decoupling of $k$ (backup length) and $l$ (policy horizon) permits the agent to plan long-term with $Q^C$ but act flexibly at short horizons using $Q^P$.
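Why a short policy horizon helps at execution time can be illustrated with a toy control loop. Everything in this sketch is an assumption made for illustration: a one-dimensional state, a reward of minus the distance to the origin, and a hand-written chunk selector in place of best-of-$N$ selection under $Q^P$. It only shows the mechanics of re-selecting a chunk every $l$ steps.

```python
def env_step(state: float, action: float):
    """Toy dynamics: move by `action`; reward is minus the distance to the origin."""
    new_state = state + action
    return new_state, -abs(new_state)

def toward_origin(state: float, l: int):
    """Toy selector: commit to l identical unit steps toward the origin."""
    step = -1.0 if state > 0 else (1.0 if state < 0 else 0.0)
    return [step] * l

def rollout(state: float, l: int, horizon: int):
    """Execute l-step chunks, re-selecting from the newly observed state each time."""
    total, t = 0.0, 0
    while t < horizon:
        for action in toward_origin(state, l):
            state, reward = env_step(state, action)
            total += reward
            t += 1
            if t >= horizon:
                break
    return state, total

print(rollout(3.0, l=1, horizon=5))  # re-plans every step and stops at the origin
print(rollout(3.0, l=5, horizon=5))  # commits to 5 steps and overshoots
```

The run with $l=1$ reacts as soon as the origin is reached, while the run with $l=5$ keeps executing its committed actions and overshoots, ending with a lower return.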

5. Implementation Details and Architectural Choices

Key instantiations in DQC are as follows:

  • Network architecture: 4-layer MLP, 1024 units/layer, with ReLU nonlinearities.
  • Critic ensembles: Size $K=2$; minimum ensembling for cube tasks, mean aggregation for maze/puzzle settings.
  • Chunk sizes: $k=25$ (critic), $l \in \{1,5\}$ (policy, tuned per task).
  • Loss and optimism: Expectile with $\kappa_d \approx 0.8$ for distillation, quantile with $\kappa_b \approx 0.97$ for value head.
  • Sampling: Best-of-N selection with $N=32$ candidates from the flow-based prior ($\pi_\beta$ with 10 coupling steps).
  • Optimization: Adam optimizer, learning rate $3 \times 10^{-4}$, batch size $4096$ for stability, target network update $\lambda = 5 \times 10^{-3}$, $\gamma=0.999$.
  • Engineering practices: Large batches ensure expectile regression stability and mild upward bias in the critic; the action-flow prior regularizes the chunk sampling distribution.
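For reference, the reported settings can be collected in a single configuration object. This is a hypothetical container, not the authors' code: the field names are invented, the default `policy_chunk_l=5` is one of the two reported values, and the validation rules only encode constraints stated in this article ($l < k$, optimism levels above 0.5).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DQCConfig:
    hidden_layers: int = 4
    hidden_units: int = 1024
    ensemble_size: int = 2
    critic_chunk_k: int = 25
    policy_chunk_l: int = 5            # reported as l in {1, 5}, tuned per task
    kappa_distill: float = 0.8         # expectile level for Q^P distillation
    kappa_value: float = 0.97          # quantile level for the value head
    best_of_n: int = 32
    prior_coupling_steps: int = 10
    learning_rate: float = 3e-4
    batch_size: int = 4096
    target_update_rate: float = 5e-3
    discount: float = 0.999

    def __post_init__(self):
        if not 1 <= self.policy_chunk_l < self.critic_chunk_k:
            raise ValueError("policy chunk length l must satisfy 1 <= l < k")
        for kappa in (self.kappa_distill, self.kappa_value):
            if not 0.5 < kappa < 1.0:
                raise ValueError("optimism levels must lie in (0.5, 1)")

    @property
    def bootstrap_discount(self) -> float:
        """gamma^k, the factor applied to the bootstrap value in the k-step target."""
        return self.discount ** self.critic_chunk_k

cfg = DQCConfig()
print(cfg.bootstrap_discount)
```

Keeping $k$ and $l$ as separate fields mirrors the decoupling at the heart of the method: changing the policy horizon does not touch the backup length.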

6. Empirical Results and Comparative Analysis

DQC and the Distilled Partial Chunk Critic were evaluated on six OGBench long-horizon tasks: cube-triple, cube-quadruple, cube-octuple, humanoidmaze-giant, puzzle-4×5, and puzzle-4×6. Performance comparisons demonstrated that DQC outperformed:

  • Open-loop Q-chunking (QC) using full $k$-step blocks.
  • Naïve partial chunking (QC-NS), which simply executes $l$ steps of the QC policy.
  • Standard $n$-step return TD (NS) and 1-step TD (OS).
  • Prior state-of-the-art offline RL approaches including SHARSA, IQL, HIQL, FBC, and HFBC.

Ablation studies indicated that excluding the distilled critic (QC-NS variant) sharply reduced performance, particularly with $l>1$. Varying the policy chunk length $l$ showed that $l=1$ and $l=5$ were both effective, but large $l$ reintroduced the optimization difficulty of open-loop policies. The method was robust to the implicit loss type (expectile vs. quantile) as long as mild optimism ($\kappa > 0.5$) was maintained. Large batches ($\geq 4096$) were essential for convergence and stability on challenging tasks. Best-of-$N$ sampling typically used $N=32$; increasing $N$ to $128$ provided little further gain.

7. Significance and Practical Implications

The Distilled Partial Chunk Critic enables a principled decoupling between the benefits of long-horizon, multi-step TD backup (efficient credit assignment) and the closed-loop, reactive nature of deep RL policy execution. By learning a partial chunk critic through optimistic max regression from a $k$-step chunked critic, and extracting policies via best-of-N sampling, DQC sidesteps the memorization and optimization bottlenecks inherent in open-loop chunk policies for long blocks. A plausible implication is improved scalability to more complex, sparse-reward, or long-horizon tasks without incurring prohibitive policy search or representational complexity (Li et al., 11 Dec 2025). This framework represents a significant development for offline RL, particularly in contexts where long-term planning is essential, yet policy flexibility and reactivity cannot be compromised.

References (1)
1.