
Distilled Critic for Partial Action Chunks

Updated 13 December 2025
  • The paper introduces a distilled critic that leverages long-horizon value information to improve multi-step backup efficiency and reactive policy optimization.
  • It decouples critic and policy horizons, enabling efficient value propagation while avoiding the pitfalls of open-loop action sequence commitment.
  • Experimental results on OGBench tasks show 10–30% performance improvements over traditional n-step methods and standard Q-chunking approaches.

A distilled critic for partial action chunks is a reinforcement learning construct that enables estimating the value of short action subsequences by leveraging and “distilling” value information from a higher-horizon, chunked critic. Originating from the Decoupled Q-Chunking (DQC) framework, this architecture addresses the challenge of efficient multi-step value backup without forcing the policy to commit to long open-loop action sequences, thus reconciling efficient long-horizon propagation with policy reactivity. The approach constructs a distilled partial-chunk critic via an optimistic backup operator, maximizing out over the possible completions of the current partial chunk, enabling a shorter policy chunk length while retaining the value-propagation benefits of long-horizon critics (Li et al., 11 Dec 2025).

1. Conceptual Motivation and Background

Standard temporal-difference (TD) methods bootstrap value estimates across single actions, lying at the heart of value-based RL algorithms. TD’s self-bootstrapping, however, introduces bootstrapping bias—the compounding of value estimation errors across steps. Recent approaches propose “chunked critics” that predict the expected return over sequences of actions (action “chunks”), not just single actions, which results in faster value backup and improved credit assignment for long-term dependencies.

Nevertheless, extracting a policy directly from such chunked critics is challenging. Policies trained to output entire chunks must act open-loop, unable to adjust to environment responses within a chunk, a significant limitation for tasks requiring reactivity or with extended chunk lengths. The distilled critic for partial action chunks overcomes this by decoupling the chunk horizon of the critic (h) from that of the policy (hₐ ≤ h), allowing the policy to operate over shorter, more tractable action subsequences.

2. Formal Construction

Let $Q_\phi(s, a_{0:h})$ estimate the return of an $h$-step action chunk $(a_0, \dots, a_{h-1})$, with $R_{0:h}(s_{0:h}, a_{0:h}) = \sum_{i=0}^{h-1} \gamma^i\, r(s_i, a_i)$. The standard chunked Bellman optimality equation backs up Q-values over complete $h$-step sequences:

$$Q_\phi(s_0, a_{0:h}) \approx \mathbb{E}_{s_{1:h},\, a_{h:2h}}\!\left[R_{0:h} + \gamma^h \max_{a_{h:2h}} Q_{\bar{\phi}}(s_h, a_{h:2h})\right]$$

where $Q_{\bar{\phi}}$ is a slow-moving target network. The corresponding TD error loss is:

$$L_Q(\phi) = \mathbb{E}_{(s_0, a_{0:h}, r_{0:h}, s_h) \sim D}\!\left[\left(Q_\phi(s_0, a_{0:h}) - R_{0:h} - \gamma^h V_\xi(s_h)\right)^2\right]$$

where $V_\xi$ is computed by implicit regression (e.g., quantile or expectile regression) on $Q_{\bar{\phi}}$.
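As a concrete illustration, the following minimal NumPy sketch computes the chunked TD target $R_{0:h} + \gamma^h V_\xi(s_h)$ and the squared-error loss $L_Q$ on a batch of $h$-step segments. The array shapes and function names are illustrative assumptions, not the implementation of (Li et al., 11 Dec 2025).

```python
# Minimal NumPy sketch of the chunked TD target used in L_Q (toy arrays only).
import numpy as np

def chunked_td_target(rewards, v_next, gamma=0.999):
    """Compute R_{0:h} + gamma^h * V_xi(s_h) for a batch of h-step segments.

    rewards: (batch, h) array of per-step rewards r(s_i, a_i)
    v_next:  (batch,)   array of V_xi(s_h) estimates at the chunk boundary
    """
    h = rewards.shape[1]
    discounts = gamma ** np.arange(h)                 # gamma^0, ..., gamma^(h-1)
    chunk_return = (rewards * discounts).sum(axis=1)  # R_{0:h}
    return chunk_return + gamma ** h * v_next

def chunked_td_loss(q_pred, rewards, v_next, gamma=0.999):
    """Mean squared TD error for the chunked critic Q_phi."""
    target = chunked_td_target(rewards, v_next, gamma)
    return np.mean((q_pred - target) ** 2)

# Toy usage: batch of 4 segments with chunk length h = 5.
rng = np.random.default_rng(0)
loss = chunked_td_loss(rng.normal(size=4), rng.normal(size=(4, 5)), rng.normal(size=4))
```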

The “distilled” partial-chunk Q-function $Q^P_\psi(s, a_{0:h_a})$ is defined for any $h_a < h$ as:

$$Q^P_\psi(s, a_{0:h_a}) \approx \max_{a_{h_a:h}} Q_{\bar{\phi}}(s, [a_{0:h_a}, a_{h_a:h}])$$

where $[\cdot, \cdot]$ denotes concatenation. This is implemented via an implicit maximization loss (e.g., expectile or high-quantile regression):

$$L_{\mathrm{distill}}(\psi) = \mathbb{E}_{(s, a_{0:h}) \sim D}\!\left[f^{\kappa_d}_{\mathrm{imp}}\!\left(Q_{\bar{\phi}}(s, a_{0:h}) - Q^P_\psi(s, a_{0:h_a})\right)\right]$$

where $\kappa_d \in [0.5, 1)$ controls the degree of optimism.
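The implicit maximization can be illustrated with an expectile variant of $f^{\kappa_d}_{\mathrm{imp}}$: residuals where the full-chunk target exceeds the partial-chunk estimate are up-weighted by $\kappa_d$, pushing $Q^P_\psi$ toward the maximum over completions. The sketch below is a hedged NumPy rendering of this idea; the exact implicit loss used in the paper may differ.

```python
# Expectile-style rendering of the distillation loss L_distill (illustrative only).
import numpy as np

def expectile_loss(residual, kappa):
    """f_imp^kappa(u) = |kappa - 1{u < 0}| * u^2, applied elementwise."""
    weight = np.where(residual < 0.0, 1.0 - kappa, kappa)
    return weight * residual ** 2

def distill_loss(q_full_chunk, q_partial_chunk, kappa_d=0.8):
    """Push Q^P_psi(s, a_{0:h_a}) toward an optimistic estimate of the max over
    completions of Q_phi_bar(s, a_{0:h})."""
    residual = q_full_chunk - q_partial_chunk  # Q_phi_bar(s, a_{0:h}) - Q^P_psi(s, a_{0:h_a})
    return expectile_loss(residual, kappa_d).mean()

# Toy usage on a batch of scalar critic outputs.
rng = np.random.default_rng(0)
value = distill_loss(rng.normal(size=256), rng.normal(size=256), kappa_d=0.8)
```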

3. Policy Optimization Using the Distilled Critic

Once the partial-chunk distilled critic $Q^P_\psi$ is learned, policy extraction becomes tractable for short action chunks:

  • The policy $\pi(a_{0:h_a} \mid s)$ is optimized with respect to $Q^P_\psi$, using either
    • best-of-N sampling from an expressive behavior prior $\pi_\beta$, or
    • fitting a parameterized policy $\pi_\theta$ via the objective $L_\pi(\theta) = -\mathbb{E}_{s \sim D,\, a \sim \pi_\theta}[Q^P_\psi(s, a_{0:h_a})]$.
  • Best-of-N extraction: draw $a^i_{0:h_a} \sim \pi_\beta(s)$ for $i = 1, \dots, N$, and select $a^*_{0:h_a} = \arg\max_i Q^P_\psi(s, a^i_{0:h_a})$.

This extraction scheme keeps the policy reactive and avoids committing open-loop to long action sequences.
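A minimal sketch of best-of-N extraction against the distilled partial-chunk critic follows; `sample_chunk` and `q_partial` are placeholder callables standing in for $\pi_\beta$ and $Q^P_\psi$ and are not the paper's API.

```python
# Best-of-N selection of a short action chunk under the distilled critic (sketch).
import numpy as np

def best_of_n_chunk(state, sample_chunk, q_partial, n=32):
    """Draw n candidate short chunks a_{0:h_a} from the behavior prior and
    return the one with the highest distilled partial-chunk value."""
    candidates = [sample_chunk(state) for _ in range(n)]          # each: (h_a, action_dim)
    values = np.array([q_partial(state, a) for a in candidates])  # Q^P_psi(s, a_{0:h_a})
    return candidates[int(np.argmax(values))]

# Toy usage with stand-in sampler and critic (h_a = 1, 4-dimensional actions).
rng = np.random.default_rng(0)
sample = lambda s: rng.normal(size=(1, 4))
q_hat = lambda s, a: float(-np.square(a).sum())
chunk = best_of_n_chunk(np.zeros(8), sample, q_hat, n=32)
```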

4. Practical Algorithm and Training Procedure

The following steps organize DQC with a distilled partial-chunk critic:

| Step | Description | Objective / Loss |
|------|-------------|------------------|
| 1 | Sample $h$-step segments from $D$ | Data batch $(s_0, a_{0:h}, r_{0:h}, s_h)$ |
| 2 | Update $Q_\phi$ (chunked critic) | $L_Q$ |
| 3 | Update $Q^P_\psi$ (distilled partial-chunk critic) | $L_{\mathrm{distill}}$ |
| 4 | Update $V_\xi$ (implicit value function) | $L_V$ using $f_{\mathrm{imp}}$ |
| 5 | Update target networks via Polyak averaging | -- |
| 6 | Policy extraction (test time) | Best-of-N sampling on $Q^P_\psi$ |

Parameters include the critic chunk length $h$ (e.g., 25), the policy chunk length $h_a$ (often 1), the batch size (e.g., 4096), the expectile/quantile levels for implicit regression ($\kappa_b$, $\kappa_d$), and $N$ for best-of-N sampling.
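The following Python skeleton mirrors the ordering of steps 2–5 above; the `update_*` callables are placeholders for gradient steps on $L_Q$, $L_{\mathrm{distill}}$, and $L_V$, and only the Polyak target update is spelled out concretely. This is a structural sketch, not the authors' implementation.

```python
# Structural sketch of one DQC training iteration (placeholder update callables).
import numpy as np

def polyak_update(target_params, online_params, tau=5e-3):
    """theta_bar <- (1 - tau) * theta_bar + tau * theta, per parameter array."""
    return {k: (1.0 - tau) * target_params[k] + tau * online_params[k]
            for k in target_params}

def dqc_train_step(batch, params, update_chunked_critic, update_distilled_critic,
                   update_value_fn, tau=5e-3):
    # Steps 2-4: gradient updates on L_Q, L_distill, and L_V (placeholder callables).
    params["phi"] = update_chunked_critic(params, batch)    # chunked critic Q_phi
    params["psi"] = update_distilled_critic(params, batch)  # partial-chunk critic Q^P_psi
    params["xi"] = update_value_fn(params, batch)           # implicit value function V_xi
    # Step 5: Polyak averaging of the target chunked critic Q_phi_bar.
    params["phi_bar"] = polyak_update(params["phi_bar"], params["phi"], tau)
    return params

# Toy usage: dict-of-array "parameters" and no-op updates, just to exercise the flow.
params = {name: {"w": np.zeros(3)} for name in ["phi", "phi_bar", "psi", "xi"]}
noop = lambda key: (lambda p, batch: p[key])
params = dqc_train_step(batch=None, params=params,
                        update_chunked_critic=noop("phi"),
                        update_distilled_critic=noop("psi"),
                        update_value_fn=noop("xi"))
```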

5. Advantages and Theoretical Properties

Decoupling critic and policy chunk lengths yields several advantages:

  • Efficient value propagation: using a long-horizon critic ($h \gg 1$) enables multi-step reward accumulation and mitigates bootstrapping bias (Li et al., 11 Dec 2025).
  • Policy tractability and reactivity: a shorter policy chunk ($h_a \ll h$) bypasses the need for the policy to output long open-loop sequences, which are challenging to learn and sub-optimal for reactive tasks.
  • Optimistic distillation: the distilled partial-chunk critic $Q^P_\psi$ approximates the maximal value over completions of the partial chunk, ensuring policies remain forward-looking despite executing short chunks.
  • Bounded sub-optimality: theoretical results (Theorems 4.1–4.4 in (Li et al., 11 Dec 2025)) establish that open-loop consistency and optimistic distillation together provide performance guarantees even with off-policy data.

A plausible implication is that models using distilled partial-chunk critics could see improved stability and scalability on long-horizon offline tasks, where full open-loop policies are otherwise impractical.

6. Experimental Evaluation and Implementation Details

Evaluation of DQC with distilled critics was conducted on the OGBench offline goal-conditioned RL benchmark, including tasks such as cube-triple/quadruple/octuple, humanoidmaze-giant, and puzzle-4×5/4×6. Architectures used 4 layers of 1024 units (ReLU), twin-Q ensembles (K = 2), a flow-based behavior prior $\pi_\beta$ with 10 flow steps, and best-of-N sampling ($N = 32$). Critical parameters were $h \in \{5, 25\}$, $h_a \in \{1, 5, 25\}$, $\gamma = 0.999$, implicit regression quantile $\kappa_b \in [0.5, 0.99]$, and distillation expectile $\kappa_d \in \{0.5, 0.8\}$.

DQC (e.g., $h = 25$, $h_a = 1$) outperformed Q-chunking ($h = h_a$), n-step baselines, and prior approaches (IQL, HIQL, SHARSA) by 10–30 percentage points on the hardest OGBench tasks (Li et al., 11 Dec 2025). Optimization employed $10^6$ gradient steps, batch size 4096, the Adam optimizer (learning rate $3 \times 10^{-4}$), and Polyak averaging (coefficient $5 \times 10^{-3}$).

For reproducibility or extension, it is advisable to choose $h$ to trade off bootstrapping bias against the open-loop consistency of the dataset, set $h_a$ for policy tractability (typically 1), tune the key hyperparameters ($\kappa_b$, $\kappa_d$, $N$), and maintain large batch sizes and broad dataset coverage to support strong open-loop consistency.
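For concreteness, a hypothetical configuration object collecting the hyperparameters reported above might look as follows; the field names and the specific $\kappa_b$ default are illustrative choices, not the authors' configuration schema.

```python
# Hypothetical hyperparameter container mirroring the reported DQC settings.
from dataclasses import dataclass

@dataclass
class DQCConfig:
    critic_chunk_len: int = 25    # h: long critic horizon
    policy_chunk_len: int = 1     # h_a: short, reactive policy horizon
    gamma: float = 0.999
    kappa_b: float = 0.9          # implicit-regression level; reported range [0.5, 0.99] (default here is arbitrary)
    kappa_d: float = 0.8          # distillation expectile (degree of optimism)
    best_of_n: int = 32           # N candidates drawn from the behavior prior
    batch_size: int = 4096
    learning_rate: float = 3e-4   # Adam
    polyak_tau: float = 5e-3
    gradient_steps: int = 1_000_000

config = DQCConfig()              # defaults mirror the reported h = 25, h_a = 1 setting
```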

7. Context, Adoption, and Extensions

The distilled critic for partial action chunks presents a systematic solution to the limitations encountered in multi-step value backup, especially in offline RL and long-horizon, goal-conditioned benchmarks. By separating value estimation and policy abstraction scales, it supports both more robust value propagation and practical, reactive policy learning. Adoption of such techniques is facilitated by open-source codebases (github.com/ColinQiyangLi/dqc).

Ongoing and future work may further investigate chunk horizon selection, extensions to non-stationary environments, and scaling behavior with even larger or sparser datasets. This paradigm is likely to influence algorithmic design in domains where policy reactivity and long-term credit assignment both play a crucial role (Li et al., 11 Dec 2025).

References (1)
