Reinforcement Learning with Action Chunking

Published 10 Jul 2025 in cs.LG, cs.AI, cs.RO, and stat.ML | (2507.07969v1)

Abstract: We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.

Authors (3)

Summary

The paper introduces Q-chunking, enabling RL to predict action sequences for enhanced exploration in long-horizon, sparse-reward settings.
It develops QC and QC-FQL algorithms that apply behavior constraints and unbiased n-step TD backups to improve learning stability.
Experimental results demonstrate that Q-chunking outperforms previous methods on complex manipulation tasks with superior temporal coherence.

Offline-to-Online RL via Action Chunking

The paper introduces Q-chunking, an approach designed to improve RL algorithms for long-horizon, sparse-reward tasks, particularly in the offline-to-online setting. This setting aims to maximize online sample efficiency by leveraging an offline prior dataset. Effective exploration and sample-efficient learning are central challenges, as the optimal use of offline data for acquiring a good exploratory policy remains unclear.

The key insight of the paper is the application of action chunking to TD-based RL methods. Action chunking, where policies predict sequences of future actions instead of single actions at each timestep, has been effective in imitation learning. Q-chunking runs RL in a 'chunked' action space, allowing the agent to leverage temporally consistent behaviors from offline data for better online exploration and to use unbiased $n$-step backups for more stable and efficient TD learning.

The core contributions are:

Q-learning on Temporally Extended Action Space: The policy predicts a sequence of actions for the next h steps, executed open-loop. The critic evaluates the value of the entire sequence rather than a single action.
Behavior Constraints for Coherent Exploration: A behavior constraint is imposed on the policy, regularizing it towards prior behavior data to generate temporally coherent actions. This leverages non-Markovian structure in the offline data.
QC and QC-FQL Algorithms: Two practical offline-to-online RL algorithms are instantiated from the Q-chunking recipe. QC uses an implicit KL divergence constraint implemented via best-of-N sampling. QC-FQL uses a 2-Wasserstein distance constraint, leveraging the optimal transport framework.
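The first and third contributions can be illustrated together with a minimal sketch. It is not the authors' implementation: `best_of_n_chunk`, `rollout_open_loop`, and the callables `sample_chunk`, `q_value`, and `step` are hypothetical names introduced here, and actions are scalars for brevity. The sketch assumes best-of-N sampling means drawing N candidate chunks from a behavior policy and keeping the one the critic scores highest, which keeps the selected chunk close to the behavior data without an explicit penalty term.

```python
from typing import Callable, List, Tuple

Chunk = List[float]  # h scalar actions; real tasks would use action vectors


def best_of_n_chunk(
    state: float,
    sample_chunk: Callable[[float], Chunk],    # hypothetical behavior policy
    q_value: Callable[[float, Chunk], float],  # hypothetical chunk-level critic
    n: int,
) -> Chunk:
    """Draw n candidate chunks and return the one with the highest Q-value."""
    candidates = [sample_chunk(state) for _ in range(n)]
    return max(candidates, key=lambda chunk: q_value(state, chunk))


def rollout_open_loop(
    state: float,
    step: Callable[[float, float], Tuple[float, float]],  # (state, action) -> (next_state, reward)
    select_chunk: Callable[[float], Chunk],
    num_chunks: int,
) -> Tuple[float, float]:
    """Execute each selected chunk open-loop: no re-planning inside a chunk."""
    total_reward = 0.0
    for _ in range(num_chunks):
        chunk = select_chunk(state)  # the policy is queried once per chunk
        for action in chunk:         # all h actions run without new decisions
            state, reward = step(state, action)
            total_reward += reward
    return state, total_reward
```

The policy is queried only once every h environment steps, so any temporal structure present in a sampled chunk is preserved during execution.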

The paper emphasizes that the Q-chunking value backup is similar to $n$-step returns but avoids bias because the $Q$-function takes the entire action sequence into account. This is shown mathematically by comparing the TD backup equations for standard 1-step TD, $n$-step return, and Q-chunking:

$Q(s_t, a_t) \leftarrow r_t + \gamma Q(s_{t+1}, a_{t+1}) \quad \text{(standard 1-step TD)}$

$Q(s_t, a_t) \leftarrow \underbrace{\sum_{t'=t}^{t+h-1} \left[\gamma^{t'-t}r_{t'}\right]}_{\text{biased}} + \gamma^h Q(s_{t+h}, a_{t+h}) \quad \text{($n$-step return, $n=h$)}$

$Q(s_t, a_{t:t+h}) \leftarrow \underbrace{\sum_{t'=t}^{t+h-1} \left[\gamma^{t'-t}r_{t'}\right]}_{\text{unbiased}} + \gamma^h Q(s_{t+h}, a_{t+h:t+2h}) \quad \text{(Q-chunking)}$
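As a worked instance of the Q-chunking backup, the sketch below computes the target for a single transition: the discounted sum of the h rewards collected while executing one action chunk, plus the discounted critic value of the next state and next chunk. The function name `chunked_td_target` and the numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def chunked_td_target(rewards: Sequence[float], gamma: float, next_q: float) -> float:
    """Return sum_i gamma^i * rewards[i] + gamma^h * next_q, where h = len(rewards).

    `rewards` are the h rewards observed while executing one action chunk;
    `next_q` stands in for the critic's value of the next state and next chunk.
    """
    h = len(rewards)
    discounted_rewards = sum((gamma ** i) * r for i, r in enumerate(rewards))
    return discounted_rewards + (gamma ** h) * next_q


# Sparse reward arriving on the last step of a 3-step chunk.
target = chunked_td_target(rewards=[0.0, 0.0, 1.0], gamma=0.5, next_q=2.0)
print(target)  # 0.25 * 1.0 + 0.125 * 2.0 = 0.5
```

With a chunk of length 1 the function reduces to the standard 1-step TD target. The arithmetic matches an $n$-step return; what differs is that the critic being regressed is conditioned on exactly the actions that produced those rewards.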

Experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior offline-to-online methods on long-horizon, sparse-reward manipulation tasks. Specifically, QC achieves state-of-the-art performance on the two hardest OGBench domains, cube-triple and cube-quadruple. The paper also quantifies the temporal coherency of actions: QC exhibits higher action temporal coherency than BFN, measured by the average $L_2$ norm of the difference vector between two adjacent end-effector positions.
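The coherency measure described above can be sketched in a few lines. This is one possible reading of the metric, assuming a trajectory is logged as a sequence of 3-D end-effector positions; the function name `mean_adjacent_displacement` and the sample trajectory are hypothetical, and the paper's exact logging protocol is not reproduced here.

```python
import math
from typing import Sequence, Tuple

Position = Tuple[float, float, float]


def mean_adjacent_displacement(positions: Sequence[Position]) -> float:
    """Average L2 norm of the difference between consecutive positions."""
    if len(positions) < 2:
        raise ValueError("need at least two positions")
    # math.dist (Python 3.8+) returns the Euclidean distance between two points.
    steps = [math.dist(p, q) for p, q in zip(positions, positions[1:])]
    return sum(steps) / len(steps)


trajectory = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
print(mean_adjacent_displacement(trajectory))  # (5 + 12) / 2 = 8.5
```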

The paper addresses the question of how action chunk length affects performance. Experiments show that a higher action chunk length generally helps, but not significantly.

Implications of the research include a practical and easily implementable method for improving sample efficiency in offline-to-online RL. This approach has the potential to accelerate the development of RL agents for complex robotic tasks.

Future research directions include developing mechanisms for automatically determining chunk boundaries and exploring techniques for training more general non-Markovian policies for online exploration.
