Model-Based RL with Action Chunks (MAC)

Updated 4 July 2026
  • MAC is a model-based reinforcement learning method that elevates primitive actions to fixed-length chunks for long-horizon value expansion.
  • It employs chunked dynamics and reward models along with rejection sampling to mitigate compounding errors in offline settings.
  • The approach demonstrates strong performance on complex tasks while raising open challenges in adaptive chunk size and reactive control.

Model-Based RL with Action Chunks (MAC) denotes a form of offline model-based reinforcement learning in which the learned predictive interface is lifted from primitive actions to fixed-length sequences of actions. In the formulation introduced in "Scalable Offline Model-Based RL with Action Chunks" (Park et al., 8 Dec 2025), the dynamics model predicts a future state from a current state and an action chunk, model-based value expansion is performed over chunk-level imagined rollouts, and policy extraction is carried out by rejection sampling from an expressive behavioral action-chunk policy. The method is motivated by a specific long-horizon trade-off in offline model-based RL: larger value-expansion horizons reduce bootstrap bias, but one-step autoregressive world models accumulate model error over long rollouts (Park et al., 8 Dec 2025).

1. Conceptual definition and scope

In MAC, an action chunk is a fixed-length sequence of primitive actions, written as $a_{i:j} = (a_i, a_{i+1}, \dots, a_j)$. The core chunked transition interface is

$$p(s_{t+n}\mid s_t, a_{t:t+n-1}),$$

with a matching chunk policy

$$\pi(a_{t:t+n-1}\mid s_t).$$

Relative to standard one-step model-based RL, the key change is that one model call now advances the imagined trajectory by $n$ environment steps rather than one (Park et al., 8 Dec 2025).

This gives MAC a specific position within the broader literature on action abstraction. It is directly about temporal abstraction, because the learned model consumes a sequence of primitive actions and predicts a state $n$ steps later. It is not merely an action reparameterization. This distinguishes it from "Predictable MDP Abstraction for Unsupervised Model-Based RL" (Park et al., 2023), which learns a latent action space for model-based control but explicitly frames its decoder as permitting predictable actions “without temporal abstraction” (Park et al., 2023). It is also distinct from model-free chunked RL methods such as "Reinforcement Learning with Action Chunking" (Li et al., 10 Jul 2025), which run TD learning directly in a chunked action space but do not learn a dynamics model (Li et al., 10 Jul 2025).

The broader MAC concept therefore refers to model-based RL methods that reason over temporally extended action sequences as the operative control unit. The named MAC recipe of (Park et al., 8 Dec 2025) is the most explicit instantiation in the provided literature: an offline model-based actor-critic with a chunked dynamics model, a chunked reward model, long-horizon chunk-based value expansion, and critic-guided rejection sampling from a behavioral chunk prior.

2. Core formulation and algorithmic recipe

The starting point is the offline RL setting in the MDP

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, p, \mu),$$

with a fixed offline dataset $\mathcal{D}$ and no further environment interaction (Park et al., 8 Dec 2025). MAC learns two world-model components from chunked transitions $(s_t, a_{t:t+n-1}, r_t, s_{t+n})$: a chunked dynamics model

$$p_\psi(s_t,a_t) \approx s_{t+n},$$

and a chunked reward model

$$r_\phi(s_t,a_t) \approx \sum_{i=0}^{n-1}\gamma^i r_{t+i},$$

where $a_t$ is shorthand for the chunk $a_{t:t+n-1}$ (Park et al., 8 Dec 2025). Both models are fit by regression onto these targets, using chunked transitions drawn from $\mathcal{D}$.
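To make the chunked training tuples concrete, the following minimal sketch slices one trajectory into tuples of the form (state, action chunk, discounted in-chunk reward sum, state $n$ steps later). The function name `make_chunked_transitions`, the scalar list-based trajectory layout, the stride-1 overlapping windows, and the decision to store the reward model's target directly are all illustrative assumptions rather than details taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical tuple layout: (s_t, a_{t:t+n-1}, discounted reward sum, s_{t+n}).
ChunkedTransition = Tuple[float, Tuple[float, ...], float, float]


def make_chunked_transitions(
    states: Sequence[float],
    actions: Sequence[float],
    rewards: Sequence[float],
    n: int,
    gamma: float,
) -> List[ChunkedTransition]:
    """Slice one trajectory into chunked transitions.

    Assumes len(states) == len(actions) + 1 and len(rewards) == len(actions),
    i.e. states[t + 1] is the state reached after taking actions[t].
    """
    if n < 1:
        raise ValueError("chunk length n must be at least 1")
    out: List[ChunkedTransition] = []
    # Visit every start index t whose full chunk a_{t:t+n-1} fits in the trajectory.
    for t in range(len(actions) - n + 1):
        chunk = tuple(actions[t : t + n])
        # Target of the chunked reward model: sum_{i=0}^{n-1} gamma^i * r_{t+i}.
        reward_sum = sum(gamma**i * rewards[t + i] for i in range(n))
        out.append((states[t], chunk, reward_sum, states[t + n]))
    return out


# Toy trajectory with scalar states and actions (made-up numbers).
demo = make_chunked_transitions(
    states=[0.0, 1.0, 2.0, 3.0, 4.0],
    actions=[1.0, 1.0, 1.0, 1.0],
    rewards=[0.0, 0.0, 1.0, 1.0],
    n=2,
    gamma=0.5,
)
```

With $n = 1$ the same function reduces to ordinary one-step transitions, which is one way to see that chunking changes the model's interface rather than the underlying data.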

The policy side is not trained as a conventional reward-maximizing actor. Instead, MAC fits an expressive behavioral action-chunk policy by flow matching and then distills that ODE-based sampler into a one-step MLP sampler for efficiency (Park et al., 8 Dec 2025). At decision time, the policy is defined distributionally by rejection sampling: several candidate chunks are drawn from the behavioral sampler and the one the chunk critic scores highest is kept, so candidate chunks come from the behavioral model, while the critic chooses among them (Park et al., 8 Dec 2025). This is a central design choice: MAC controls out-of-distribution action selection by restricting the candidate set to BC-generated chunk proposals rather than relying primarily on explicit uncertainty penalties.
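The selection step can be sketched as a best-of-N choice over behavioral samples. This is a minimal illustration and not the paper's implementation: `select_chunk`, `toy_sampler`, and `toy_critic` are hypothetical names, the sampler stands in for the distilled one-step behavioral policy, and the number of candidates is an arbitrary choice.

```python
import random
from typing import Callable, List, Tuple

Chunk = Tuple[float, ...]


def select_chunk(
    state: float,
    sample_behavior_chunk: Callable[[float, random.Random], Chunk],
    chunk_critic: Callable[[float, Chunk], float],
    num_candidates: int,
    rng: random.Random,
) -> Chunk:
    """Draw candidate chunks from the behavioral sampler and keep the critic's favorite."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates: List[Chunk] = [
        sample_behavior_chunk(state, rng) for _ in range(num_candidates)
    ]
    # The critic only ranks chunks that the behavioral model actually proposed.
    return max(candidates, key=lambda chunk: chunk_critic(state, chunk))


# Toy stand-ins (hypothetical): chunks of length 3 with actions in [-1, 1], and a
# critic that prefers chunks whose actions sum to the current state.
def toy_sampler(state: float, rng: random.Random) -> Chunk:
    return tuple(rng.uniform(-1.0, 1.0) for _ in range(3))


def toy_critic(state: float, chunk: Chunk) -> float:
    return -abs(sum(chunk) - state)


best = select_chunk(0.5, toy_sampler, toy_critic, num_candidates=32, rng=random.Random(0))
```

Because the returned chunk is always one of the sampled candidates, the critic can never select a chunk the behavioral model would not have proposed, which is the support-restriction property the text describes.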

Value learning uses chunk-based model-based value expansion. Starting from real dataset states, MAC rolls out the rejection-sampling policy in the chunked world model for $H$ chunk steps (writing $H$ for the rollout length counted in chunks), producing imagined trajectories that span $nH$ primitive environment steps. Every state after the first is predicted by the chunked dynamics model, and every chunk reward by the chunked reward model. The state-value function and the chunk critic are each trained with their own loss within this procedure.

The practical implication is that MAC performs value expansion over long primitive horizons while only recursively applying the dynamics model at chunk resolution (Park et al., 8 Dec 2025).
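A generic chunk-level value-expansion target can be sketched as follows. This shows the standard model-based value-expansion form (imagined chunk rewards plus a bootstrapped value at the final imagined state) and is not claimed to match the paper's exact losses. The name `chunked_value_target`, the callable signatures, and the toy models are assumptions made for illustration.

```python
from typing import Callable, Tuple

Chunk = Tuple[float, ...]


def chunked_value_target(
    state: float,
    policy: Callable[[float], Chunk],
    dynamics: Callable[[float, Chunk], float],
    reward: Callable[[float, Chunk], float],
    value: Callable[[float], float],
    n: int,
    horizon: int,
    gamma: float,
) -> float:
    """Imagined return over `horizon` chunk steps, i.e. n * horizon primitive steps."""
    target = 0.0
    discount = 1.0
    for _ in range(horizon):
        chunk = policy(state)
        if len(chunk) != n:
            raise ValueError("policy must return chunks of length n")
        # The reward model is assumed to return the discounted reward sum inside the chunk.
        target += discount * reward(state, chunk)
        # One model call advances the imagined state by n primitive steps.
        state = dynamics(state, chunk)
        discount *= gamma**n
    # Bootstrap from the value estimate at the final imagined state.
    return target + discount * value(state)


# Toy check: a reward of 1 per primitive step and a zero bootstrap value give the
# plain discounted sum over n * horizon primitive steps.
gamma, n, horizon = 0.5, 2, 3
target = chunked_value_target(
    state=0.0,
    policy=lambda s: (1.0,) * n,
    dynamics=lambda s, chunk: s + sum(chunk),
    reward=lambda s, chunk: sum(gamma**i for i in range(len(chunk))),
    value=lambda s: 0.0,
    n=n,
    horizon=horizon,
    gamma=gamma,
)
```

The loop body runs `horizon` times, not `n * horizon` times, so the dynamics model is composed with itself only at chunk resolution even though the return covers the full primitive horizon.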

This reduction in recursive model depth is the main technical rationale for the method. With chunk length $n$ and a rollout of $H$ chunk steps, a one-step model must be queried $nH$ times to predict $nH$ primitive steps. MAC instead needs only $H$ recursive model transitions, because each model call predicts the state after an entire chunk. The paper argues that this changes the usual model-based value-expansion trade-off by allowing long-horizon imagined returns without equally long autoregressive model chains (Park et al., 8 Dec 2025).
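The arithmetic behind this argument can be illustrated with a deliberately crude toy. It assumes that every recursive model call inflates relative error by the same factor regardless of chunk length. That assumption is only for illustration and is not a claim from the paper, since longer chunks are themselves harder to predict. The function names and the chunk lengths below are hypothetical.

```python
def recursive_calls(primitive_horizon: int, n: int) -> int:
    """Model calls needed to cover `primitive_horizon` steps with chunk length n."""
    if n < 1 or primitive_horizon % n != 0:
        raise ValueError("primitive_horizon must be a positive multiple of n")
    return primitive_horizon // n


def compounded_error_factor(per_call_error: float, calls: int) -> float:
    """Toy error growth: each call multiplies relative error by (1 + per_call_error)."""
    return (1.0 + per_call_error) ** calls


primitive_horizon = 100  # the text mentions value targets spanning roughly this many steps
for n in (1, 5, 25):  # illustrative chunk lengths, not the paper's settings
    calls = recursive_calls(primitive_horizon, n)
    print(n, calls, round(compounded_error_factor(0.01, calls), 3))
```

Under this toy model the growth exponent is the number of recursive calls, so fewer, longer model steps shorten the chain over which errors compound. It deliberately leaves out the opposing effect, namely that per-call error tends to rise as chunks get longer.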

3. Relation to earlier and adjacent action abstractions

MAC sits at the intersection of two earlier lines of work: learned action abstractions for model-based control, and chunked action spaces for long-horizon RL. The most directly relevant model-free precursor is Q-chunking, introduced in "Reinforcement Learning with Action Chunking" (Li et al., 10 Jul 2025). Q-chunking treats a fixed-length open-loop action sequence as the effective RL action, trains a chunk-level critic $Q(s_t, a_{t:t+n-1})$, and argues that chunk-valued backups can provide unbiased multi-step TD targets because the critic conditions on the exact executed sequence (Li et al., 10 Jul 2025). MAC inherits the same basic insight, that long-horizon learning can be stabilized by elevating the control unit from a primitive action to a short open-loop sequence, but moves that idea into an explicitly model-based value-expansion setting (Park et al., 8 Dec 2025).

A different adjacent line appears in "Predictable MDP Abstraction for Unsupervised Model-Based RL" (Park et al., 2023). PMA is genuinely model-based and learns a latent action space used for downstream planning, but its latent actions are decoded into single primitive actions at the current step and the paper explicitly contrasts its goal with temporally extended skill learning, describing the method as action transformation “without temporal abstraction” (Park et al., 2023). PMA is therefore relevant to MAC as an example of learned abstract actions for model-based control, but not as an action-chunk method in the temporal sense.

Another neighboring direction is demonstration-derived chunk discovery or chunk regularization without a world model. "Learning Human-Like RL Agents Through Trajectory Optimization With Action Quantization" (Guo et al., 19 Nov 2025) distills human demonstrations into fixed-length macro actions via a conditional VQ-VAE and motivates the method with receding-horizon control, but the implementation does not learn a dynamics model and the online decision rule is best characterized as RL over a learned discrete macro-action space (Guo et al., 19 Nov 2025). Likewise, "Action abstractions for amortized sampling" (Boussif et al., 2024) mines frequent action subsequences from successful trajectories and inserts them into the action vocabulary, but it does so in model-free RL and GFlowNet settings rather than in model-based planning (Boussif et al., 2024).

These comparisons delimit MAC precisely. It is not simply any action abstraction for model-based control, and it is not simply any chunked RL method. Its distinctive identity comes from combining a learned chunk-transition model, chunk-level reward prediction, model-based value expansion, and behavioral rejection sampling in offline RL (Park et al., 8 Dec 2025).

4. Empirical profile of the MAC recipe

The empirical focus of MAC is long-horizon offline RL on large datasets. On OGBench-derived goal-conditioned tasks with datasets of up to 100M transitions, the paper reports that MAC achieves the best performance among offline model-based RL algorithms, especially on challenging long-horizon tasks (Park et al., 8 Dec 2025). Representative tasks include cube-double, cube-octuple, puzzle-3x3, and puzzle-4x5. In the same table, the F-MPC results on cube-octuple and puzzle-4x5 illustrate the gap on the hardest manipulation domains (Park et al., 8 Dec 2025).

On standard reward-based manipulation benchmarks, MAC reports averages on cube-single, cube-double, scene, puzzle-3x3, and puzzle-4x4, and the paper states that it achieves the best performance on 4/5 environments (Park et al., 8 Dec 2025). The stronger performance is concentrated in longer-horizon manipulation tasks, while humanoid locomotion-style environments remain difficult for all model-based methods considered.

The ablation results clarify what the chunk abstraction is buying. When the chunk length $n$ is varied, larger chunk sizes substantially reduce rollout model error, and the one-step model diverges over long rollouts (Park et al., 8 Dec 2025). At the same time, the policy-performance ablation shows that chunking helps only up to a point: on cube-octuple, no chunking ($n=1$) cannot solve the task at all, but excessively large chunks hurt because open-loop prediction and chunk action-value estimation both become harder (Park et al., 8 Dec 2025). The intended conclusion is not that “larger is always better,” but that chunking moves the bias/model-error trade-off into a more favorable regime when the chunk size is chosen appropriately.

The behavioral prior is equally central. Replacing the flow-based chunk policy with a Gaussian policy causes severe collapse: in the ablation table, MAC(Gau) falls far short of the full MAC configuration on cube-single, cube-double, scene, and puzzle-4x4 (Park et al., 8 Dec 2025). Distillation from the ODE flow sampler into the one-step sampler is also not optional: training the fast sampler directly with BC instead of distilling from the flow model yields major failure across the same tasks (Park et al., 8 Dec 2025). These results indicate that MAC depends not only on chunked dynamics, but also on an expressive, multi-modal chunk prior that keeps policy extraction on the support of the offline data.

Implementation details reinforce the intended scale. The default configuration uses four-layer MLPs with layer normalization and GELU activations, together with fixed settings for the learning rate, target update rate, discount factor, action chunk size, and rollout length; the chunk size and rollout length are chosen so that the value target spans roughly 100 environment steps (Park et al., 8 Dec 2025). The same chunk size $n$ is used across all tasks in the main setup.

5. Adaptive duration, execution, and reactive control around MAC

Subsequent adjacent work has concentrated less on building new chunked world models than on three unresolved issues that MAC leaves open: how long a chunk should be, how chunk execution should interact with feedback, and whether chunk reasoning must imply chunk commitment at test time.

One response is adaptive chunk-length selection. "Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning" (Gireesh et al., 7 May 2026) argues that fixed chunk size is structurally mismatched to robotics tasks and proposes selecting among several chunk lengths by comparing a per-horizon, discount-normalized advantage rather than raw $Q$-values (Gireesh et al., 7 May 2026). "ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network" (Chen et al., 10 May 2026) and "Adaptive Action Chunking via Multi-Chunk Q Value Estimation" (Shin et al., 11 May 2026) push this further with causal Transformer critics that score every prefix of a proposed chunk and choose the best execution length at each chunk boundary, again without a learned dynamics model (Chen et al., 10 May 2026, Shin et al., 11 May 2026). For MAC, these papers suggest that fixed $n$ is likely a simplifying assumption rather than a fundamental requirement.

A second line concerns reactivity under chunked execution. "SEAR: Sample Efficient Action Chunking Reinforcement Learning" (Nagy et al., 2 Mar 2026) combines large chunk sizes with receding-horizon execution by training chunk policies at a large fixed chunk size but collecting data with randomly drawn replanning intervals, and reports that training with larger chunks and evaluating with shorter replanning intervals can outperform training directly at the shorter chunk size (Nagy et al., 2 Mar 2026). "Temporal Action Selection for Action Chunking" (Weng et al., 6 Nov 2025) addresses a related problem at inference time by selecting among overlapping chunk proposals generated at different timesteps, thereby trying to recover both reactivity and motion coherence without abandoning chunk structure (Weng et al., 6 Nov 2025). The common implication is that open-loop chunk commitment and chunk-based reasoning need not coincide.
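The separation between planning a chunk and committing to it can be sketched with a generic receding-horizon loop: plan a full chunk, execute only its first few actions, then replan from the new state. This is the textbook pattern only, not the training or data-collection procedure of any cited paper. `run_receding_horizon`, `toy_plan`, `toy_env_step`, and the one-dimensional toy task are hypothetical.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[float, ...]


def run_receding_horizon(
    state: float,
    plan_chunk: Callable[[float], Chunk],
    env_step: Callable[[float, float], float],
    replan_interval: int,
    total_steps: int,
) -> List[float]:
    """Plan a full chunk, execute only its first `replan_interval` actions, then replan."""
    visited = [state]
    steps_done = 0
    while steps_done < total_steps:
        chunk = plan_chunk(state)
        if not 1 <= replan_interval <= len(chunk):
            raise ValueError("replan_interval must be between 1 and the chunk length")
        for action in chunk[:replan_interval]:
            if steps_done == total_steps:
                break
            state = env_step(state, action)
            visited.append(state)
            steps_done += 1
    return visited


TARGET = 3.0  # made-up goal position for the toy task


def toy_plan(state: float) -> Chunk:
    # Open-loop plan: repeat a unit move toward the target for four steps.
    direction = (TARGET > state) - (TARGET < state)
    return (float(direction),) * 4


def toy_env_step(state: float, action: float) -> float:
    return state + action


open_loop = run_receding_horizon(0.0, toy_plan, toy_env_step, replan_interval=4, total_steps=6)
reactive = run_receding_horizon(0.0, toy_plan, toy_env_step, replan_interval=1, total_steps=6)
```

In this toy, committing to the whole chunk overshoots the target before the next replan, while replanning every step stops at it. The same planner is used in both runs, so only the commitment length differs.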

A third line concerns training-time use of chunk abstractions without chunk-level execution at deployment. "Chunk-Guided Q-Learning" (Song et al., 14 Mar 2026) trains an auxiliary chunk critic with temporally extended backups, but regularizes a single-step critic toward that chunk critic and returns a single-step policy at test time (Song et al., 14 Mar 2026). This is conceptually significant for MAC because it shows a distinct design philosophy: plan or learn with chunks, act reactively.

Real-time deployment work sharpens the execution problem further. "Real-Time Robot Execution with Masked Action Chunking" (Wang et al., 27 Jan 2026) studies asynchronous inference for chunked policies and argues that failures arise not only from inter-chunk discontinuity but also from intra-chunk inconsistency, where the executed prefix is stale relative to current perception (Wang et al., 27 Jan 2026). Although REMAC is not model-based, its diagnosis is directly relevant to any MAC system intended for robotics deployment.

6. Limitations, controversies, and open questions

The limitations of MAC are stated plainly in its own experiments. The method remains weak on long-horizon humanoid locomotion tasks, and the paper explicitly identifies contact-rich, erratic dynamics as a remaining bottleneck for model-based methods (Park et al., 8 Dec 2025). The world model in the main recipe is a deterministic MLP, and the authors suggest that more expressive generative or latent dynamics models are a natural future direction (Park et al., 8 Dec 2025). A plausible implication is that chunking alone does not remove the need for stronger world models when dynamics are highly discontinuous or partially observed.

A second limitation is that the chunk size is fixed. MAC uses the same $n$ across tasks, and its ablations show both that chunking is essential and that excessively large chunks become difficult to evaluate and execute (Park et al., 8 Dec 2025). Later adaptive-length methods make this limitation explicit by arguing that the optimal chunk size varies across both tasks and states (Gireesh et al., 7 May 2026, Chen et al., 10 May 2026, Shin et al., 11 May 2026). This suggests that fixed chunk duration is one of the main remaining simplifications in the original MAC formulation.

A third limitation concerns feedback. MAC’s value expansion and policy extraction are chunk-centric, but the paper does not develop a detailed receding-horizon execution scheme in the style of MPC. Adjacent work on chunked control repeatedly emphasizes that long chunks reduce responsiveness, that asynchronous or delayed execution introduces train-test mismatch, and that execution-time arbitration among chunk prefixes or overlapping chunk proposals can materially affect performance (Wang et al., 27 Jan 2026, Weng et al., 6 Nov 2025, Nagy et al., 2 Mar 2026). This suggests that the full MAC problem is not exhausted by learning the chunked model $p(s_{t+n}\mid s_t, a_{t:t+n-1})$; it also includes deciding how much of a planned chunk to commit before replanning.

Finally, there is a broader conceptual controversy over what chunking is for. In MAC, chunking is primarily a device for reducing compounding model error and supporting long-horizon value expansion (Park et al., 8 Dec 2025). In Q-chunking, it is also a device for unbiased multi-step TD backups and temporally coherent exploration (Li et al., 10 Jul 2025). In MAQ, it becomes a mechanism for human-like behavior regularization (Guo et al., 19 Nov 2025). In REMAC and TAS, the focus shifts to execution reliability and reactivity (Wang et al., 27 Jan 2026, Weng et al., 6 Nov 2025). The literature therefore does not support a single universal interpretation of action chunks. Instead, it shows that the same abstraction can serve different roles: model simplification, search-space restriction, behavior prior, temporal smoothing, or real-time systems support.

In that broader sense, MAC is best understood as the model-based member of a larger action-chunking family. Its specific contribution is to show that chunked world models and chunk-restricted policy extraction can make model-based value expansion scale to much longer offline horizons than one-step autoregressive alternatives (Park et al., 8 Dec 2025). Its unresolved questions—adaptive duration, reactive execution, stronger world models, and uncertainty-aware chunk planning—define much of the surrounding research agenda.
