Macro Action Quantization (MAQ) in RL

Updated 26 November 2025
  • Macro Action Quantization (MAQ) is a reinforcement learning framework that constrains agent behavior by discretizing human demonstration trajectories into macro actions.
  • It employs a conditional VQVAE to encode and quantize human action segments, enabling efficient trajectory optimization over a discrete action space.
  • MAQ integrates with standard RL algorithms through receding-horizon control, substantially improving human-likeness as measured by DTW and Wasserstein distance.

Macro Action Quantization (MAQ) is a reinforcement learning (RL) framework designed to produce agents whose trajectories closely replicate those of human experts. Unlike standard RL approaches that typically optimize for reward alone and often yield reward-maximizing but unnatural behaviors, MAQ constrains agent behavior to human-like regions of trajectory space. This is achieved by quantizing segments of human demonstration trajectories into a discrete set of temporally extended “macro actions” via a conditional Vector-Quantized Variational Autoencoder (VQVAE). Agent policies are then defined over these macro action codes, rather than primitive actions, allowing for tractable, efficient trajectory optimization that simultaneously maximizes reward and human-likeness (Guo et al., 19 Nov 2025).

1. Trajectory Optimization for Human-Likeness

MAQ is formulated within a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S},\mathcal{A},\mathcal{P},R,\gamma,T,p_0)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}\subset\mathbb{R}^d$ the continuous action space, and the remaining symbols denote the transition kernel, reward function, discount factor, horizon, and initial state distribution, respectively. Traditional RL seeks the action sequence $a^*_{0:T-1}$ that maximizes the expected discounted return:

$$a^*_{0:T-1} = \arg\max_{a_{0:T-1}\in\mathcal{A}^T} \mathbb{E}\biggl[\sum_{t=0}^{T-1}\gamma^t R(s_t,a_t)\Bigm|s_0\sim p_0;\,a_{0:T-1}\biggr].$$

To impose human-likeness, MAQ restricts the solution space to subsequences $m_t$ drawn from a dataset $\mathcal{D}$ of human trajectory segments of fixed length $H$, constructing a set $\mathcal{H}=\{m_t^{(i)}=(a_t^{(i)},\dots,a_{t+H-1}^{(i)})\}$. At each decision time $t$, the problem is recast as:

$$m_t^* = \arg\max_{m_t\in\mathcal{H}} \mathbb{E}\biggl[\sum_{i=0}^{H-1}\gamma^i R(s_{t+i},a_{t+i})\Bigm|s_t;\,m_t\biggr].$$

This receding-horizon approach results in a closed-loop controller executing human-consistent macro actions at each planning interval.
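
As a concrete illustration, this selection can be sketched as an exhaustive search over the stored human segments. The sketch below is a minimal illustration rather than the paper's implementation; in particular, the `simulate` rollout function and the array layout of the segment set are assumptions.

```python
import numpy as np

def select_macro_action(state, segments, simulate, gamma=0.99):
    """Receding-horizon selection of the human segment m_t* that maximizes
    the simulated discounted return from `state`.

    `segments` holds the set H as an array of shape (N, H, action_dim);
    `simulate(state, actions)` is a hypothetical dynamics/reward model that
    returns the per-step rewards of executing `actions` from `state`.
    """
    best_return, best_segment = -np.inf, None
    for segment in segments:
        rewards = simulate(state, segment)
        ret = sum(gamma**i * r for i, r in enumerate(rewards))
        if ret > best_return:
            best_return, best_segment = ret, segment
    return best_segment  # executed for H steps, after which the agent re-plans
```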

2. Macro Action Learning via Conditional VQVAE

Macro actions are instantiated as discrete codebook entries $e_1,\dots,e_K\in\mathbb{R}^d$, each representing a prototypical human macro action. MAQ trains a conditional VQVAE with encoder $f_\phi$ and decoder $g_\psi$ on demonstration pairs $(s_t, m_t)$, where $m_t$ is a length-$H$ action segment conditioned on state $s_t$. The encoder produces a latent $z_e = f_\phi(s_t, m_t)$ and assigns it to the nearest codebook entry $e_{k^*}$ with $k^* = \arg\min_k \|z_e - e_k\|_2$, which is then decoded back to a sequence of actions:

$$\widetilde m_t = g_\psi(s_t, e_{k^*}).$$

The VQVAE is trained via a composite loss:

$$\mathcal{L} = \|m_t - \widetilde m_t\|_2^2 + \|\mathrm{sg}[z_e] - e_{k^*}\|_2^2 + \beta\,\|z_e - \mathrm{sg}[e_{k^*}]\|_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta>0$ is a commitment weight. This objective compels the codebook to capture diverse, human-like action motifs and makes subsequent planning or policy learning tractable by discretization.
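
The loss above can be realized with a standard straight-through vector-quantization layer. The following PyTorch sketch is illustrative only: the MLP encoder/decoder, layer sizes, and the class name `StateConditionedVQVAE` are assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StateConditionedVQVAE(nn.Module):
    """Minimal conditional VQVAE over length-H action segments."""

    def __init__(self, state_dim, action_dim, horizon,
                 num_codes=16, latent_dim=32, beta=0.25):
        super().__init__()
        self.beta = beta
        # Encoder f_phi: (s_t, flattened m_t) -> latent z_e
        self.encoder = nn.Sequential(
            nn.Linear(state_dim + horizon * action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder g_psi: (s_t, quantized latent) -> reconstructed m_t
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * action_dim),
        )
        # Codebook e_1, ..., e_K
        self.codebook = nn.Embedding(num_codes, latent_dim)

    def forward(self, state, macro_action):
        # macro_action: (batch, H, action_dim)
        flat = macro_action.reshape(macro_action.size(0), -1)
        z_e = self.encoder(torch.cat([state, flat], dim=-1))
        # Nearest-neighbor quantization: k* = argmin_k ||z_e - e_k||_2
        dists = torch.cdist(z_e, self.codebook.weight)       # (batch, K)
        k_star = dists.argmin(dim=-1)
        e_k = self.codebook(k_star)
        # Straight-through estimator so gradients reach the encoder
        z_q = z_e + (e_k - z_e).detach()
        recon = self.decoder(torch.cat([state, z_q], dim=-1)).reshape_as(macro_action)
        # Composite loss: reconstruction + codebook + beta * commitment
        loss = (F.mse_loss(recon, macro_action)
                + F.mse_loss(e_k, z_e.detach())
                + self.beta * F.mse_loss(z_e, e_k.detach()))
        return recon, k_star, loss
```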

3. Receding-Horizon Macro Action Control

At execution, control is performed over codebook indices rather than primitive action vectors. In model-based settings, the planner evaluates each code $k=1,\dots,K$ by decoding its macro action for the current state and simulating its reward trajectory over the next $H$ steps; the code maximizing expected return is then executed in the environment. In model-free variants, a learned policy $\pi_\theta$ directly outputs a categorical distribution over code indices given state $s_t$, and the associated macro action is committed for $H$ steps. This structure reduces the complexity of planning in continuous action spaces, as the agent searches over a small, discrete set of macro actions.
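
A minimal sketch of the model-based variant follows, assuming a hypothetical `vqvae.decode(state, k)` that returns the macro action $g_\psi(s_t, e_k)$ and a `simulate` rollout function; neither name comes from the paper.

```python
import numpy as np

def plan_over_codes(state, vqvae, simulate, num_codes, gamma=0.99):
    """Score every codebook index by its simulated H-step return from
    `state` and return the best code and its decoded macro action."""
    returns = np.empty(num_codes)
    for k in range(num_codes):
        actions = vqvae.decode(state, k)                 # length-H action sequence
        rewards = simulate(state, actions)
        returns[k] = sum(gamma**i * r for i, r in enumerate(rewards))
    best_k = int(returns.argmax())
    return best_k, vqvae.decode(state, best_k)           # committed for H steps
```

In the model-free variant, the loop over codes is replaced by sampling $k \sim \pi_\theta(\cdot \mid s_t)$ and decoding that single code.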

4. Integration with Standard RL Algorithms

MAQ converts the original continuous-control MDP into a discrete Semi-Markov Decision Process (Semi-MDP), with the action set corresponding to macro-action indices $\{1,\dots,K\}$. The agent’s policy $\pi_\theta$ is thus a categorical distribution over codes. Each RL step updates actor and critic (or value) functions based solely on the tuple $(s_t, k_t, R_t, s_{t+H})$, where $R_t$ sums the discounted rewards accrued over the macro action’s duration. Primitive-action algorithms such as SAC are replaced by their discrete-action variants (e.g., DSAC). Replay buffers, loss functions, and value targets are left unchanged beyond this action-space transformation, supporting direct deployment atop common RL frameworks with minimal modification.
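
The bookkeeping for a single Semi-MDP transition might look as sketched below, assuming a Gymnasium-style environment and the hypothetical `decode` helper from the previous section; this is an illustration, not the paper's training code.

```python
def collect_semi_mdp_transition(env, state, code, decode, gamma=0.99):
    """Execute one macro action and return the tuple (s_t, k_t, R_t, s_{t+H}, done)
    consumed by a discrete-action RL update (e.g., DSAC)."""
    macro_return, next_state, done = 0.0, state, False
    for i, action in enumerate(decode(state, code)):
        next_state, reward, terminated, truncated, _ = env.step(action)
        macro_return += gamma**i * reward                # R_t: discounted sum within the macro action
        done = terminated or truncated
        if done:
            break
    # The critic target then bootstraps from next_state with an extra gamma**H factor.
    return state, code, macro_return, next_state, done
```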

5. Quantitative Evaluation of Human-Likeness

MAQ employs two human-similarity metrics to assess the fidelity of agent trajectories:

  • Dynamic Time Warping (DTW): Measures the minimum-alignment cost between two state or action sequences, quantifying temporal similarity between trajectories,

$$\mathrm{DTW}(\tau^H,\tau^A) = \min_{\mathcal{W}} \sum_{(i,j)\in\mathcal{W}} \|x^H_i - x^A_j\|,$$

where $x$ may denote either states or actions. The average pairwise distance is reported across all human-agent trajectory pairs.

  • Wasserstein Distance (WD): Computes the Earth-Mover’s distance between empirical state (or action) distributions of humans and agents,

$$\mathrm{WD}(\rho_{\mathrm{human}},\rho_{\mathrm{agent}}) = \min_{\pi\in\Pi(\rho_{\mathrm{human}},\rho_{\mathrm{agent}})} \mathbb{E}_{(x,y)\sim\pi}\bigl[\|x-y\|\bigr].$$

Lower DTW and WD indicate greater similarity; both metrics are normalized to $[0,1]$ for comparability.
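
For reference, both metrics admit compact implementations. The sketch below uses a standard dynamic-programming DTW and an exact assignment-based 1-Wasserstein distance for equal-size samples; it does not reproduce the paper's normalization scheme.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def dtw_distance(traj_a, traj_b):
    """O(n*m) dynamic-programming DTW between trajectories of shape (len, dim)."""
    cost = cdist(traj_a, traj_b)                          # pairwise Euclidean costs
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m]

def wasserstein_distance_samples(samples_a, samples_b):
    """1-Wasserstein distance between two equal-size empirical samples,
    computed exactly via optimal assignment on the pairwise cost matrix."""
    cost = cdist(samples_a, samples_b)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```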

6. Experimental Results and Analysis

MAQ was evaluated on the D4RL Adroit benchmark across four domains (Door, Hammer, Pen, Relocate) and three RL backbones (IQL, SAC, RLPD), using codebook sizes $K \in \{8,16,32\}$ and macro lengths $H \in \{1,\dots,9\}$. Across all settings, MAQ attained substantial reductions in human-agent DTW and WD relative to vanilla RL policies. For instance, on the Door task, MAQ+IQL improved the normalized state-DTW score from 0.43 (IQL) to 0.84, with analogous improvements for SAC (from 0.39 to 0.80) and RLPD (from 0.06 to 0.76). Task success rates remained essentially constant; e.g., MAQ+RLPD achieved a 93% success rate on Door (vs. 96% for RLPD), but with a fourfold increase in human-likeness. Ablation over $H$ revealed monotonic gains in both similarity and success, peaking at $H=9$. Codebook size had little effect, with all tested values yielding comparable performance. In a human Turing-test (2AFC) evaluation, raters confused MAQ+RLPD trajectories with true human demonstrations 39% of the time (vs. 24% for vanilla RLPD), and MAQ+RLPD was judged more human-like than other agents in 71% of pairwise rankings, approaching the 74% rate for true humans (Guo et al., 19 Nov 2025).

7. Significance and Limitations

MAQ demonstrates that constraining RL agents to operate over quantized macro actions distilled from human experts can yield substantial improvements in human-likeness without sacrificing task performance. The framework's minimal assumptions about the agent architecture and its compatibility with existing RL algorithms position it as a practical approach for trustworthy and interpretable agent design. A plausible implication is that this strategy could generalize to other domains requiring human-consistent behavior. Notably, the sensitivity to macro-action length $H$ suggests that task-appropriate temporal abstraction is vital for balancing similarity and efficacy, while the insensitivity to codebook size indicates robustness to this discretization parameter. Finally, the reliance on high-quality human demonstration data is an intrinsic limitation, underscoring the need for diverse, representative datasets in applications demanding human-like RL agents (Guo et al., 19 Nov 2025).
