Macro Action Quantization (MAQ) in RL
- Macro Action Quantization (MAQ) is a reinforcement learning framework that constrains agent behavior by discretizing human demonstration trajectories into macro actions.
- It employs a conditional VQVAE to encode and quantize human action segments, enabling efficient trajectory optimization over a discrete macro action space.
- MAQ integrates with standard RL algorithms via receding-horizon control, substantially improving human-likeness as measured by DTW and Wasserstein distance.
Macro Action Quantization (MAQ) is a reinforcement learning (RL) framework designed to produce agents whose trajectories closely replicate those of human experts. Unlike standard RL approaches that typically optimize for reward alone and often yield reward-maximizing but unnatural behaviors, MAQ constrains agent behavior to human-like regions of trajectory space. This is achieved by quantizing segments of human demonstration trajectories into a discrete set of temporally extended “macro actions” via a conditional Vector-Quantized Variational Autoencoder (VQVAE). Agent policies are then defined over these macro action codes, rather than primitive actions, allowing for tractable, efficient trajectory optimization that simultaneously maximizes reward and human-likeness (Guo et al., 19 Nov 2025).
1. Trajectory Optimization for Human-Likeness
MAQ is formulated within a Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma, H, \rho_0)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the continuous action space, and $P$, $r$, $\gamma$, $H$, $\rho_0$ the transition kernel, reward, discount, horizon, and initial state distribution, respectively. Traditional RL seeks the action sequence $a_{0:H-1}$ that maximizes the expected discounted return:

$$\max_{a_{0:H-1}} \; \mathbb{E}\!\left[\sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t)\right].$$
To impose human-likeness, MAQ restricts the solution space to subsequences drawn from a dataset of human trajectory segments of fixed length $L$, constructing a set $\mathcal{D}_L$ of length-$L$ human action segments. At each decision time $t$, the problem is recast as:

$$\max_{a_{t:t+L-1} \in \mathcal{D}_L} \; \mathbb{E}\!\left[\sum_{k=0}^{L-1} \gamma^{k}\, r(s_{t+k}, a_{t+k})\right].$$
This receding-horizon approach results in a closed-loop controller executing human-consistent macro actions at each planning interval.
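The closed-loop selection can be sketched as follows. The helpers `decode_macro`, `simulate_return`, and the Gym-style `env` interface are illustrative assumptions standing in for the trained VQVAE decoder and a simulated or learned dynamics model, not the authors' implementation.

```python
import numpy as np

def receding_horizon_control(env, state, codebook, decode_macro, simulate_return,
                             num_intervals, gamma=0.99):
    """Closed-loop control sketch: at each planning interval, pick the macro action
    (codebook entry) whose decoded action segment maximizes the simulated discounted
    return, then commit to it in the real environment."""
    for _ in range(num_intervals):
        # Score every codebook entry by decoding and simulating its macro action.
        returns = []
        for code in codebook:
            actions = decode_macro(state, code)            # length-L primitive action segment
            returns.append(simulate_return(state, actions, gamma))
        best_code = codebook[int(np.argmax(returns))]

        # Execute the selected macro action, step by step, in the environment.
        for a in decode_macro(state, best_code):
            state, reward, done, info = env.step(a)        # classic Gym step signature assumed
            if done:
                return state
    return state
```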
2. Macro Action Learning via Conditional VQVAE
Macro actions are instantiated as discrete codebook entries $\{e_k\}_{k=1}^{K}$, each representing a prototypical human macro action. MAQ trains a conditional VQVAE—with encoder $E_\phi$ and decoder $D_\theta$—on demonstration pairs $(s_t, a_{t:t+L-1})$, where $a_{t:t+L-1}$ is a length-$L$ action segment conditioned on state $s_t$. The encoder produces a latent $z = E_\phi(s_t, a_{t:t+L-1})$ and assigns it to the nearest codebook entry $e_{k^*}$ with $k^* = \arg\min_k \lVert z - e_k \rVert_2$, which is then decoded back to a sequence of actions:

$$\hat{a}_{t:t+L-1} = D_\theta(s_t, e_{k^*}).$$
The VQVAE is trained via a composite loss:

$$\mathcal{L} = \big\lVert \hat{a}_{t:t+L-1} - a_{t:t+L-1} \big\rVert_2^2 + \big\lVert \mathrm{sg}[z] - e_{k^*} \big\rVert_2^2 + \beta \big\lVert z - \mathrm{sg}[e_{k^*}] \big\rVert_2^2,$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ is a commitment weight. This objective compels the codebook to capture diverse, human-like action motifs and makes subsequent planning or policy learning tractable through discretization.
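A compact PyTorch-style sketch of a conditional VQVAE with this composite loss is given below; the layer sizes, module layout, and straight-through gradient trick are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalVQVAE(nn.Module):
    """Conditional VQVAE over (state, length-L action segment) pairs (sketch)."""

    def __init__(self, state_dim, action_dim, L, num_codes=64, latent_dim=32, beta=0.25):
        super().__init__()
        self.L, self.beta = L, beta
        self.encoder = nn.Sequential(nn.Linear(state_dim + L * action_dim, 256),
                                     nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(state_dim + latent_dim, 256),
                                     nn.ReLU(), nn.Linear(256, L * action_dim))
        self.codebook = nn.Embedding(num_codes, latent_dim)

    def forward(self, state, action_seq):
        # Encode the state-conditioned action segment into a continuous latent z.
        z = self.encoder(torch.cat([state, action_seq.flatten(1)], dim=-1))

        # Nearest-neighbor assignment to a codebook entry e_{k*}.
        dists = torch.cdist(z.unsqueeze(1), self.codebook.weight.unsqueeze(0)).squeeze(1)
        k_star = dists.argmin(dim=-1)
        e = self.codebook(k_star)

        # Straight-through estimator so reconstruction gradients reach the encoder.
        z_q = z + (e - z).detach()
        recon = self.decoder(torch.cat([state, z_q], dim=-1))

        # Composite loss: reconstruction + codebook + beta-weighted commitment terms.
        loss = (F.mse_loss(recon, action_seq.flatten(1))
                + F.mse_loss(e, z.detach())               # codebook term, sg[z]
                + self.beta * F.mse_loss(z, e.detach()))  # commitment term, sg[e]
        return recon.view(-1, self.L, action_seq.shape[-1]), k_star, loss
```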
3. Receding-Horizon Macro Action Control
At execution, control is performed over codebook indices rather than primitive action vectors. In model-based settings, the planner evaluates each code by decoding its macro action for the current state and simulating the resulting reward trajectory over the next $L$ steps; the code maximizing expected return is executed in the environment. In model-free variants, a learned policy directly outputs a categorical distribution over code indices given state $s_t$, and the associated macro action is committed for $L$ steps. This structure reduces the complexity of planning in continuous action spaces, as the agent searches over a small discrete set of macro actions.
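A minimal sketch of the model-free variant, assuming a hypothetical `policy` network that returns categorical logits over code indices and a hypothetical `vqvae.decode(state, k)` helper that conditions the decoder on the selected codebook entry:

```python
import torch

@torch.no_grad()
def model_free_macro_rollout(env, state, policy, vqvae, num_intervals):
    """Sample a code index from the categorical policy, decode it into a
    length-L macro action, and commit to it for L primitive steps (sketch)."""
    for _ in range(num_intervals):
        s = torch.as_tensor(state, dtype=torch.float32)
        logits = policy(s)                                      # logits over K code indices
        k = torch.distributions.Categorical(logits=logits).sample()

        # Hypothetical helper: decode code k into an (L, action_dim) segment given s.
        actions = vqvae.decode(s, k)

        for a in actions:                                       # commit for L steps
            state, reward, done, info = env.step(a.numpy())
            if done:
                return state
    return state
```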
4. Integration with Standard RL Algorithms
MAQ converts the original continuous-control MDP into a discrete Semi-Markov Decision Process (Semi-MDP), with the action set corresponding to the macro action indices $\{1, \dots, K\}$. The agent's policy $\pi(k \mid s)$ is thus a categorical distribution over codes. Each RL step updates actor and critic (or value) functions based solely on transitions $(s_t, k_t, R_t, s_{t+L})$, where $R_t$ sums discounted rewards over the macro action's duration. Primitive-action algorithms such as SAC are replaced by their discrete-action variants (e.g., DSAC). Replay buffers, loss functions, and value targets are left unchanged beyond this action-space transformation, supporting direct deployment atop common RL frameworks with minimal modification.
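One way to realize this Semi-MDP conversion is a Gym-style wrapper that exposes a discrete action space of $K$ code indices and accumulates the discounted macro-level reward $R_t$. The `vqvae.decode` helper (assumed here to return a NumPy array of shape `(L, action_dim)`) and the classic `(obs, reward, done, info)` step signature are assumptions for illustration.

```python
import gym
import numpy as np

class MacroActionWrapper(gym.Wrapper):
    """Expose a continuous-control env as a discrete Semi-MDP whose actions are
    codebook indices; each step executes the decoded L-step macro action and
    returns the discounted sum of primitive rewards (sketch)."""

    def __init__(self, env, vqvae, num_codes, gamma=0.99):
        super().__init__(env)
        self.vqvae, self.gamma = vqvae, gamma
        self.action_space = gym.spaces.Discrete(num_codes)

    def reset(self, **kwargs):
        self._last_obs = self.env.reset(**kwargs)
        return self._last_obs

    def step(self, code_index):
        actions = self.vqvae.decode(self._last_obs, code_index)  # hypothetical decode helper
        macro_reward, done, info, obs = 0.0, False, {}, self._last_obs
        for i, a in enumerate(actions):
            obs, r, done, info = self.env.step(np.asarray(a))
            macro_reward += (self.gamma ** i) * r                # R_t: discounted within the macro
            if done:
                break
        self._last_obs = obs
        return obs, macro_reward, done, info
```

A discrete-action learner such as DSAC can then be trained on the wrapped environment without further modification to its replay buffer or losses.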
5. Quantitative Evaluation of Human-Likeness
MAQ employs two human-similarity metrics to assess the fidelity of agent trajectories:
- Dynamic Time Warping (DTW): Measures the minimum-alignment cost between two state or action sequences, quantifying temporal similarity between trajectories,
$$\mathrm{DTW}(X, Y) = \min_{\pi \in \mathcal{A}(X, Y)} \sum_{(i, j) \in \pi} \lVert x_i - y_j \rVert_2,$$
where the $x_i$ and $y_j$ may denote either states or actions and $\mathcal{A}(X, Y)$ is the set of admissible alignment paths. The average pairwise distance is reported across all human-agent trajectory pairs.
- Wasserstein Distance (WD): Computes the Earth-Mover's distance between the empirical state (or action) distributions of humans and agents,
$$\mathrm{WD}(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big],$$
where $\mu$ and $\nu$ are the empirical human and agent distributions and $\Pi(\mu, \nu)$ is the set of couplings between them.
Lower DTW and WD indicate greater similarity; scores are normalized to $[0, 1]$ for comparability. A computational sketch of both metrics follows this list.
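The sketch below illustrates how both metrics might be computed for a pair of trajectories, using a standard DTW dynamic program and SciPy's one-dimensional Wasserstein distance averaged over state dimensions; the per-dimension averaging is a simplification rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def dtw_distance(X, Y):
    """Classic O(|X||Y|) DTW between two sequences of state (or action) vectors."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(X[i - 1] - Y[j - 1])
            # Extend the cheapest of the three admissible alignment moves.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def mean_wasserstein(human_states, agent_states):
    """Average 1D Wasserstein distance over dimensions, as an illustrative proxy
    for comparing human and agent state-visitation distributions."""
    human_states, agent_states = np.asarray(human_states), np.asarray(agent_states)
    return np.mean([wasserstein_distance(human_states[:, d], agent_states[:, d])
                    for d in range(human_states.shape[1])])
```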
6. Experimental Results and Analysis
MAQ was evaluated on the D4RL Adroit benchmark across four domains (Door, Hammer, Pen, Relocate) and three RL backbones (IQL, SAC, RLPD), over a range of codebook sizes $K$ and macro lengths $L$. Across all settings, MAQ attained substantial reductions in human-agent DTW and WD relative to vanilla RL policies. For instance, on the Door task, MAQ+IQL improved the normalized state-DTW similarity score from 0.43 (IQL) to 0.84, with analogous improvements for SAC (from 0.39 to 0.80) and RLPD (from 0.06 to 0.76). Task success rates remained essentially constant; e.g., MAQ+RLPD achieved a 93% success rate on Door (vs. 96% for RLPD), but with a fourfold increase in human-likeness. Ablations over the macro length $L$ revealed monotonic gains in both similarity and success, peaking at the largest length tested, whereas codebook size had little effect, with all tested values yielding comparable performance. In a human Turing-Test (2AFC) evaluation, participants confused MAQ+RLPD trajectories with true human demonstrations 39% of the time (vs. 24% for vanilla RLPD), and MAQ+RLPD was judged more human-like than other agents in 71% of pairwise rankings, approaching the 74% rate for true humans (Guo et al., 19 Nov 2025).
7. Significance and Limitations
MAQ demonstrates that constraining RL agents to operate over quantized macro actions distilled from human experts can yield substantial improvements in human-likeness without sacrificing task performance. The framework's minimal assumptions about the agent architecture and its compatibility with existing RL algorithms position it as a practical approach for trustworthy and interpretable agent design. A plausible implication is that this strategy could generalize to other domains requiring human-consistent behavior. Notably, the sensitivity to macro-action length suggests that task-appropriate temporal abstraction is vital for balancing similarity and efficacy, while the insensitivity to codebook size indicates stability with respect to that discretization parameter. Finally, the reliance on high-quality human demonstration data is an intrinsic limitation, underscoring the need for diverse, representative datasets in applications demanding human-like RL agents (Guo et al., 19 Nov 2025).