Ordered Action Tokenization (OAT)
- Ordered Action Tokenization (OAT) is a method that maps continuous robot action chunks into ordered discrete tokens with high compression, total decodability, and causal, left-to-right ordering.
- It leverages transformer architectures with learnable registers and finite scalar quantization, using nested dropout to induce a coarse-to-fine refinement during autoregressive inference.
- Empirical evaluations on simulated and real robotic tasks demonstrate that OAT improves task success rates and reduces inference latency compared to traditional discretization approaches.
Ordered Action Tokenization (OAT) defines a learned method for mapping continuous, high-dimensional robot action chunks into ordered, discrete token sequences. OAT addresses fundamental limitations in prior approaches to action discretization for autoregressive (AR) sequence models in robotics, simultaneously achieving high compression, total decodability, and a left-to-right (causal) token ordering suited for next-token prediction. The methodology leverages transformer architectures with learnable registers, finite scalar quantization, and ordering-inducing training objectives to produce a token space that supports flexible, coarse-to-fine inference and efficient integration into AR policy pipelines (Liu et al., 4 Feb 2026).
1. Motivation and Prior Approaches
Autoregressive sequence models, particularly transformers, have become central in language, vision, and robotic control tasks due to their capacity for discrete abstraction and token-level reasoning. However, continuous action domains pose a significant challenge, necessitating a tokenization scheme for robot action chunks that is compatible with AR modeling.
Three canonical approaches illustrate key trade-offs:
- Per-dimension binning ("Bin"): Uniformly bins each scalar, producing extremely long token sequences (one token per scalar, i.e., $H \cdot D$ tokens for a chunk of $H$ timesteps and $D$ action dimensions), offering decodability but poor efficiency and no causal token structure.
- Frequency-domain compression ("FAST"): Utilizes Discrete Cosine Transform (DCT) and Byte-Pair Encoding (BPE) for high compression and a form of coarse-to-fine structure, but with partial, non-total decodability leading to potential reconstruction errors at inference time.
- Learned quantized latents (e.g., VQ-VAEs): Achieve high compression and total decodability, yet lack any intrinsic token ordering, undermining AR suitability and prefix decoding.
All three prior paradigms are deficient in at least one of: efficiency (token sequence length), reliable decoding, or meaningful sequential refinement. OAT is designed to address these simultaneously (Liu et al., 4 Feb 2026).
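The per-dimension binning baseline can be sketched in a few lines. This is a minimal illustration with assumed bin count and action bounds, not any specific implementation; it makes the token-length blow-up concrete:

```python
import numpy as np

def bin_tokenize(chunk, n_bins=256, low=-1.0, high=1.0):
    """Uniformly bin every scalar of an (H, D) action chunk.
    One token per scalar: a chunk of H timesteps and D action
    dimensions costs H * D tokens -- decodable but very long."""
    scaled = (np.clip(chunk, low, high) - low) / (high - low)      # -> [0, 1]
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1).ravel()

def bin_detokenize(tokens, H, D, n_bins=256, low=-1.0, high=1.0):
    """Invert bin_tokenize up to quantization error (bin centers)."""
    centers = (tokens.reshape(H, D) + 0.5) / n_bins
    return centers * (high - low) + low

chunk = np.random.default_rng(0).uniform(-1, 1, size=(16, 7))  # 16 steps, 7-DoF
tokens = bin_tokenize(chunk)          # 112 tokens for one short chunk
recon = bin_detokenize(tokens, 16, 7)
```

Reconstruction error is bounded by the bin width, but the sequence length scales with both horizon and action dimensionality, which is exactly the inefficiency OAT targets.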
2. Formal Criteria and Definition
The OAT mapping $f_\theta: \mathbb{R}^{H \times D} \to \mathcal{V}^K$ and inverse $g_\theta: \mathcal{V}^K \to \mathbb{R}^{H \times D}$ are constructed to meet three explicit desiderata, where $a \in \mathbb{R}^{H \times D}$ denotes an action chunk of horizon $H$ and dimension $D$, and $\mathcal{V}$ the token vocabulary:
- High Compression: The compression ratio $\rho = \frac{H \cdot D}{K}$ satisfies $K \ll H \cdot D$ and $\rho \gg 1$, permitting tractable long-horizon AR modeling.
- Total Decodability: $g_\theta$ must define a total function on $\mathcal{V}^K$, excluding any invalid or non-reconstructible token strings.
- Left-to-Right (Causal) Ordering: For any prefix length $k \le K$, $\hat{a}^{(k)} = g_\theta(z_{1:k})$ with $\mathbb{E}\lVert a - \hat{a}^{(k)} \rVert^2 \ge \mathbb{E}\lVert a - \hat{a}^{(k+1)} \rVert^2$, guaranteeing that each additional token refines the reconstruction, implementing a coarse-to-fine semantics.
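The compression criterion can be made concrete with back-of-the-envelope numbers; the chunk shape and token count here are illustrative assumptions, not values from the paper:

```python
# Illustrative compression accounting: a 16-step, 7-DoF chunk tokenized
# per-scalar ("Bin") versus a bottleneck of K = 8 ordered tokens.
H, D, K = 16, 7, 8
bin_length = H * D            # 112 tokens under per-dimension binning
oat_length = K                # 8 tokens under an ordered bottleneck
rho = bin_length / oat_length # compression ratio H*D / K
print(f"compression ratio rho = {rho:g}x")
```

Under these assumed shapes the ordered bottleneck shortens the AR sequence by a factor of 14, which is the lever behind the latency results reported later.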
3. OAT Architecture and Training
OAT employs a learned autoencoding framework with the following core components:
- Transformer Encoder with Registers: The input sequence concatenates the raw action chunk with $K$ learnable register tokens $r_1, \dots, r_K$, processed by a causally-masked transformer encoder. The register outputs $z_1, \dots, z_K$ serve as the latent bottleneck for quantization.
- Finite Scalar Quantization (FSQ): Each register output $z_i$ is quantized independently along each of its $d$ latent dimensions with $L$ levels. The resulting discrete token is drawn from the implicit product-quantization codebook of size $L^d$.
- Ordering-Inducing Training:
- Nested Dropout: On each minibatch, only a random prefix of tokens is unmasked; the remainder are masked, forcing the model to reconstruct from varying prefix lengths and yielding an intrinsic ordering of token importance.
- Causal Register-to-Register Attention: Registers can attend freely to action tokens, but are causality-constrained among themselves so that register $r_i$ cannot access registers $r_j$ with $j > i$.
- Training Objective: Minimize expected squared reconstruction error between original and decoded actions, conditioned on quantized latent tokens and masks. No explicit ordering penalty is needed; monotonic refinement arises via nested dropout and causal masking.
- Inference Algorithm: At test time, the AR policy generates $k \le K$ tokens, completes the sequence with mask tokens, and decodes the resulting prefix $z_{1:k}$ to actions. This supports continuous early-exit inference, trading off fidelity against latency (Liu et al., 4 Feb 2026).
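The two core ingredients, scalar quantization and prefix masking, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the learned encoder and straight-through gradients are omitted, and the level count is an assumed example:

```python
import numpy as np

def fsq_quantize(z, levels=7):
    """Finite scalar quantization: bound each latent dimension with tanh,
    then round to one of `levels` uniformly spaced values in [-1, 1].
    (Actual training would pass gradients via a straight-through estimator.)"""
    half = (levels - 1) / 2            # odd `levels` keeps rounding symmetric
    return np.round(np.tanh(z) * half) / half

def nested_dropout_mask(K, rng):
    """Keep a random prefix of the K register tokens, zero out the rest.
    Reconstructing from every prefix length is what induces the
    coarse-to-fine ordering of token importance."""
    k = int(rng.integers(1, K + 1))    # prefix length ~ Uniform{1..K}
    mask = np.zeros(K)
    mask[:k] = 1.0
    return mask, k

rng = np.random.default_rng(0)
K, d = 8, 3                            # 8 registers, 3 FSQ dims each -> 7**3 codes
z = rng.normal(size=(K, d))            # stand-in for encoder register outputs
zq = fsq_quantize(z)                   # discrete-valued latents
mask, k = nested_dropout_mask(K, rng)
decoder_input = zq * mask[:, None]     # decoder only sees the sampled prefix
```

Because the retained set is always a prefix rather than a random subset, earlier registers are forced to carry the coarse, high-value information, which is what yields the left-to-right ordering without an explicit penalty.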
4. Policy Integration and Inference Semantics
OAT's token space enables seamless coupling with AR policy models: for any prefix $z_{1:k}$, the detokenizer can reconstruct a valid action chunk $\hat{a}^{(k)}$, and the mean squared distortion is guaranteed to decrease monotonically with $k$. This framework underpins flexible "anytime" inference behavior, supporting low-latency, coarse control as well as high-precision, fine-grained action generation, as determined by the chosen token prefix length.
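This monotone-refinement semantics has a familiar linear analogue: truncated SVD reconstruction, where components are intrinsically ordered by importance. The sketch below demonstrates the "longer prefix, lower distortion" property on toy data; it is an analogy only, not the learned OAT detokenizer:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(256, 16))        # toy batch of flattened action chunks
A = A - A.mean(axis=0)

# SVD yields an *ordered* basis: each additional component can only
# reduce reconstruction error, mirroring OAT's prefix-decoding guarantee.
U, S, Vt = np.linalg.svd(A, full_matrices=False)

def decode_prefix(k):
    """Reconstruct the batch from its first k ordered components."""
    return (U[:, :k] * S[:k]) @ Vt[:k]

errors = [float(np.mean((A - decode_prefix(k)) ** 2)) for k in range(1, 9)]
```

An "anytime" controller can stop at any $k$ and still act on a valid, progressively sharper reconstruction, exactly the early-exit behavior OAT exposes to the policy.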
5. Empirical Evaluation and Benchmarking
OAT was evaluated on over 20 tasks sampled from LIBERO, RoboMimic, MetaWorld, and RoboCasa simulation benchmarks, as well as two real-world tabletop tasks (pick-and-place ball, stack cups) on an ARX-5 arm. Comparative baselines include Bin, FAST, the QueST learned tokenizer, and the DP diffusion policy.
Key empirical findings:
- OAT[8] achieved the highest success rates on all simulation benchmarks, e.g., 56.3% on LIBERO vs. 48.2% (QueST) and 36.6% (DP).
- Performance improved monotonically as the decoded prefix length $k$ increased, validating the coarse-to-fine, ordered refinement.
- Inference latency grew linearly in $k$; OAT[4] matched QueST's latency with equal or superior task success.
- On real robots, OAT[8] outperformed all baselines (e.g., 16/20 vs. 14/20 for DP), with trajectories that visibly improved in smoothness and precision with additional tokens (Liu et al., 4 Feb 2026).
6. Ablations, Limitations, and Prospective Extensions
Ablation studies reveal:
- Disabling nested dropout (“OAT×”) eliminates token ordering and significantly degrades performance (e.g., 56.3% → 35.2% on LIBERO).
- Longer action horizons demand larger bottlenecks (greater token count $K$) for fidelity, though excessively large or small codebooks hurt performance.
- An intermediate codebook size (1,000 codes) optimizes the trade-off between expressivity and AR modelability.
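Since FSQ's codebook is a product over latent dimensions, the codebook-size trade-off reduces to choosing per-dimension level counts. The configurations below are illustrative assumptions bracketing the reported ~1,000-code sweet spot:

```python
import math

# FSQ codebook size = product of per-dimension level counts.
# Level choices here are assumed examples, not the paper's settings.
for levels in ([5, 5, 5], [10, 10, 10], [16, 16, 16]):
    print(levels, "->", math.prod(levels), "codes")

sweet_spot = math.prod([10, 10, 10])   # one way to land near 1,000 codes
```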
Noted limitations:
- The AR generation depth (the number of decoded tokens $k$) is fixed at deployment, though adaptive selection based on action complexity may be preferable.
- OAT currently addresses unimodal action spaces; multimodal or hierarchical tokenization strategies remain unexplored.
Prospective extensions include:
- Adaptive stopping rules for online selection of the prefix length $k$.
- Hybridization with continuous diffusion or flow-based refinements.
- Integration into vision-language-action pipelines as a planning abstraction.
- Application to real-time and multi-agent control domains with stringent latency and coordination requirements.
OAT establishes a principled, empirically effective bridge between continuous robot action domains and discrete AR policy architectures, delivering scalable, decodable, and causally structured token representations that unlock flexible and reliable action generation (Liu et al., 4 Feb 2026).