Ordered Action Tokenization (OAT)

Updated 8 February 2026
  • Ordered Action Tokenization (OAT) is a method that maps continuous robot action chunks into ordered discrete tokens with high compression, total decodability, and causal, left-to-right ordering.
  • It leverages transformer architectures with learnable registers and finite scalar quantization, using nested dropout to induce a coarse-to-fine refinement during autoregressive inference.
  • Empirical evaluations on simulated and real robotic tasks demonstrate that OAT improves task success rates and latency compared to traditional discretization approaches.

Ordered Action Tokenization (OAT) defines a learned method for mapping continuous, high-dimensional robot action chunks into ordered, discrete token sequences. OAT addresses fundamental limitations in prior approaches to action discretization for autoregressive (AR) sequence models in robotics, simultaneously achieving high compression, total decodability, and a left-to-right (causal) token ordering suited for next-token prediction. The methodology leverages transformer architectures with learnable registers, finite scalar quantization, and ordering-inducing training objectives to produce a token space that supports flexible, coarse-to-fine inference and efficient integration into AR policy pipelines (Liu et al., 4 Feb 2026).

1. Motivation and Prior Approaches

Autoregressive sequence models, particularly transformers, have become central in language, vision, and robotic control tasks due to their capacity for discrete abstraction and token-level reasoning. However, continuous action domains pose a significant challenge, necessitating a tokenization scheme for robot action chunks $a_{1:H_a} \in \mathbb{R}^{H_a \times D_a}$ that is compatible with AR modeling.

Three canonical approaches illustrate key trade-offs:

  • Per-dimension binning ("Bin"): Uniformly bins each scalar, producing extremely long token sequences ($H_l = H_a \cdot D_a$), offering decodability but poor efficiency and no causal token structure.
  • Frequency-domain compression ("FAST"): Utilizes Discrete Cosine Transform (DCT) and Byte-Pair Encoding (BPE) for high compression and a form of coarse-to-fine structure, but with partial, non-total decodability leading to potential reconstruction errors at inference time.
  • Learned quantized latents (e.g., VQ-VAEs): Achieve high compression and total decodability, yet lack any intrinsic token ordering, undermining AR suitability and prefix decoding.

All three prior paradigms are deficient in at least one of: efficiency (token sequence length), reliable decoding, or meaningful sequential refinement. OAT is designed to address these simultaneously (Liu et al., 4 Feb 2026).
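The sequence-length trade-off can be made concrete with a back-of-the-envelope comparison. The chunk shape and the OAT register count below are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope token counts for a single action chunk.
# H_a, D_a, and the OAT register count are illustrative assumptions.
H_a, D_a = 16, 7            # action horizon x per-step action dimension
H_l_bin = H_a * D_a         # "Bin": one token per scalar
H_l_oat = 8                 # OAT: small fixed register count (assumed)

print(f"Bin: {H_l_bin} tokens, OAT: {H_l_oat} tokens "
      f"({H_l_bin / H_l_oat:.0f}x shorter)")
```

Shorter sequences translate directly into fewer AR decoding steps per action chunk.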

2. Formal Criteria and Definition

The OAT mapping

$$\mathcal{T}:\; a_{1:H_a} \;\mapsto\; T_{1:H_l}, \qquad T_i \in \mathcal{V}$$

and inverse

$$\mathcal{T}^{-1}:\; T_{1:H_l} \;\mapsto\; \hat{a}_{1:H_a}$$

are constructed to meet three explicit desiderata:

  • High Compression: The compression ratio

$$R = \frac{H_l \cdot \log_2 |\mathcal{V}|}{H_a \cdot D_a \cdot b}$$

with $H_l \ll H_a D_a$ and $R \ll 1$, where $b$ is the bit width of a raw action scalar, permits tractable long-horizon AR modeling.

  • Total Decodability: $\mathcal{T}^{-1}$ must define a total function

$$\forall\, T_{1:H_l} \in \mathcal{V}^{H_l}, \quad \mathcal{T}^{-1}(T_{1:H_l}) \text{ is well-defined,}$$

excluding any invalid or non-reconstructible token strings.

  • Left-to-Right (Causal) Ordering: For any prefix length $k$,

$$D(1) \ge D(2) \ge \cdots \ge D(H_l) \quad\text{where}\quad D(k) = \mathbb{E}\bigl[\| a_{1:H_a} - \hat{a}^{(k)}_{1:H_a} \|^2\bigr]$$

with

$$\hat{a}^{(k)}_{1:H_a} = \mathcal{T}^{-1}\bigl(T_{1:k},\; \mathtt{MASK}_{k+1:H_l}\bigr),$$

guaranteeing that each additional token refines the reconstruction, implementing a coarse-to-fine semantics.
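Plugging illustrative numbers into the compression criterion shows how easily $R \ll 1$ is reached. The chunk shape, bit width, and codebook size below are assumptions for arithmetic only, not the paper's reported configuration:

```python
import math

# Compression ratio R = (H_l * log2|V|) / (H_a * D_a * b).
# All values are illustrative: b = bits per raw scalar (float32 here).
H_a, D_a, b = 16, 7, 32
H_l, V = 8, 1000

R = (H_l * math.log2(V)) / (H_a * D_a * b)
print(f"R = {R:.4f}")        # well below 1
assert R < 1
```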

3. OAT Architecture and Training

OAT employs a learned autoencoding framework with the following core components:

  • Transformer Encoder with Registers: The input sequence concatenates the raw action chunk with $H_l$ learnable register tokens $r_{1:H_l}$, processed by a causally-masked transformer $E_\phi$. The output $z_{1:H_l}$ serves as the latent bottleneck for quantization.
  • Finite Scalar Quantization (FSQ): Each $z_i \in \mathbb{R}^{D_l}$ is quantized independently along each of its $D_l$ dimensions with $L_d$ levels. The resulting discrete token $T_i$ is drawn from the product-quantization codebook of size $|\mathcal{V}| = \prod_{d=1}^{D_l} L_d$.
  • Ordering-Inducing Training:
    • Nested Dropout: On each minibatch, only a random prefix of $K$ tokens is unmasked; the remainder are masked, forcing the model to reconstruct from varying prefix lengths and yielding an intrinsic ordering of token importance.
    • Causal Register-to-Register Attention: Registers can attend freely to action tokens, but attention among registers is causally constrained so that register $i$ cannot access registers $j > i$.
  • Training Objective: Minimize expected squared reconstruction error between original and decoded actions, conditioned on quantized latent tokens and masks. No explicit ordering penalty is needed; monotonic refinement arises via nested dropout and causal masking.
  • Inference Algorithm: At test time, the AR policy generates $K$ tokens $T_1, \ldots, T_K$, completes the sequence with masks, and decodes the resulting $T_{1:H_l}$ to actions. This supports continuous early-exit inference, trading off fidelity against latency (Liu et al., 4 Feb 2026).
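The two quantization/ordering ingredients can be sketched in a few lines of numpy. The shapes, level counts, and MASK sentinel are assumptions; the real encoder produces $z$ from a transformer, not a random generator:

```python
import numpy as np

# Toy sketch of FSQ and nested dropout (shapes and levels assumed).
levels = np.array([7, 5, 5, 5])            # |V| = 7*5*5*5 = 875 codes

def fsq(z):
    """Finite scalar quantization: round each dim to L_d levels."""
    half = (levels - 1) / 2                # integer for odd level counts
    codes = np.round(np.tanh(z) * half) + half   # codes in [0, L_d - 1]
    radix = np.cumprod(np.r_[1, levels[:-1]])
    return (codes * radix).sum(-1).astype(int)   # mixed-radix token id

def nested_dropout(tokens, rng):
    """Keep a random prefix of length K; mask the remainder."""
    K = int(rng.integers(1, len(tokens) + 1))
    masked = tokens.copy()
    masked[K:] = -1                        # -1 stands in for MASK
    return masked, K

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))                # H_l = 8 registers, D_l = 4
ids = fsq(z)
masked, K = nested_dropout(ids, rng)
assert ids.min() >= 0 and ids.max() < levels.prod()
assert (masked[:K] == ids[:K]).all() and (masked[K:] == -1).all()
```

Odd level counts keep the quantization grid integer-valued; production FSQ implementations also handle even level counts via an offset.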

4. Policy Integration and Inference Semantics

OAT's token space enables seamless coupling with AR policy models via the factorization

$$p(T_{1:H_l} \mid o_{1:H_o}) = \prod_{i=1}^{H_l} p(T_i \mid T_{<i},\, o_{1:H_o}).$$

For any prefix $T_{1:K}$, the detokenizer reconstructs a valid action chunk $\hat{a}^{(K)}$, and the mean squared distortion $D(K)$ is non-increasing in $K$. This framework underpins flexible "anytime" inference behavior, supporting low-latency, coarse control as well as high-precision, fine-grained action generation, as determined by the chosen token prefix length.
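The anytime-inference loop can be sketched with placeholder components. Both the policy and the detokenizer below are stand-ins (uniform sampling and a fake decoder), not the paper's trained networks; only the control flow mirrors the described scheme:

```python
import numpy as np

# "Anytime" inference sketch: emit K tokens, fill the tail with MASK,
# decode. All numeric choices are illustrative assumptions.
H_l, H_a, D_a, V = 8, 16, 7, 1000
rng = np.random.default_rng(0)

def policy_step(prefix):
    """Stand-in for p(T_i | T_<i, o): samples a token uniformly."""
    return int(rng.integers(0, V))

def detokenize(tokens):
    """Stand-in for T^-1: total on every token string; MASK (-1) -> 0."""
    t = np.array(tokens, dtype=float)
    t = np.where(t < 0, 0.0, t / V)
    return np.repeat(t, H_a * D_a // H_l).reshape(H_a, D_a)

K = 4                                      # early exit after 4 tokens
tokens = []
for _ in range(K):
    tokens.append(policy_step(tokens))
tokens += [-1] * (H_l - K)                 # complete with MASK
chunk = detokenize(tokens)                 # always well-defined
assert chunk.shape == (H_a, D_a)
```

The key property being illustrated is totality: decoding never fails, regardless of where generation stops.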

5. Empirical Evaluation and Benchmarking

OAT was evaluated on over 20 tasks sampled from LIBERO, RoboMimic, MetaWorld, and RoboCasa simulation benchmarks, as well as two real-world tabletop tasks (pick-and-place ball, stack cups) on an ARX-5 arm. Comparative baselines include Bin, FAST, the QueST learned tokenizer, and the DP diffusion policy.

Key empirical findings:

  • OAT[8] achieved the highest success rates on all simulation benchmarks, e.g., 56.3% on LIBERO vs. 48.2% (QueST) and 36.6% (DP).
  • Performance improved monotonically as $K$ increased, validating the coarse-to-fine, ordered refinement.
  • Inference latency grew linearly in $K$; OAT[4] matched QueST's latency with equal or superior task success.
  • On real robots, OAT[8] outperformed all baselines (e.g., 16/20 vs. 14/20 for DP), with trajectories that visibly improved in smoothness and precision as more tokens were decoded (Liu et al., 4 Feb 2026).
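The monotone gains with larger $K$ mirror the defining property $D(1) \ge \cdots \ge D(H_l)$. That property can be illustrated with a stand-in ordered code, principal components kept in decreasing-energy order, in place of OAT's learned tokens:

```python
import numpy as np

# Distortion D(k) from a k-component SVD prefix: a toy ordered code
# standing in for OAT tokens (random data, not robot actions).
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 16))                 # 200 "chunks", dim 16
mu = A.mean(0)
U, s, Vt = np.linalg.svd(A - mu, full_matrices=False)

def D(k):
    """MSE of reconstruction from the first k (largest) components."""
    return float(np.mean((A - ((U[:, :k] * s[:k]) @ Vt[:k] + mu)) ** 2))

dists = [D(k) for k in range(1, 17)]
assert all(a >= b for a, b in zip(dists, dists[1:]))   # non-increasing
assert dists[-1] < 1e-12                               # full prefix: exact
```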

6. Ablations, Limitations, and Prospective Extensions

Ablation studies reveal:

  • Disabling nested dropout (“OAT×”) eliminates token ordering and significantly degrades performance (e.g., 56.3% → 35.2% on LIBERO).
  • Longer action horizons demand larger bottlenecks ($H_l$) for fidelity, though excessively large or small codebooks hurt performance.
  • An intermediate codebook size (~1,000 codes) optimizes the trade-off between expressivity and AR modelability.
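Since $|\mathcal{V}| = \prod_d L_d$ under FSQ, codebook sizes near that sweet spot are reached by picking per-dimension level counts whose product lands near 1,000. The candidate level values below are assumptions; the paper's exact configuration is not reproduced here:

```python
from itertools import product
from math import prod

# Enumerate 4-dimensional FSQ level tuples with |V| in [900, 1100].
sizes = sorted({prod(c) for c in product([4, 5, 7, 8], repeat=4)
                if 900 <= prod(c) <= 1100})
print(sizes)   # e.g. 5*5*5*8 = 1000
```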

Noted limitations:

  • The AR depth $K$ is fixed at deployment, though adaptive selection based on action complexity may be preferable.
  • OAT currently addresses unimodal action spaces; multimodal or hierarchical tokenization strategies remain unexplored.

Prospective extensions include:

  • Adaptive stopping rules for online KK selection.
  • Hybridization with continuous diffusion or flow-based refinements.
  • Integration into vision-language-action pipelines as a planning abstraction.
  • Application to real-time and multi-agent control domains with stringent latency and coordination requirements.

OAT establishes a principled, empirically effective bridge between continuous robot action domains and discrete AR policy architectures, delivering scalable, decodable, and causally structured token representations that unlock flexible and reliable action generation (Liu et al., 4 Feb 2026).
