FAST: Frequency-Space Action Sequence Tokenization
- FAST is a compression-based action tokenization method that uses DCT, quantization, and BPE to transform continuous robotic trajectories into compact, invertible tokens.
- It employs a column-first flattening approach and quantile normalization to maintain spatio-temporal coherence and improve autoregressive performance in Transformer models.
- FAST⁺ generalizes across diverse robot morphologies, leading to efficient policy learning and reductions in sample complexity and compute in high-frequency, dexterous tasks.
Frequency-Space Action Sequence Tokenization (FAST) is a compression-based action discretization method designed for effective integration with Transformer-based vision-language-action (VLA) models. FAST addresses fundamental limitations of conventional per-dimension, per-timestep binning, particularly in representing high-frequency and dexterous robot action trajectories. By combining quantile normalization, the discrete cosine transform (DCT), scalar quantization, and byte-pair encoding (BPE), FAST produces compact, invertible, and tunable discrete token streams from continuous robot control signals, significantly reducing the autoregressive horizon and sample complexity. The methodology facilitates efficient policy learning in vision-language-action models, supporting both generalist and specialized policies operating over diverse robotic morphologies and control rates (Pertsch et al., 16 Jan 2025).
1. Discrete Cosine Transform Formulation
FAST utilizes the type-II discrete cosine transform (DCT-II) for encoding each action channel into the frequency domain. For a length-$H$ action sequence $a_0, \dots, a_{H-1}$, the transformation is given by:

$$X_k = \sum_{t=0}^{H-1} a_t \cos\!\left[\frac{\pi}{H}\left(t + \tfrac{1}{2}\right)k\right], \qquad k = 0, \dots, H-1.$$
Invertibility is maintained via the type-III DCT (DCT-III):

$$a_t = \frac{2}{H}\left[\frac{X_0}{2} + \sum_{k=1}^{H-1} X_k \cos\!\left(\frac{\pi}{H}\left(t + \tfrac{1}{2}\right)k\right)\right], \qquad t = 0, \dots, H-1.$$
This procedure is applied independently to each of the $D$ action dimensions over temporal windows ("chunks") of $H$ steps, with $H$ typically corresponding to one second of control signal sampled at the robot's control frequency. This frequency-space conversion is pivotal for reducing redundancy in highly smooth or periodic action sequences, especially at high sampling rates.
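The transform pair above can be sketched with SciPy's DCT routines, which implement exactly these type-II/type-III formulas; the chunk size and action dimensionality below are illustrative assumptions, not values from the released code.

```python
# Sketch of FAST's per-dimension frequency transform (DCT-II forward,
# DCT-III inverse) on a toy action chunk; shapes are assumptions.
import numpy as np
from scipy.fft import dct, idct

H = 50                                   # chunk length: ~1 s at 50 Hz
rng = np.random.default_rng(0)
actions = rng.standard_normal((H, 7))    # (timesteps, action dims)

# DCT-II along the time axis, one independent transform per action dimension.
coeffs = dct(actions, type=2, axis=0, norm="ortho")

# idct with type=2 applies the matching DCT-III inverse, recovering the chunk.
reconstructed = idct(coeffs, type=2, axis=0, norm="ortho")
assert np.allclose(actions, reconstructed)
```

The orthonormal scaling (`norm="ortho"`) makes the roundtrip exact up to floating-point error, which is what makes the tokenization invertible before quantization is applied.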
2. Quantization, Discretization, and Compression
After DCT transformation, FAST processes the resulting coefficient matrix via the following workflow:
- Quantile Normalization: Each action dimension's values are scaled so the 1st and 99th percentiles map to a fixed range (e.g., $[-1, 1]$), standardizing across diverse action scales.
- Scalar Quantization: All coefficients are multiplied by a scale hyperparameter $\gamma$ and rounded to the nearest integer:

$$\hat{X}_{k,i} = \operatorname{round}\!\left(\gamma \, X_{k,i}\right).$$

$\gamma$ governs the fidelity–compression trade-off: higher $\gamma$ yields finer quantization and more tokens; lower $\gamma$ induces coarser approximation and shorter sequences.
- Column-First Flattening: The quantized coefficient matrix is linearized such that all low-frequency coefficients across dimensions are sequenced before higher-frequency terms. This empirically improves Transformers’ autoregressive rollout stability.
- Byte-Pair Encoding: A BPE tokenizer is fit to these integer sequences, merging runs of zeros and other frequent patterns. BPE compression typically shortens sequences severalfold versus naive binning, yielding the final discrete token stream.
The combination of DCT-domain sparsity, quantization control, and BPE achieves significant rate–distortion performance improvements without reliance on neural tokenizers. Empirically, appropriate choices of $\gamma$ and chunk length yield sub-millimeter action reconstruction RMSE across diverse tasks.
3. The FAST⁺ Universal Tokenizer
FAST⁺ extends FAST by providing a pretrained, architecture-agnostic universal action tokenizer:
- Training Data: 1,000,000 one-second action sequences drawn from a mixture of single-arm, bi-manual, and mobile robots, spanning joint-space and end-effector control at rates from 5 to 50 Hz.
- Objective: BPE is trained on quantized DCT-flattened coefficients; no networks beyond the BPE merge table are learned.
- Generalization: FAST⁺ applies to novel robot morphologies and frequencies, achieving a 2× or greater token-count reduction over naive binning with no loss in reconstruction accuracy.
- Deployment: Exposed through the HuggingFace AutoProcessor API, enabling black-box application in minimal code.
Without retraining, FAST⁺ compresses unseen action streams efficiently, supporting its designation as a universal robot action tokenizer.
4. Integration with Autoregressive Vision-Language-Action Models
FAST integrates seamlessly with Transformer-based VLAs—such as π₀, PaliGemma-3B, and Prismatic-7B—by substituting the least used tokens in the model’s vocabulary with FAST BPE tokens. The input sequence at training and inference comprises:
- [image tokens]
- [language instruction tokens]
- [proprioceptive tokens]
- [action tokens to be predicted (FAST tokens)]
Standard 1D positional encodings are used throughout the sequence, with FAST tokens occupying a contiguous tail region; no 2D encodings are added for time/frequency. The models employ standard next-token prediction with cross-entropy loss, and DCT inversion is performed offline after decoding. No auxiliary regression heads or additional objective terms are needed (Pertsch et al., 16 Jan 2025).
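The vocabulary-substitution step can be sketched as follows; the vocabulary sizes and toy frequency statistics are assumptions, not values from the paper, and the point is only the mechanism of reserving the least-used text-token ids for action tokens.

```python
# Sketch of mapping FAST token ids into a VLA's existing vocabulary by
# reusing its least-frequent token ids (all sizes/counts here are toy values).
import numpy as np

vocab_size = 32000          # assumed pretrained VLM vocabulary size
n_action_tokens = 2048      # assumed FAST BPE vocabulary size

rng = np.random.default_rng(2)
token_counts = rng.integers(0, 10_000, size=vocab_size)  # toy corpus stats

# Reserve the least-used ids for action tokens.
reserved = np.argsort(token_counts)[:n_action_tokens]
fast_to_vlm = {fast_id: int(vlm_id) for fast_id, vlm_id in enumerate(reserved)}

# An action chunk encoded by FAST then becomes ordinary text-model tokens,
# trained with the same next-token cross-entropy loss as language tokens.
fast_tokens = [3, 17, 0, 42]
vlm_tokens = [fast_to_vlm[t] for t in fast_tokens]
```

Because the remapping is a bijection on the reserved ids, decoding simply inverts the dictionary before running the DCT inversion offline.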
5. Empirical Performance and Policy Learning Outcomes
FAST achieves substantial token compression and improved policy learning efficiency, especially in high-frequency or dexterous manipulation contexts.
| Dataset | Action dims ($D$) | Control Hz | Naïve tokens | FAST tokens | Compression |
|---|---|---|---|---|---|
| BridgeV2 | 7 | 5 | 35 | 20 | 1.8× |
| DROID | 7 | 15 | 105 | 29 | 3.6× |
| Table Bussing | 7 | 20 | 140 | 28 | 5.0× |
| T-Shirt Fold | 14 | 50 | 700 | 53 | 13.2× |
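The naïve token counts in the table follow directly from one token per action dimension per timestep over a one-second chunk, so the compression ratios can be checked by arithmetic:

```python
# Worked check of the table's token counts: naive binning emits D * Hz tokens
# per one-second chunk, versus the reported FAST token counts.
rows = {
    "BridgeV2":      (7, 5, 20),
    "DROID":         (7, 15, 29),
    "Table Bussing": (7, 20, 28),
    "T-Shirt Fold":  (14, 50, 53),
}
for name, (dims, hz, fast_tokens) in rows.items():
    naive = dims * hz  # one-second chunk -> D * Hz naive tokens
    print(f"{name}: {naive} naive vs {fast_tokens} FAST "
          f"({naive / fast_tokens:.1f}x compression)")
```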
Key results include:
- High-frequency policy robustness: Naive binning fails on tasks >20 Hz; FAST maintains low reconstruction MSE up to 800 Hz.
- Sample and compute efficiency: On Table Bussing, π₀+FAST reaches comparable task success with a fraction of the gradient updates required by the diffusion-based π₀; in large-scale multitask training (10k hours of data), π₀+FAST matches the final performance of the diffusion variant at substantially lower GPU compute.
- Zero-shot generalization: π₀+FAST attains a strong average rubric score in the first zero-shot evaluation on unseen DROID environments, outperforming prior supervised baselines (Pertsch et al., 16 Jan 2025).
6. Design Considerations and Ablation Analyses
- Chunk Length: 1-second chunks offer a balance between compression and long-horizon consistency. Shorter chunks slightly reduce token count, but degrade temporal coherence.
- Flattening Order: "Column-first" (interleaving dimensions per frequency) yields superior rollout stability versus row-first ordering.
- BPE Ablation: Direct tokenization of each quantized coefficient without BPE still compresses but results in more tokens—primarily zeros—thus harming sample efficiency.
- Control Frequency: FAST consistently maintains low reconstruction MSE from 25 to 800 Hz; naive binning incurs rapidly deteriorating performance above 100 Hz.
- Inference Latency: Per-chunk autoregressive decoding requires approximately 750 ms for 30–60 tokens, compared to roughly 100 ms for diffusion π₀; further acceleration is possible via speculative decoding or quantized kernels.
7. Summary and Significance
FAST provides a DCT-quantization–BPE pipeline for transforming continuous robot trajectories into discrete, compact token sequences that are invertible and tunable. By alleviating the vanishing-information problem associated with naive discretizations and reducing the required autoregressive horizon, FAST enables efficient and scalable work with off-the-shelf vision-language Transformers in robotic control. The methodology supports broad applicability across robot morphologies and control frequencies, as demonstrated by the universal FAST⁺ tokenizer and its empirical performance on challenging dexterous and long-horizon tasks (Pertsch et al., 16 Jan 2025).