BEAST: B-Spline Action Sequence Tokenizer
- The paper introduces BEAST as a deterministic, parameter-free approach that leverages B-spline curves to encode both continuous and categorical action sequences.
- It offers two tokenization schemes, continuous and discrete, that enable parallel decoding and integrate with models such as Transformers and conditional VAEs.
- Empirical benchmarks show BEAST achieving superior imitation learning performance with higher success rates and reduced inference latency compared to traditional methods.
The B-Spline Encoded Action Sequence Tokenizer (BEAST) is a deterministic, parameter-free approach for representing high-frequency, continuous or categorical action sequences as compact, fixed-length token sequences. It leverages B-spline curves to encode action chunks, ensuring smoothness, continuity, and computational efficiency. BEAST eliminates the need for dedicated tokenizer network training, supports both discrete and continuous tokenization, and enables parallel decoding for high-throughput imitation learning and sequential modeling tasks (Zhou et al., 6 Jun 2025, Reinwardt et al., 2 Jan 2026).
1. Mathematical Foundations and Core Formulation
BEAST models an action trajectory $a_{1:T}$, with $a_t \in \mathbb{R}^D$, by fitting clamped B-splines of degree $p$ with $N$ basis functions, yielding $N$ control points per chunk. The vector-valued B-spline parameterization is

$$y(u) = \sum_{n=0}^{N-1} B_{n,p}(u)\, c_n, \qquad u \in [0, 1],$$

where $c_n \in \mathbb{R}^D$ are the fitted control point coefficients and $B_{n,p}$ are degree-$p$ basis functions, defined recursively using the Cox–de Boor formula. The knot vector is uniform and clamped: each boundary knot is repeated so that it has multiplicity $p+1$, which enforces $y(0) = c_0$ and $y(1) = c_{N-1}$, thus guaranteeing endpoint interpolation and continuity within and between chunks.
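The recursion and the clamped knot layout can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function names `clamped_uniform_knots` and `basis` are hypothetical, indices are 0-based, and the curve parameter `u` is assumed to be normalized to $[0, 1]$.

```python
def clamped_uniform_knots(num_basis: int, degree: int) -> list[float]:
    """Clamped uniform knot vector on [0, 1] with num_basis + degree + 1 knots."""
    if num_basis <= degree:
        raise ValueError("need more basis functions than the spline degree")
    num_interior = num_basis - degree - 1
    interior = [(i + 1) / (num_interior + 1) for i in range(num_interior)]
    # Each boundary knot appears degree + 1 times (the "clamping").
    return [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)


def basis(n: int, p: int, u: float, knots: list[float]) -> float:
    """Cox-de Boor recursion for the n-th degree-p basis function at u."""
    if p == 0:
        # Half-open spans; the last non-empty span is closed so u = 1 is covered.
        is_last = knots[n] < knots[n + 1] == knots[-1]
        inside = knots[n] <= u < knots[n + 1]
        return 1.0 if inside or (is_last and u == knots[-1]) else 0.0
    value = 0.0
    d_left = knots[n + p] - knots[n]
    d_right = knots[n + p + 1] - knots[n + 1]
    # Terms with a zero-width denominator are dropped (the 0/0 := 0 convention).
    if d_left > 0.0:
        value += (u - knots[n]) / d_left * basis(n, p - 1, u, knots)
    if d_right > 0.0:
        value += (knots[n + p + 1] - u) / d_right * basis(n + 1, p - 1, u, knots)
    return value
```

With a clamped knot vector, only the first basis function is non-zero at `u = 0` and only the last at `u = 1`, which is why the curve passes through the first and last control points.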
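For encoding, the regression problem for each DoF involves constructing the basis matrix $\Phi \in \mathbb{R}^{T \times N}$ (with entries $\Phi_{tn} = B_{n,p}(u_t)$, the $n$-th degree-$p$ basis function evaluated at normalized time $u_t$) and solving

$$c^\ast = \arg\min_{c \in \mathbb{R}^N} \lVert \Phi c - a \rVert_2^2,$$

where $a \in \mathbb{R}^T$ stacks that DoF's $T$ action samples, with closed-form solution $c^\ast = (\Phi^\top \Phi)^{-1} \Phi^\top a$, or the ridge-regularized solution $c^\ast = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top a$ when the problem is ill-conditioned.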
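The least-squares encoding step can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the reference implementation: `basis_matrix` and `fit_control_points` are hypothetical names, action samples are assumed to be evenly spaced on $[0, 1]$, and the degree (3) and ridge weight (`1e-8`) in the example are illustrative choices. Because every DoF shares the same basis matrix, the independent per-DoF problems are solved in one call.

```python
import numpy as np


def basis_matrix(times, num_basis: int, degree: int) -> np.ndarray:
    """Return the (T, N) matrix of clamped uniform B-spline basis values."""
    num_interior = num_basis - degree - 1
    interior = np.linspace(0.0, 1.0, num_interior + 2)[1:-1]
    knots = np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])
    u = np.asarray(times, dtype=float)
    # Degree-0 indicators on half-open spans; the last non-empty span owns u = 1.
    B = ((u[:, None] >= knots[:-1]) & (u[:, None] < knots[1:])).astype(float)
    B[u == 1.0, num_basis - 1] = 1.0
    # Raise the degree one step at a time (iterative Cox-de Boor).
    for k in range(1, degree + 1):
        new = np.zeros((u.shape[0], B.shape[1] - 1))
        for n in range(B.shape[1] - 1):
            d_left = knots[n + k] - knots[n]
            d_right = knots[n + k + 1] - knots[n + 1]
            if d_left > 0:
                new[:, n] += (u - knots[n]) / d_left * B[:, n]
            if d_right > 0:
                new[:, n] += (knots[n + k + 1] - u) / d_right * B[:, n + 1]
        B = new
    return B


def fit_control_points(actions, num_basis: int, degree: int, ridge: float = 0.0):
    """Solve (Phi^T Phi + ridge * I) C = Phi^T A for the (N, D) control points."""
    actions = np.asarray(actions, dtype=float)
    times = np.linspace(0.0, 1.0, actions.shape[0])
    phi = basis_matrix(times, num_basis, degree)
    gram = phi.T @ phi + ridge * np.eye(num_basis)
    return np.linalg.solve(gram, phi.T @ actions)


# Example: compress a 100-step, 2-DoF chunk into 15 control points per DoF.
t = np.linspace(0.0, 1.0, 100)
chunk = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
C = fit_control_points(chunk, num_basis=15, degree=3, ridge=1e-8)
reconstruction = basis_matrix(t, 15, 3) @ C
```

Solving the normal equations with `np.linalg.solve` mirrors the closed-form expression; `np.linalg.lstsq` would be a reasonable alternative when no ridge term is used.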
2. Token Construction, Quantization, and Decoding
BEAST supports two tokenization schemes:
- Continuous tokens: The real-valued matrix of control points (flattened) is used directly as the action token sequence (e.g., for VAEs).
- Discrete tokens: Each entry of the control-point matrix is quantized uniformly (e.g., to 8-bit values in $\{0, \dots, 255\}$), creating a fixed-length, discrete token sequence with a constant vocabulary size per position (Zhou et al., 6 Jun 2025).
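Reconstruction proceeds by dequantizing discrete tokens (if used), then evaluating the spline at the desired time grid via

$$\hat{a} = \Phi \hat{c},$$

where $\Phi$ is the basis matrix evaluated on that grid and $\hat{c}$ holds the (dequantized) control points, with the chunk's boundary tokens set to automatically enforce smooth, continuous transitions across segments.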
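The discrete round trip can be illustrated with a short NumPy sketch. It is an assumption-laden sketch, not the published code: the names `quantize` and `dequantize` are hypothetical, the normalization bounds `low` and `high` are assumed to come from dataset statistics, and row-major flattening (which interleaves DoFs) is one possible token ordering.

```python
import numpy as np


def quantize(control_points, low: float, high: float, bits: int = 8) -> np.ndarray:
    """Map an (N, D) control-point matrix to a flat vector of integer tokens."""
    levels = 2 ** bits - 1  # 255 for 8 bits, so tokens lie in {0, ..., 255}
    scaled = (np.asarray(control_points, dtype=float) - low) / (high - low)
    return np.rint(np.clip(scaled, 0.0, 1.0) * levels).astype(np.int64).ravel()


def dequantize(tokens, low: float, high: float, shape, bits: int = 8) -> np.ndarray:
    """Invert quantize() up to rounding error and restore the (N, D) shape."""
    levels = 2 ** bits - 1
    values = np.asarray(tokens, dtype=float) / levels * (high - low) + low
    return values.reshape(shape)


# Example: 15 control points for each of 7 DoFs, normalized to [-1, 1].
rng = np.random.default_rng(0)
C = rng.uniform(-1.0, 1.0, size=(15, 7))
tokens = quantize(C, low=-1.0, high=1.0)
C_hat = dequantize(tokens, low=-1.0, high=1.0, shape=C.shape)
```

For values inside `[low, high]`, the rounding error per control point is at most half a quantization step, `(high - low) / (2 * 255)` for 8 bits. The dequantized matrix would then be multiplied by the basis matrix to recover the action chunk.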
3. Model Integration and Parallel Decoding
BEAST is designed for seamless integration into major sequence modeling architectures:
- Conditional VAE: Observation is mapped to a latent, and the decoder predicts the sequence of continuous B-spline control vectors, minimizing a reconstruction loss on control points and a KL divergence term. The dimensionality reduction is substantial: e.g., mapping 100 raw actions down to 15 spline tokens with minimal fidelity loss.
- Decoder-only Transformer: Observations (vision, language, proprioception) are encoded; learned slots serve as action embeddings; all tokens are simultaneously predicted under a bidirectional attention mask, supporting parallel decoding.
- Encoder-decoder vision-language model (Florence-2/BEAST-F): The discrete token vocabulary extends the least-used text tokens; action tokens are embedded via learned positions, enabling parallel, bidirectional decoding (Zhou et al., 6 Jun 2025).
In all cases, parallel decoding of all tokens per chunk allows significant speedup in inference throughput and reduced latency compared to step-wise autoregressive models.
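The difference between step-wise and parallel decoding can be made concrete with a toy attention-mask sketch. Everything here is illustrative: `attention_mask` and `decoder_calls` are hypothetical helpers, and the sequence layout (observation tokens followed by action slots) is an assumption about how such a model might be arranged, not a description of the published architecture.

```python
import numpy as np


def attention_mask(num_obs: int, num_action: int, bidirectional: bool) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if token i may attend to token j."""
    total = num_obs + num_action
    full = np.ones((total, total), dtype=bool)
    # Bidirectional: every token sees every other token.
    # Causal: token i sees only tokens 0..i (lower triangle).
    return full if bidirectional else np.tril(full)


def decoder_calls(num_action: int, parallel: bool) -> int:
    """Forward passes needed to emit all action tokens of one chunk."""
    return 1 if parallel else num_action


# Example: 3 observation tokens followed by 15 action slots.
parallel_mask = attention_mask(3, 15, bidirectional=True)
causal_mask = attention_mask(3, 15, bidirectional=False)
```

Under these assumptions a causal model needs one forward pass per action token (15 here), while the bidirectional variant fills all slots in a single pass.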
4. Empirical Performance and Benchmarking
BEAST's efficacy is demonstrated across established simulation and real-robot benchmarks (Zhou et al., 6 Jun 2025):
- Simulation (166 tasks):
  - CALVIN: 99.8% first-task success (BEAST-F), outperforming prior approaches.
  - LIBERO: BEAST-F achieves 92.5% average, with top results in long-horizon tasks.
- Real-world (8 tasks, 3 robots):
  - On the Franka Challenge, BEAST-D scores 76.6% average success, exceeding competing baselines (FAST at 28.5%).
BEAST achieves large computational advantages:
- Inference throughput (RTX 4090): BEAST-F achieves 617 Hz at 0.019 s latency per action chunk, compared with 288 Hz (0.104 s) for π₀ and 130 Hz (0.341 s) for Diffusion Policy.
- Training efficiency: BEAST-F reaches 80% success within 20k steps on LIBERO-LONG, while FAST and π₀ lag behind.
Ablation shows optimal spline degree –$4$ and basis vectors for a 100 Hz control stream, balancing smoothness and token compactness. Discrete tokens outperform continuous by task success in BEAST-F.
5. Adaptive Segmentation and Generalization
Subsequent generalizations such as the B-Spline Adaptive Tokenizer (BSAT) (Reinwardt et al., 2 Jan 2026) demonstrate that BEAST can be extended to adaptively select knot placements based on per-chunk curvature, concentrating tokens in regions of high dynamics. The adaptive algorithm numerically estimates per-chunk curvature via second differences of the sequence, allocates knots by inverting a clipped, normalized CDF derived from these curvature features, and solves the standard (possibly ridge-regularized) spline least-squares fitting problem. This approach remains deterministic and parameter-free aside from the action embedding table for categorical variables.
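One possible reading of the curvature-driven knot allocation is sketched below with NumPy. It is a hedged illustration, not the BSAT implementation: the function name `adaptive_interior_knots`, the quantile used for clipping (`clip_quantile`), and the small additive `floor` that keeps the CDF strictly increasing are all assumptions introduced here.

```python
import numpy as np


def adaptive_interior_knots(series, num_interior: int,
                            clip_quantile: float = 0.95,
                            floor: float = 1e-3) -> np.ndarray:
    """Place interior knots on [0, 1] where the 1-D series bends the most."""
    x = np.asarray(series, dtype=float)
    t = np.linspace(0.0, 1.0, x.shape[0])
    # Curvature proxy: absolute second differences (zero at the two endpoints).
    curvature = np.zeros_like(x)
    curvature[1:-1] = np.abs(x[2:] - 2.0 * x[1:-1] + x[:-2])
    # Clip outliers so a single spike cannot absorb every knot.
    curvature = np.minimum(curvature, np.quantile(curvature, clip_quantile))
    # The positive floor keeps the CDF strictly increasing, hence invertible.
    cdf = np.cumsum(curvature + floor)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])
    # Invert the CDF at evenly spaced probability levels.
    targets = np.linspace(0.0, 1.0, num_interior + 2)[1:-1]
    return np.interp(targets, cdf, t)


# Example: flat first half, oscillating second half.
t = np.linspace(0.0, 1.0, 200)
signal = np.where(t < 0.5, 0.0, np.sin(40.0 * (t - 0.5)))
knots = adaptive_interior_knots(signal, num_interior=8)
```

For a constant series the curvature is zero everywhere, so the CDF is linear and the knots fall back to uniform spacing; for the example signal the knots concentrate in the oscillating half. The resulting interior knots would then be padded with clamped boundary knots before the least-squares fit.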
For categorical actions, BEAST first embeds action indices into vectors before spline fitting. The pipeline—embedding, curvature computation, adaptive knot selection, spline fit—remains unchanged. Missing or irregularly sampled actions can be handled by masking or imputation during fit.
6. Positional Encoding and Attention
To enable layered, multi-resolution attention over non-uniform token sequences, BEAST (as extended within BSAT) utilizes a hybrid positional encoding: each token embedding is the result of a learned linear mapping from its coefficient and (normalized) center, added to a learned absolute position embedding, and augmented with Rotary Positional Encoding (RoPE) using a per-layer learnable frequency base. This ensures both explicit absolute ordering and relative biases matched to actual token centers, empowering Transformer layers to focus on distinct temporal scales (Reinwardt et al., 2 Jan 2026).
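The rotary component of this encoding can be sketched as follows, using real-valued token centers as positions. This is a simplified illustration under assumptions: `rope` is a hypothetical helper using the common "rotate-half" formulation, the frequency `base` is passed in as a plain number standing in for the per-layer learnable value, and the learned linear and absolute position embeddings are omitted.

```python
import numpy as np


def rope(x, positions, base: float = 10000.0) -> np.ndarray:
    """Rotate feature pairs of x (tokens, dim) by angles proportional to positions."""
    x = np.asarray(x, dtype=float)
    if x.shape[-1] % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = x.shape[-1] // 2
    # One rotation frequency per feature pair, geometrically spaced.
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(positions, dtype=float)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


# Example: non-uniform token centers (hypothetical values) for four tokens.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
centers = np.array([0.05, 0.40, 0.45, 0.90])
rotated = rope(queries, centers)
```

Because each feature pair is only rotated, vector norms are unchanged, and the dot product between a rotated query and a rotated key depends on the difference of their centers rather than on their absolute values. That property is what lets attention scores reflect the actual spacing of non-uniformly placed tokens.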
7. Practical Implementation Guidelines
For best results, the following settings are recommended:
- Spline degree –$3$ for 100 Hz data, with higher degrees for noisier trajectories.
- Selection of control points per chunk such that –20.
- Independent encoding per DoF for multi-DoF robots, with interleaving of flattened tokens.
- Discrete quantization (typically 8 bits) for best performance with Transformers; continuous tokens for VAE-style models.
- Deterministic or data-driven (curvature-based) knot placement, depending on application context.
BEAST is released as open-source and can function as a drop-in tokenizer for sequence modeling pipelines in imitation learning and long sequential action modeling (Zhou et al., 6 Jun 2025).
References:
- Zhou et al., "BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning," 6 Jun 2025.
- Reinwardt et al., "BSAT: B-Spline Adaptive Tokenizer for Long-Term Time Series Forecasting," 2 Jan 2026.