
BEAST: B-Spline Action Sequence Tokenizer

Updated 19 March 2026
  • The paper introduces BEAST as a deterministic, parameter-free approach that leverages B-spline curves to encode both continuous and categorical action sequences.
  • It offers dual tokenization schemes—continuous and discrete—that enable parallel decoding and seamless integration with models like Transformer and conditional VAEs.
  • Empirical benchmarks show BEAST achieving superior imitation learning performance with higher success rates and reduced inference latency compared to traditional methods.

The B-Spline Encoded Action Sequence Tokenizer (BEAST) is a deterministic, parameter-free approach for representing high-frequency, continuous or categorical action sequences as compact, fixed-length token sequences. It leverages B-spline curves to encode action chunks, ensuring smoothness, continuity, and computational efficiency. BEAST eliminates the need for dedicated tokenizer network training, supports both discrete and continuous tokenization, and enables parallel decoding for high-throughput imitation learning and sequential modeling tasks (Zhou et al., 6 Jun 2025, Reinwardt et al., 2 Jan 2026).

1. Mathematical Foundations and Core Formulation

BEAST models an action trajectory $a_{1:T} = [a_1, \dots, a_T]$, $a_t \in \mathbb{R}^D$, by fitting clamped B-splines of degree $p$ with $N$ basis functions, yielding $N$ control points per chunk. The vector-valued B-spline parameterization is

$$C(t) = \sum_{i=0}^{N-1} c_i B_{i,p}(t), \quad t \in [0, 1]$$

where $c_i \in \mathbb{R}^D$ are the fitted control points and $B_{i,p}(t)$ are the degree-$p$ basis functions, defined recursively by the Cox–de Boor formula. The knot vector $\tau = (\tau_0, \dots, \tau_{N+p})$ is uniform and clamped: the boundary knots are repeated $p+1$ times to enforce $C(0)=c_0$ and $C(1)=c_{N-1}$, which guarantees endpoint interpolation and $C^{p-1}$ continuity within and between chunks.

For encoding, each DoF is fitted by linear regression: construct the basis matrix $\Phi \in \mathbb{R}^{T \times N}$ with $\Phi_{t,i} = B_{i,p}(t/T)$ and solve

$$c^* = \arg\min_c \|\Phi c - a_{1:T}\|_2^2$$

which has a closed-form least-squares solution; a ridge-regularized variant is used when the problem is ill-conditioned.
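
The encoding step above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released implementation: the function names, the default `ridge=1e-6`, and the `(T, D)` array layout are all hypothetical choices made for the example.

```python
import numpy as np

def clamped_uniform_knots(n_basis, degree):
    """Clamped, uniform knot vector on [0, 1] with n_basis + degree + 1 knots."""
    n_interior = n_basis - degree - 1
    interior = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    return np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

def bspline_basis_matrix(u, n_basis, degree):
    """Evaluate every B_{i,p}(u) with the Cox-de Boor recursion."""
    u = np.asarray(u, dtype=float)
    knots = clamped_uniform_knots(n_basis, degree)
    # Degree 0: indicator of the half-open knot span [tau_i, tau_{i+1}).
    B = np.zeros((u.size, len(knots) - 1))
    for i in range(len(knots) - 1):
        B[:, i] = (knots[i] <= u) & (u < knots[i + 1])
    # Close the last non-empty span so that u == 1 is covered.
    B[u == knots[-1], n_basis - 1] = 1.0
    for p in range(1, degree + 1):
        B_next = np.zeros((u.size, len(knots) - 1 - p))
        for i in range(len(knots) - 1 - p):
            left_den = knots[i + p] - knots[i]
            right_den = knots[i + p + 1] - knots[i + 1]
            if left_den > 0:   # skip terms with repeated knots (0/0 := 0)
                B_next[:, i] += (u - knots[i]) / left_den * B[:, i]
            if right_den > 0:
                B_next[:, i] += (knots[i + p + 1] - u) / right_den * B[:, i + 1]
        B = B_next
    return B  # shape (len(u), n_basis)

def fit_control_points(actions, n_basis=10, degree=3, ridge=1e-6):
    """Ridge-regularized least squares, solved jointly for all DoF columns."""
    actions = np.asarray(actions, dtype=float)   # assumed layout: (T, D)
    T = actions.shape[0]
    u = np.arange(1, T + 1) / T                  # Phi[t, i] = B_{i,p}(t / T)
    Phi = bspline_basis_matrix(u, n_basis, degree)
    gram = Phi.T @ Phi + ridge * np.eye(n_basis)
    return np.linalg.solve(gram, Phi.T @ actions)  # shape (N, D)
```

Because the least-squares problem decouples across columns of `actions`, solving all DoFs in one call is equivalent to fitting each DoF independently.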

2. Token Construction, Quantization, and Decoding

BEAST supports two tokenization schemes:

  • Continuous tokens: The real-valued matrix of control points $P \in \mathbb{R}^{D \times N}$ (flattened) is used directly as the action token sequence (e.g., for VAEs).
  • Discrete tokens: Each entry of $P$ is quantized uniformly (e.g., to 8-bit values in $[0, 255]$), creating a fixed-length, discrete token sequence with a constant vocabulary size per position (Zhou et al., 6 Jun 2025).
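
A minimal sketch of the uniform quantization used for discrete tokens follows. The normalization bounds `low` and `high` are assumptions introduced for the example (the text only states that entries are quantized uniformly to 8-bit values), as are the function names.

```python
import numpy as np

def quantize(ctrl, low, high, n_bins=256):
    """Map real-valued control points to integer tokens in [0, n_bins - 1]."""
    scaled = (np.asarray(ctrl, dtype=float) - low) / (high - low)
    scaled = np.clip(scaled, 0.0, 1.0)   # values outside [low, high] saturate
    return np.round(scaled * (n_bins - 1)).astype(np.int64)

def dequantize(tokens, low, high, n_bins=256):
    """Map integer tokens back to the centers of their quantization bins."""
    return low + np.asarray(tokens) / (n_bins - 1) * (high - low)

# Usage: the round-trip error is at most half a quantization step.
rng = np.random.default_rng(0)
ctrl = rng.uniform(-1.0, 1.0, size=(10, 7))   # hypothetical (N, D) control points
tokens = quantize(ctrl, low=-1.0, high=1.0)
recovered = dequantize(tokens, low=-1.0, high=1.0)
```

With 256 bins over a range of width 2, the step is 2/255, so each recovered control point differs from the original by at most 1/255.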

Reconstruction proceeds by dequantizing discrete tokens (if used), then evaluating the spline at the desired time grid via

$$\hat{a}_t = \sum_{i=0}^{N-1} P_i B_{i,p}(t/T)$$

where $P_i$ is the $i$-th column of $P$, with each chunk's boundary tokens set so that transitions across segments are smooth and $C^{p-1}$-continuous.
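
Reconstruction can be sketched with de Boor's algorithm, which evaluates the spline at one parameter value without building the full basis matrix. This is an illustrative sketch only: the helper names and the `(N, D)` control-point layout (the transpose of $P$ as written above) are assumptions, and any dequantization is assumed to have already happened.

```python
import numpy as np

def clamped_uniform_knots(n_ctrl, degree):
    """Clamped, uniform knot vector on [0, 1] with n_ctrl + degree + 1 knots."""
    interior = np.linspace(0.0, 1.0, n_ctrl - degree + 1)[1:-1]
    return np.concatenate([np.zeros(degree + 1), interior, np.ones(degree + 1)])

def de_boor(u, knots, ctrl, degree):
    """Evaluate the spline at a single parameter u in [0, 1]."""
    n_ctrl = ctrl.shape[0]
    # Index k of the knot span containing u, clamped to the valid range.
    k = int(np.searchsorted(knots, u, side="right")) - 1
    k = min(max(k, degree), n_ctrl - 1)
    # Only degree + 1 control points influence the value on this span.
    d = [ctrl[j + k - degree].astype(float) for j in range(degree + 1)]
    for r in range(1, degree + 1):
        for j in range(degree, r - 1, -1):
            lo = knots[j + k - degree]
            hi = knots[j + 1 + k - r]
            alpha = (u - lo) / (hi - lo)
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
    return d[degree]

def decode_chunk(ctrl, horizon, degree=3):
    """Evaluate the spline on the grid t / horizon for t = 1..horizon."""
    ctrl = np.asarray(ctrl, dtype=float)   # assumed layout: (N, D)
    knots = clamped_uniform_knots(ctrl.shape[0], degree)
    return np.stack([de_boor(t / horizon, knots, ctrl, degree)
                     for t in range(1, horizon + 1)])
```

Because the knot vector is clamped, evaluating at $u = 0$ and $u = 1$ returns the first and last control points exactly, which is the endpoint-interpolation property that the boundary tokens rely on.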

3. Model Integration and Parallel Decoding

BEAST is designed for seamless integration into major sequence modeling architectures:

  • Conditional VAE: Observation is mapped to a latent, and the decoder predicts the sequence of continuous B-spline control vectors, minimizing a reconstruction loss on control points and a KL divergence term. The dimensionality reduction is substantial: e.g., mapping 100 raw actions down to 15 spline tokens with minimal fidelity loss.
  • Decoder-only Transformer: Observations (vision, language, proprioception) are encoded; $D N$ learned slots serve as action embeddings; all tokens are predicted simultaneously under a bidirectional attention mask, supporting parallel decoding.
  • Encoder-decoder Vision-LLM (Florence-2/BEAST-F): Discrete token vocabulary extends least-used text tokens; action tokens are embedded via learned position, enabling parallel, bidirectional decoding (Zhou et al., 6 Jun 2025).

In all cases, parallel decoding of all $D N$ tokens per chunk allows a significant speedup in inference throughput and lower latency than step-wise autoregressive models.
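
The difference between a bidirectional and a causal attention mask can be illustrated with a toy single-head attention in NumPy. This is only a sketch of the masking idea, not the BEAST architecture: the sizes, the random embeddings, and the use of the same matrix for queries, keys, and values are all simplifying assumptions.

```python
import numpy as np

def attention(q, k, v, mask):
    """Single-head scaled dot-product attention; mask[i, j] = True lets i see j."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

n_obs, n_act, dim = 4, 6, 8        # hypothetical sizes; n_act stands in for D N
rng = np.random.default_rng(0)
# Observation tokens followed by the action slots.
x = rng.normal(size=(n_obs + n_act, dim))

seq_len = n_obs + n_act
bidirectional = np.ones((seq_len, seq_len), dtype=bool)  # every token sees all
causal = np.tril(bidirectional)                          # token i sees j <= i

# One pass with the bidirectional mask updates all action slots at once,
# and every slot can condition on every other slot.
out_parallel = attention(x, x, x, bidirectional)
out_causal = attention(x, x, x, causal)
```

Under the causal mask, an autoregressive decoder has to emit the action tokens one at a time, since each token is fed back as input for the next; with learned slots and a bidirectional mask, a single forward pass produces all of them.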

4. Empirical Performance and Benchmarking

BEAST's efficacy is demonstrated across established simulation and real-robot benchmarks (Zhou et al., 6 Jun 2025):

  • Simulation (166 tasks):
    • CALVIN: 99.8% first-task success (BEAST-F), outperforming prior approaches.
    • LIBERO: BEAST-F achieves 92.5% average, with top results in long-horizon tasks.
  • Real-world (8 tasks, 3 robots):
    • On the Franka Challenge, BEAST-D scores 76.6% average success, exceeding competing baselines (FAST at 28.5%).

BEAST achieves large computational advantages:

  • Inference throughput (RTX 4090): BEAST-F achieves 617 Hz at 0.019 s latency per action chunk, compared with 288 Hz (0.104 s) for π₀ and 130 Hz (0.341 s) for Diffusion Policy.
  • Training efficiency: BEAST-F reaches 80% success within 20k steps on LIBERO-LONG, while FAST and π₀ lag behind.

Ablation shows optimal spline degree $p \approx 3$–$4$ and $N \approx 10$ basis functions for a 100 Hz control stream, balancing smoothness and token compactness. Discrete tokens outperform continuous tokens by $\approx 12.7\%$ task success in BEAST-F.

5. Adaptive Segmentation and Generalization

Subsequent generalizations such as the B-Spline Adaptive Tokenizer (BSAT) (Reinwardt et al., 2 Jan 2026) demonstrate that BEAST can be extended to adaptively select knot placements based on per-chunk curvature, concentrating tokens in regions of high dynamics. The adaptive algorithm numerically estimates per-chunk curvature via second differences of the sequence, allocates knots by inverting a clipped, normalized CDF derived from these curvature features, and solves the standard (possibly ridge-regularized) spline least-squares fitting problem. This approach remains deterministic and parameter-free aside from the action embedding table for categorical variables.

For categorical actions, BEAST first embeds action indices into $\mathbb{R}^d$ vectors before spline fitting. The pipeline (embedding, curvature computation, adaptive knot selection, spline fit) remains unchanged. Missing or irregularly sampled actions can be handled by masking or imputation during fitting.
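
The curvature-driven knot allocation described in this section can be sketched as follows. This is a hedged illustration of the general recipe (second differences, clipping, normalized CDF, inverse-CDF sampling), not the BSAT reference implementation: the `clip_quantile` and `floor` arguments, the edge padding, and the function name are assumptions made for the example.

```python
import numpy as np

def adaptive_interior_knots(actions, n_interior, clip_quantile=0.95, floor=1e-3):
    """Place interior knots on (0, 1), denser where the chunk bends more."""
    actions = np.asarray(actions, dtype=float)        # assumed layout: (T, D)
    T = actions.shape[0]
    # Curvature proxy: norm of the second differences, padded back to length T.
    second = np.diff(actions, n=2, axis=0)
    curvature = np.pad(np.linalg.norm(second, axis=1), 1, mode="edge")
    # Clip outliers, then add a floor so flat regions keep a nonzero density.
    curvature = np.minimum(curvature, np.quantile(curvature, clip_quantile)) + floor
    # Normalized CDF of the curvature density over the chunk.
    cdf = np.cumsum(curvature)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])
    # Invert the CDF at equally spaced levels to obtain the knot positions.
    u = np.linspace(0.0, 1.0, T)
    levels = np.linspace(0.0, 1.0, n_interior + 2)[1:-1]
    return np.interp(levels, cdf, u)

# Usage: a chunk that is flat in its first half and oscillates in its second.
flat = np.zeros(50)
wiggle = np.sin(np.linspace(0.0, 6.0 * np.pi, 50))
chunk = np.concatenate([flat, wiggle])[:, None]
knots = adaptive_interior_knots(chunk, n_interior=8)
```

The resulting interior knots would then be combined with the clamped boundary knots and passed to the same least-squares fit as in the uniform case. For categorical actions, `actions` would hold the embedded vectors rather than raw indices.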

6. Positional Encoding and Attention

To enable layered, multi-resolution attention over non-uniform token sequences, BEAST (as extended within BSAT) utilizes a hybrid positional encoding: each token embedding is the result of a learned linear mapping from its coefficient and (normalized) center, added to a learned absolute position embedding, and augmented with Rotary Positional Encoding (RoPE) using a per-layer learnable frequency base. This ensures both explicit absolute ordering and relative biases matched to actual token centers, empowering Transformer layers to focus on distinct temporal scales (Reinwardt et al., 2 Jan 2026).
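
The hybrid encoding can be sketched in NumPy as below. This is a rough illustration under several assumptions: each token is taken to carry a scalar coefficient and a normalized center, the weights are random stand-ins for learned parameters, RoPE positions are the centers scaled by the token count, and the `base` argument plays the role of the per-layer learnable frequency base. None of these details are confirmed by the text.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary positional encoding at (possibly non-integer) positions."""
    dim = x.shape[-1]                                  # must be even
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin          # rotate each 2-D pair
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def embed_tokens(coeffs, centers, W, b, abs_pos):
    """Linear map of (coefficient, center) plus an absolute position embedding."""
    feats = np.stack([coeffs, centers], axis=1)        # (n_tokens, 2)
    return feats @ W + b + abs_pos[: len(coeffs)]

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8                                   # hypothetical sizes
coeffs = rng.normal(size=n_tokens)
centers = np.sort(rng.uniform(size=n_tokens))          # non-uniform token centers
W = rng.normal(size=(2, dim))                          # stand-in for learned weights
b = np.zeros(dim)
abs_pos = rng.normal(size=(n_tokens, dim))             # stand-in for learned table

emb = embed_tokens(coeffs, centers, W, b, abs_pos)
queries = rope(emb, centers * n_tokens, base=10000.0)
```

Because RoPE rotates queries and keys by angles proportional to their positions, the attention score between two tokens depends only on the difference of their centers, which is how the relative bias follows the actual, non-uniform token spacing.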

7. Practical Implementation Guidelines

For best results, the following settings are recommended:

  • Spline degree $p=2$–$3$ for 100 Hz data, with higher degrees for noisier trajectories.
  • Selection of $N$ control points per chunk such that $T/N = 8$–$20$.
  • Independent encoding per DoF for multi-DoF robots, with interleaving of flattened tokens.
  • Discrete quantization (typically 8 bits) for best performance with Transformers; continuous tokens for VAE-style models.
  • Deterministic or data-driven (curvature-based) knot placement, depending on application context.
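
As a small worked example of the $T/N$ guideline, the helper below computes the range of control-point counts that keeps the compression ratio between 8 and 20. The function name and the extra lower bound $N \ge p + 1$ (needed for a clamped spline of degree $p$ to exist) are additions made for this sketch.

```python
import math

def control_point_range(chunk_len, degree=3, min_ratio=8, max_ratio=20):
    """Smallest and largest N with min_ratio <= chunk_len / N <= max_ratio."""
    lowest = max(degree + 1, math.ceil(chunk_len / max_ratio))
    highest = math.floor(chunk_len / min_ratio)
    return lowest, highest

print(control_point_range(100))  # (5, 12)
```

For a chunk of 100 actions this gives between 5 and 12 control points per DoF, which brackets the $N \approx 10$ reported in the ablation.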

BEAST is released as open-source and can function as a drop-in tokenizer for sequence modeling pipelines in imitation learning and long sequential action modeling (Zhou et al., 6 Jun 2025).

