B-Spline Adaptive Tokenizer (BSAT)
- The paper introduces BSAT as a parameter-free method that uses adaptive B-spline tokenization to efficiently compress time series while preserving key features.
- BSAT tokenizes data by fitting B-spline curves to fixed look-back windows and dynamically allocating knots in regions of high curvature.
- Its hybrid positional encoding, combining learned absolute and layer-wise rotary embeddings (L-RoPE), improves transformer-based forecasts, while the compressed token sequence reduces computational cost.
The B-Spline Adaptive Tokenizer (BSAT) is a parameter-free method for segmenting time series data into variable-length tokens by fitting B-spline approximations to fixed look-back windows. BSAT was developed to address limitations of transformer-based long-term forecasting, specifically the quadratic complexity of self-attention and the misalignment between uniform patching and underlying signal structure. By concentrating token placement in regions of high curvature and encoding both coefficient and position per basis, BSAT constructs fixed-size, semantically meaningful tokens, thereby enabling efficient compression and accurate retention of salient sequence features. Its design incorporates a hybrid positional encoding scheme with both learned absolute and layer-wise rotary embeddings (L-RoPE), facilitating multi-resolution temporal attention in transformer architectures. Empirical results demonstrate competitive forecasting performance and substantial reductions in memory and runtime, particularly at extreme compression rates (Reinwardt et al., 2 Jan 2026).
1. B-Spline Tokenization Mechanism
BSAT decomposes a time series segment into a parametrized B-spline curve. Given a normalized parameter domain $t \in [0, 1]$, a nondecreasing clamped knot vector is constructed as
$$\boldsymbol{\tau} = \big(\underbrace{0, \dots, 0}_{p+1},\; \tau_{p+1}, \dots, \tau_{K-1},\; \underbrace{1, \dots, 1}_{p+1}\big),$$
where $K$ is the number of basis functions and $p$ the spline degree (typically $p = 3$). The Cox–de Boor recursion yields the B-spline basis $B_{k,p}(t)$ by staged interpolation:
$$B_{k,0}(t) = \mathbf{1}[\tau_k \le t < \tau_{k+1}], \qquad B_{k,p}(t) = \frac{t - \tau_k}{\tau_{k+p} - \tau_k}\, B_{k,p-1}(t) + \frac{\tau_{k+p+1} - t}{\tau_{k+p+1} - \tau_{k+1}}\, B_{k+1,p-1}(t).$$
A least-squares fit to the $L$ observed sequence samples produces coefficients $\mathbf{c} \in \mathbb{R}^{K}$, potentially regularized by a ridge term $\lambda \ge 0$ if the system is ill-conditioned:
$$\mathbf{c} = \big(\mathbf{B}^\top \mathbf{B} + \lambda \mathbf{I}\big)^{-1} \mathbf{B}^\top \mathbf{y},$$
where $\mathbf{B} \in \mathbb{R}^{L \times K}$ with $\mathbf{B}_{jk} = B_{k,p}(t_j)$ and $\mathbf{y} = (y_1, \dots, y_L)^\top$. This compact representation enables adaptive summarization of the time series.
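The basis construction and ridge fit above can be sketched directly; the following is a minimal numpy implementation of the Cox–de Boor recursion and the regularized least-squares solve (function names and the right-endpoint handling are illustrative choices, not the paper's code):

```python
import numpy as np

def bspline_basis(t, knots, p):
    """Evaluate all K = len(knots) - p - 1 B-spline basis functions of
    degree p at the points t via the Cox-de Boor recursion.
    Returns the (len(t), K) design matrix B with B[j, k] = B_{k,p}(t_j)."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # Degree 0: indicator of each half-open knot span.
    B = np.zeros((len(t), len(knots) - 1))
    for k in range(len(knots) - 1):
        B[:, k] = (knots[k] <= t) & (t < knots[k + 1])
    # Close the last non-degenerate span at the right endpoint.
    last = int(np.max(np.nonzero(np.diff(knots) > 0)[0]))
    at_end = np.isclose(t, knots[-1])
    B[at_end, :] = 0.0
    B[at_end, last] = 1.0
    # Cox-de Boor: raise the degree one step at a time.
    for d in range(1, p + 1):
        Bn = np.zeros((len(t), len(knots) - d - 1))
        for k in range(len(knots) - d - 1):
            left = knots[k + d] - knots[k]
            right = knots[k + d + 1] - knots[k + 1]
            if left > 0:
                Bn[:, k] += (t - knots[k]) / left * B[:, k]
            if right > 0:
                Bn[:, k] += (knots[k + d + 1] - t) / right * B[:, k + 1]
        B = Bn
    return B

def fit_coefficients(y, B, lam=0.0):
    """Ridge-regularized least squares: c = (B^T B + lam*I)^{-1} B^T y."""
    K = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(K), B.T @ y)
```

With a clamped cubic knot vector, the returned basis sums to one everywhere (partition of unity), and `fit_coefficients` recovers a smooth window to small residual even with $K \ll L$.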
2. Adaptive Knot Placement and Feature Preservation
BSAT allocates knots according to local signal curvature. The “mass” of curvature is computed on intervals $I_i = [t_i, t_{i+1}]$ via
$$m_i = \int_{I_i} |y''(t)|\, dt \;\approx\; |y_{i+1} - 2y_i + y_{i-1}|,$$
subject to clipping to control over-concentration:
$$\tilde{m}_i = \min(m_i,\, c_{\max}),$$
with $c_{\max}$ an upper clipping threshold. Knot positions are then selected via inversion of the cumulative distribution of $\tilde{m}_i$ at uniform quantiles, emphasizing high-curvature subsequences (e.g., peaks or rapid transitions). This methodology preserves critical features of the underlying time series that may be lost with uniform down-sampling.
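The placement step can be sketched as a discrete inverse-CDF lookup over the clipped curvature mass; in this sketch the quantile-based clipping rule and the small floor term are assumptions, not details from the paper:

```python
import numpy as np

def adaptive_internal_knots(y, n_internal, clip_quantile=0.95):
    """Place internal knots by inverting the cumulative distribution of a
    clipped curvature 'mass' at uniform quantiles (a sketch of BSAT's
    scheme; the clipping rule here is an assumed choice)."""
    m = np.abs(np.diff(y, 2))                         # |y_{i+1} - 2 y_i + y_{i-1}|
    m = np.minimum(m, np.quantile(m, clip_quantile))  # clip over-concentration
    m = m + 1e-12                                     # keep the CDF strictly increasing
    cdf = np.concatenate([[0.0], np.cumsum(m)])
    cdf /= cdf[-1]
    grid = np.linspace(0.0, 1.0, len(cdf))            # normalized parameter positions
    q = np.linspace(0.0, 1.0, n_internal + 2)[1:-1]   # uniform quantiles, ends excluded
    return np.interp(q, cdf, grid)                    # inverse-CDF lookup
```

On a signal with a single sharp bump, the returned knots cluster around the bump, which is exactly the feature-preservation behavior described above.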
3. Token Representation and Embedding Construction
Each B-spline basis function $B_{k,p}$ yields a token $(c_k, \xi_k)$, where $c_k$ is its fitted coefficient and the central position $\xi_k$ is computed as the Greville abscissa
$$\xi_k = \frac{\tau_{k+1} + \tau_{k+2} + \cdots + \tau_{k+p}}{p},$$
transposed to the time index domain. The token is embedded into a $d$-dimensional vector by applying separate linear transformations to $c_k$ and $\xi_k$ and concatenating their outputs:
$$\mathbf{e}_k = \big[\, W_c\, c_k \;\|\; W_p\, \xi_k \,\big] \in \mathbb{R}^{d},$$
where $W_c$ and $W_p$ are learned projections.
This configuration enables variable-length input segments to be faithfully and efficiently mapped into fixed-dimensional transformer-compatible sequences.
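A minimal sketch of the token construction, assuming the Greville-abscissa positions and equal-width ($d/2$) channels for coefficient and position (the weight shapes are illustrative):

```python
import numpy as np

def greville_abscissae(knots, p):
    """Central position xi_k of each basis function: the mean of the p
    interior knots it spans (the Greville abscissa)."""
    K = len(knots) - p - 1
    return np.array([np.mean(knots[k + 1 : k + p + 1]) for k in range(K)])

def embed_tokens(coeffs, xi, Wc, Wp):
    """Map each token (c_k, xi_k) into R^d via two linear projections and
    concatenation; Wc, Wp (each of shape (d/2,)) stand in for the learned
    transformer input weights."""
    ec = np.outer(coeffs, Wc)                # coefficient channel, (K, d/2)
    ep = np.outer(xi, Wp)                    # position channel,    (K, d/2)
    return np.concatenate([ec, ep], axis=1)  # fixed-size tokens,   (K, d)
```

For a clamped knot vector the abscissae are strictly increasing from 0 to 1, so token order tracks time order regardless of where the adaptive knots landed.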
4. Hybrid Positional Encoding: LPE and L-RoPE
BSAT utilizes a hybrid positional encoding scheme to inject temporal information into the token stream:
- Learned Absolute Positional Encoding (LPE): Each token receives a learned embedding based on its rank order $k$,
$$\mathbf{e}_k \leftarrow \mathbf{e}_k + \mathbf{E}_{\mathrm{LPE}}[k],$$
maintaining relative and absolute token order within the segment.
- Layer-wise Learnable Rotary Positional Embedding (L-RoPE): Extending traditional RoPE, BSAT introduces a per-layer learnable log base $\beta_\ell$ for the rotary frequencies, enabling attention heads to operate with layer-specific temporal resolutions:
$$\omega_i^{(\ell)} = b_\ell^{-2i/d}, \qquad b_\ell = \exp(\beta_\ell),$$
with complex-plane rotational encodings applied to queries and keys before attention score computation. This approach equips each transformer layer to learn distinct temporal dependencies.
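The rotary mechanism can be sketched as follows; the per-layer scalar `log_base` is the learnable extension (vanilla RoPE fixes the base at 10000), while the pairwise-rotation layout is the standard RoPE construction:

```python
import numpy as np

def lrope(x, pos, log_base):
    """Rotary position encoding with a learnable log base (one scalar per
    layer in BSAT's L-RoPE). x: (n, d) queries or keys, d even;
    pos: (n,) token positions; frequencies omega_i = exp(log_base)^(-2i/d)."""
    n, d = x.shape
    freqs = np.exp(log_base) ** (-2.0 * np.arange(d // 2) / d)
    ang = np.asarray(pos, dtype=float)[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]       # split dimensions into 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin    # rotate each pair by pos * omega_i
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, norms are preserved, and query–key dot products depend only on the relative offset of the two positions, which is what makes the encoding compatible with attention scores.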
5. Computational Complexity and Compression Characteristics
Standard transformers scale quadratically with input length: $O(N^2)$ attention cost for $N$ tokens. BSAT reduces this by compressing an input sequence of length $L$ to $K \ll L$ tokens, with attention cost $O(K^2)$. The parameter $K$ (the number of basis functions and thereby tokens) is decoupled from $L$ and chosen according to the desired compression ratio, with no requirement for fixed uniform patching. Empirical protocols demonstrate that reducing the token count from $180$ to $45$ yields an 8-fold GPU memory reduction and a decrease of $30$% or more in epoch runtime, which is especially relevant in resource-constrained deployments.
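The quadratic scaling makes the attention-side saving easy to quantify. Note that the squared token ratio below concerns the attention score matrix alone; the reported 8-fold figure is end-to-end GPU memory, which also includes other activations and weights:

```python
def attention_cost_ratio(n_full, n_compressed):
    """Self-attention cost scales as O(N^2), so compressing the token
    sequence shrinks the score-matrix cost by the squared ratio."""
    return (n_full / n_compressed) ** 2

print(attention_cost_ratio(180, 45))  # 16.0: 4x fewer tokens, 16x smaller score matrix
```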
6. Empirical Assessment and Comparative Performance
Evaluation was performed on benchmark datasets ETTh1 (hourly temperature), Alabama PV 2006 (solar output), and Electricity Load Diagrams 2011–2014 (total load), with chronological 60/20/20 train/validation/test splits. Competing baselines included a Uniform Down-Sampled Transformer (UDS) and PatchTST. Hyperparameters were selected via Bayesian optimization across $200$ trials. Under extreme compression, BSAT with the hybrid L-RoPE+LPE encoding achieved the lowest median and minimum RMSE on the ETTh1 and Alabama PV benchmarks, outperforming the uniform down-sampling and patching baselines; on ETTh1 in particular, BSAT L-RoPE+LPE achieved lower RMSE than both UDS+LPE and PatchTST+LPE. Performance at higher compression ratios demonstrated BSAT’s robustness in preserving predictive accuracy with significantly reduced resource requirements.
7. Applications and Contextual Significance
A principal application of BSAT is in long-term time series forecasting where the dataset size or memory constraints render conventional transformer architectures infeasible. By aligning tokenization to semantic features and distributing computational effort adaptively across the signal, BSAT supports efficient modeling of temporally heterogeneous phenomena. The hybrid positional encoding scheme allows multi-layer, multi-resolution attention to temporal dependencies, bypassing the limitations of fixed stride patching and uniform down-sampling. Empirical results indicate that BSAT can match or exceed the accuracy of conventional baselines at a fraction of the computational cost; this suggests its relevance for deployment in real-time or resource-capped forecasting pipelines (Reinwardt et al., 2 Jan 2026).