B-Spline Adaptive Tokenizer (BSAT)
- The paper introduces BSAT as a parameter-free method that uses adaptive B-spline tokenization to efficiently compress time series while preserving key features.
- BSAT tokenizes data by fitting B-spline curves to fixed look-back windows and dynamically allocating knots in regions of high curvature.
- Its hybrid positional encoding, combining learned absolute and layer-wise rotary embeddings (L-RoPE), improves transformer-based forecasts, while the compressed token sequence reduces computational cost.
The B-Spline Adaptive Tokenizer (BSAT) is a parameter-free method for segmenting time series data into variable-length tokens by fitting B-spline approximations to fixed look-back windows. BSAT was developed to address limitations of transformer-based long-term forecasting, specifically the quadratic complexity of self-attention and the misalignment between uniform patching and underlying signal structure. By concentrating token placement in regions of high curvature and encoding both coefficient and position per basis, BSAT constructs fixed-size, semantically meaningful tokens, thereby enabling efficient compression and accurate retention of salient sequence features. Its design incorporates a hybrid positional encoding scheme with both learned absolute and layer-wise rotary embeddings (L-RoPE), facilitating multi-resolution temporal attention in transformer architectures. Empirical results demonstrate competitive forecasting performance and substantial reductions in memory and runtime, particularly at extreme compression rates (Reinwardt et al., 2 Jan 2026).
1. B-Spline Tokenization Mechanism
BSAT decomposes a time series segment into a parametrized B-spline curve. Given a normalized parameter domain $t \in [0, 1]$, a nondecreasing clamped knot vector is constructed as
$$\boldsymbol{\tau} = \big(\underbrace{0, \dots, 0}_{p+1},\; \tau_{p+1}, \dots, \tau_{K-1},\; \underbrace{1, \dots, 1}_{p+1}\big),$$
where $K$ is the number of basis functions and $p$ the spline degree (typically $p = 3$). The Cox–de Boor recursion yields the B-spline basis $B_{k,p}(t)$ by staged interpolation:
$$B_{k,0}(t) = \mathbf{1}[\tau_k \le t < \tau_{k+1}], \qquad B_{k,p}(t) = \frac{t - \tau_k}{\tau_{k+p} - \tau_k}\, B_{k,p-1}(t) + \frac{\tau_{k+p+1} - t}{\tau_{k+p+1} - \tau_{k+1}}\, B_{k+1,p-1}(t).$$
A least-squares fit to the $L$ observed sequence samples produces coefficients $\mathbf{c} \in \mathbb{R}^{K}$, potentially regularized by a ridge term $\lambda \ge 0$ if the system is ill-conditioned:
$$\mathbf{c} = \big(\mathbf{B}^\top \mathbf{B} + \lambda \mathbf{I}\big)^{-1} \mathbf{B}^\top \mathbf{y},$$
where $\mathbf{B} \in \mathbb{R}^{L \times K}$ with $\mathbf{B}_{jk} = B_{k,p}(t_j)$ and $\mathbf{y} = (y_1, \dots, y_L)^\top$. This compact representation enables adaptive summarization of the time series.
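The basis construction and ridge fit above can be sketched directly; the following is a minimal numpy implementation of the Cox–de Boor recursion and the regularized least-squares solve (function names and the right-endpoint handling are illustrative choices, not the paper's code):

```python
import numpy as np

def bspline_basis(t, knots, p):
    """Evaluate all K = len(knots) - p - 1 B-spline basis functions of
    degree p at the points t via the Cox-de Boor recursion.
    Returns the (len(t), K) design matrix B with B[j, k] = B_{k,p}(t_j)."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    # Degree 0: indicator of each half-open knot span.
    B = np.zeros((len(t), len(knots) - 1))
    for k in range(len(knots) - 1):
        B[:, k] = (knots[k] <= t) & (t < knots[k + 1])
    # Close the last non-degenerate span at the right endpoint.
    last = int(np.max(np.nonzero(np.diff(knots) > 0)[0]))
    at_end = np.isclose(t, knots[-1])
    B[at_end, :] = 0.0
    B[at_end, last] = 1.0
    # Cox-de Boor: raise the degree one step at a time.
    for d in range(1, p + 1):
        Bn = np.zeros((len(t), len(knots) - d - 1))
        for k in range(len(knots) - d - 1):
            left = knots[k + d] - knots[k]
            right = knots[k + d + 1] - knots[k + 1]
            if left > 0:
                Bn[:, k] += (t - knots[k]) / left * B[:, k]
            if right > 0:
                Bn[:, k] += (knots[k + d + 1] - t) / right * B[:, k + 1]
        B = Bn
    return B

def fit_coefficients(y, B, lam=0.0):
    """Ridge-regularized least squares: c = (B^T B + lam*I)^{-1} B^T y."""
    K = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(K), B.T @ y)
```

With a clamped cubic knot vector, the returned basis sums to one everywhere (partition of unity), and `fit_coefficients` recovers a smooth window to small residual even with $K \ll L$.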
2. Adaptive Knot Placement and Feature Preservation
BSAT allocates knots according to local signal curvature. The “mass” of curvature is computed on intervals $I_i = [t_i, t_{i+1}]$ via
$$m_i = \int_{I_i} |y''(t)|\, dt \;\approx\; |y_{i+1} - 2y_i + y_{i-1}|,$$
subject to clipping to control over-concentration:
$$\tilde{m}_i = \min(m_i,\, c_{\max}),$$
with $c_{\max}$ an upper clipping threshold. Knot positions are then selected via inversion of the cumulative distribution of $\tilde{m}_i$ at uniform quantiles, emphasizing high-curvature subsequences (e.g., peaks or rapid transitions). This methodology preserves critical features of the underlying time series that may be lost with uniform down-sampling.
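The placement step can be sketched as a discrete inverse-CDF lookup over the clipped curvature mass; in this sketch the quantile-based clipping rule and the small floor term are assumptions, not details from the paper:

```python
import numpy as np

def adaptive_internal_knots(y, n_internal, clip_quantile=0.95):
    """Place internal knots by inverting the cumulative distribution of a
    clipped curvature 'mass' at uniform quantiles (a sketch of BSAT's
    scheme; the clipping rule here is an assumed choice)."""
    m = np.abs(np.diff(y, 2))                         # |y_{i+1} - 2 y_i + y_{i-1}|
    m = np.minimum(m, np.quantile(m, clip_quantile))  # clip over-concentration
    m = m + 1e-12                                     # keep the CDF strictly increasing
    cdf = np.concatenate([[0.0], np.cumsum(m)])
    cdf /= cdf[-1]
    grid = np.linspace(0.0, 1.0, len(cdf))            # normalized parameter positions
    q = np.linspace(0.0, 1.0, n_internal + 2)[1:-1]   # uniform quantiles, ends excluded
    return np.interp(q, cdf, grid)                    # inverse-CDF lookup
```

On a signal with a single sharp bump, the returned knots cluster around the bump, which is exactly the feature-preservation behavior described above.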
3. Token Representation and Embedding Construction
Each B-spline basis function $B_{k,p}$ yields a token $(c_k, \xi_k)$, where $c_k$ is its fitted coefficient and the central position $\xi_k$ is computed as the Greville abscissa
$$\xi_k = \frac{\tau_{k+1} + \tau_{k+2} + \cdots + \tau_{k+p}}{p},$$
transposed to the time index domain. The token is embedded into a $d$-dimensional vector by applying separate linear transformations to $c_k$ and $\xi_k$ and concatenating their outputs:
$$\mathbf{e}_k = \big[\, W_c\, c_k \;\|\; W_p\, \xi_k \,\big] \in \mathbb{R}^{d},$$
where $W_c$ and $W_p$ are learned projections.
This configuration enables variable-length input segments to be faithfully and efficiently mapped into fixed-dimensional transformer-compatible sequences.
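A minimal sketch of the token construction, assuming the Greville-abscissa positions and equal-width ($d/2$) channels for coefficient and position (the weight shapes are illustrative):

```python
import numpy as np

def greville_abscissae(knots, p):
    """Central position xi_k of each basis function: the mean of the p
    interior knots it spans (the Greville abscissa)."""
    K = len(knots) - p - 1
    return np.array([np.mean(knots[k + 1 : k + p + 1]) for k in range(K)])

def embed_tokens(coeffs, xi, Wc, Wp):
    """Map each token (c_k, xi_k) into R^d via two linear projections and
    concatenation; Wc, Wp (each of shape (d/2,)) stand in for the learned
    transformer input weights."""
    ec = np.outer(coeffs, Wc)                # coefficient channel, (K, d/2)
    ep = np.outer(xi, Wp)                    # position channel,    (K, d/2)
    return np.concatenate([ec, ep], axis=1)  # fixed-size tokens,   (K, d)
```

For a clamped knot vector the abscissae are strictly increasing from 0 to 1, so token order tracks time order regardless of where the adaptive knots landed.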
4. Hybrid Positional Encoding: LPE and L-RoPE
BSAT utilizes a hybrid positional encoding scheme to inject temporal information into the token stream:
- Learned Absolute Positional Encoding (LPE): Each token receives a learned embedding based on its rank order $k$,
$$\mathbf{e}_k \leftarrow \mathbf{e}_k + \mathbf{E}_{\mathrm{LPE}}[k],$$
maintaining relative and absolute token order within the segment.
- Layer-wise Learnable Rotary Positional Embedding (L-RoPE): Extending traditional RoPE, BSAT introduces a per-layer learnable log base $\beta_\ell$ for the rotary frequencies, enabling attention heads to operate with layer-specific temporal resolutions:
$$\omega_i^{(\ell)} = b_\ell^{-2i/d}, \qquad b_\ell = \exp(\beta_\ell),$$
with complex-plane rotational encodings applied to queries and keys before attention score computation. This approach equips each transformer layer to learn distinct temporal dependencies.
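The rotary mechanism can be sketched as follows; the per-layer scalar `log_base` is the learnable extension (vanilla RoPE fixes the base at 10000), while the pairwise-rotation layout is the standard RoPE construction:

```python
import numpy as np

def lrope(x, pos, log_base):
    """Rotary position encoding with a learnable log base (one scalar per
    layer in BSAT's L-RoPE). x: (n, d) queries or keys, d even;
    pos: (n,) token positions; frequencies omega_i = exp(log_base)^(-2i/d)."""
    n, d = x.shape
    freqs = np.exp(log_base) ** (-2.0 * np.arange(d // 2) / d)
    ang = np.asarray(pos, dtype=float)[:, None] * freqs[None, :]  # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]       # split dimensions into 2-D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin    # rotate each pair by pos * omega_i
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, norms are preserved, and query–key dot products depend only on the relative offset of the two positions, which is what makes the encoding compatible with attention scores.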
5. Computational Complexity and Compression Characteristics
Standard transformers scale quadratically with input length: $O(N^2)$ attention cost for $N$ tokens. BSAT reduces this by compressing an input sequence of length $L$ to $K \ll L$ tokens, with attention cost $O(K^2)$. The parameter $K$ (the number of basis functions and thereby tokens) is decoupled from $L$ and chosen according to the desired compression ratio, with no requirement for fixed uniform patching. Empirical protocols demonstrate that reducing the token count from $180$ to $45$ yields an 8-fold GPU memory reduction and a decrease of $30$% or more in epoch runtime, which is especially relevant in resource-constrained deployments.
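The quadratic scaling makes the attention-side saving easy to quantify. Note that the squared token ratio below concerns the attention score matrix alone; the reported 8-fold figure is end-to-end GPU memory, which also includes other activations and weights:

```python
def attention_cost_ratio(n_full, n_compressed):
    """Self-attention cost scales as O(N^2), so compressing the token
    sequence shrinks the score-matrix cost by the squared ratio."""
    return (n_full / n_compressed) ** 2

print(attention_cost_ratio(180, 45))  # 16.0: 4x fewer tokens, 16x smaller score matrix
```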
6. Empirical Assessment and Comparative Performance
Evaluation was performed on benchmark datasets ETTh1 (hourly temperature), Alabama PV 2006 (solar output), and Electricity Load Diagrams 2011–2014 (total load), with chronological 60/20/20 train/validation/test splits. Competing baselines included a Uniform Down-Sampled Transformer (UDS) and PatchTST. Hyperparameters were selected via Bayesian optimization across $200$ trials. Under extreme compression, BSAT with the hybrid L-RoPE+LPE encoding achieved the lowest median and minimum RMSE on the ETTh1 and Alabama PV benchmarks, outperforming the uniform down-sampling and patching baselines; on ETTh1 in particular, BSAT L-RoPE+LPE achieved lower RMSE than both UDS+LPE and PatchTST+LPE. Performance at higher compression ratios demonstrated BSAT’s robustness in preserving predictive accuracy with significantly reduced resource requirements.
7. Applications and Contextual Significance
A principal application of BSAT is in long-term time series forecasting where the dataset size or memory constraints render conventional transformer architectures infeasible. By aligning tokenization to semantic features and distributing computational effort adaptively across the signal, BSAT supports efficient modeling of temporally heterogeneous phenomena. The hybrid positional encoding scheme allows multi-layer, multi-resolution attention to temporal dependencies, bypassing the limitations of fixed stride patching and uniform down-sampling. Empirical results indicate that BSAT can match or exceed the accuracy of conventional baselines at a fraction of the computational cost; this suggests its relevance for deployment in real-time or resource-capped forecasting pipelines (Reinwardt et al., 2 Jan 2026).