
Legendre Projection Unit (LPU) in Forecasting

Updated 16 March 2026
  • The Legendre Projection Unit (LPU) is a linear state-space module that uses projections onto Legendre polynomials to encode and compress historical time series data.
  • It integrates a Fourier Enhanced Layer (FEL) that filters out noise by retaining low-frequency components, leading to significant reductions in mean-squared error.
  • Low-rank factorization of the FEL parameters ensures computational efficiency and enables seamless integration within various deep learning backbones for forecasting.

The Legendre Projection Unit (LPU) is a linear state-space module that provides an efficient, compact encoding of temporal history in deep learning models for time series forecasting. Originating as a component of the FiLM (Frequency improved Legendre Memory Model) architecture, the LPU projects historical input windows onto the Legendre polynomial basis, yielding a compressed, structured memory representation. The mechanism is mathematically grounded in continuous-time and discrete-time expansions and integrates as a plug-in module within common neural backbones, preserving long-term dependencies and mitigating overfitting to noise, as substantiated in empirical benchmarks (Zhou et al., 2022).

1. Mathematical Foundation and State-Space Construction

The LPU is based on a Legendre expansion of observed temporal segments. Given observations of a scalar time series $f(t)$ over a sliding window $[t-\theta, t]$, the model approximates the windowed signal via

$$f(x)\big|_{x \in [t-\theta,\, t]} \approx g^{(t)}(x) = \sum_{n=0}^{N-1} c_n(t)\, P_n\!\left(\frac{2(x-t)}{\theta} + 1\right)$$

where $P_n(\cdot)$ denotes the $n$-th Legendre polynomial on $[-1, 1]$ (the argument $\tfrac{2(x-t)}{\theta} + 1$ maps the window into this interval).

The vector of coefficients $\mathbf{c}(t) \in \mathbb{R}^N$ evolves dynamically:
$$\frac{d}{dt}\,\mathbf{c}(t) = -\frac{1}{\theta} A\, \mathbf{c}(t) + \frac{1}{\theta} B\, f(t)$$
where the entries of $A \in \mathbb{R}^{N \times N}$ and $B \in \mathbb{R}^{N \times 1}$ are, respectively,

$$A_{n,k} = (2n+1)\begin{cases} (-1)^{n-k}, & k \le n \\ 1, & k > n \end{cases}, \qquad B_n = (2n+1)(-1)^n$$

Direct discretization (Tustin or zero-order hold, with step $\Delta = 1$) yields the update equation

$$\mathbf{C}_{t+1} = A_d\, \mathbf{C}_t + B_d\, x(t)$$

where $\mathbf{C}_t$ is the memory state and $x(t) = f(t)$ is the current input.
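The construction above can be sketched in NumPy. The matrix formulas and the bilinear (Tustin) discretization follow the equations in this section; function names, the choice $N = 8$, $\theta = 64$, and the constant test signal are illustrative assumptions:

```python
import numpy as np

def lpu_matrices(N):
    """Continuous-time LPU matrices: A (N x N) and B (N,) as defined above."""
    n = np.arange(N)[:, None]  # row index
    k = np.arange(N)[None, :]  # column index
    A = (2 * n + 1) * np.where(k <= n, (-1.0) ** (n - k), 1.0)
    B = (2 * np.arange(N) + 1) * (-1.0) ** np.arange(N)
    return A, B

def discretize_bilinear(A, B, theta, delta=1.0):
    """Tustin discretization of dc/dt = -(1/theta) A c + (1/theta) B f."""
    Abar, Bbar = -A / theta, B / theta
    I = np.eye(A.shape[0])
    M = np.linalg.inv(I - (delta / 2) * Abar)
    return M @ (I + (delta / 2) * Abar), delta * (M @ Bbar)

def lpu_encode(x, Ad, Bd):
    """Run the recurrence C_{t+1} = Ad C_t + Bd x(t); return the final state."""
    C = np.zeros(Ad.shape[0])
    for xt in x:
        C = Ad @ C + Bd * xt
    return C

A, B = lpu_matrices(8)
Ad, Bd = discretize_bilinear(A, B, theta=64.0)
C = lpu_encode(np.ones(4000), Ad, Bd)
# For a constant input f = 1, the state approaches e_0 = (1, 0, ..., 0):
# the window is exactly the constant Legendre polynomial P_0.
```

Note that $A \mathbf{e}_0 = B$ holds exactly for these matrices, so the constant signal is a fixed point of the dynamics, which gives a quick sanity check of any implementation.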

Reconstruction of (compressed) signal segments is achieved via

$$\hat{x}^{(t)}(x) = \sum_{n=0}^{N-1} C_{t,n}\, P_n\!\left(\frac{2(x-t)}{\theta} + 1\right),$$
where $C_{t,n}$ denotes the $n$-th entry of $\mathbf{C}_t$,

or, equivalently, $\hat{x} = \mathbf{C}_t^\top R$, where $R$ is the precomputed matrix of Legendre polynomial values at the sampled locations.

The approximation error decays as $\mathcal{O}\!\left(\frac{\theta L}{\sqrt{N}}\right)$ for $L$-Lipschitz $f$, and as $\mathcal{O}\!\left(\theta^k N^{-k+1/2}\right)$ for functions with $k$ bounded derivatives.
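This error decay can be checked numerically with least-squares Legendre fits of increasing order. This is a sketch using NumPy's `numpy.polynomial.legendre` helpers rather than the LPU recurrence itself, and the test signal is an arbitrary smooth function chosen for illustration:

```python
import numpy as np
from numpy.polynomial import legendre

# Rescaled window coordinate s in [-1, 1] and a smooth test signal.
s = np.linspace(-1.0, 1.0, 513)
f = np.sin(3.0 * s) + 0.5 * s ** 2

def recon_error(N):
    """Max reconstruction error of the best degree-(N-1) Legendre fit."""
    coeffs = legendre.legfit(s, f, deg=N - 1)
    return float(np.max(np.abs(legendre.legval(s, coeffs) - f)))

err4, err16 = recon_error(4), recon_error(16)
assert err16 < err4  # higher Legendre order -> smaller approximation error
```

For smooth signals the error falls off very quickly with $N$, consistent with the $N^{-k+1/2}$ rate quoted above for functions with many bounded derivatives.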

2. Frequency Filtering and the Fourier Enhanced Layer

By construction, the LPU retains both trend and noise. FiLM therefore augments the LPU with a data-driven noise-suppression mechanism applied after the memory computation: the Fourier Enhanced Layer (FEL). The FEL applies the discrete Fourier transform to each coefficient channel $c_n(1), \ldots, c_n(T)$, retains only the lowest $M$ frequencies, optionally weights them with learnable complex weights $W_{n,m}$, and inverts via the inverse DFT:
$$\widetilde{c}_n[k] = \sum_{t=1}^{T} c_n(t)\, e^{-2\pi i t k / T}, \qquad \hat{c}_n(t) = \frac{1}{T} \sum_{k=0}^{T-1} \widetilde{c}_n[k]\, e^{+2\pi i t k / T}.$$
Retaining only low-frequency modes exploits the spectral bias of real signals toward low frequencies and suppresses spectrally flat white noise. Theoretical analysis shows that, with proper low-frequency selection, the energy of the retained subspace is almost fully preserved, and empirical ablations confirm the layer's importance for forecast accuracy.
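A minimal sketch of the FEL's frequency truncation for a single coefficient channel, using NumPy's real FFT; the learnable complex weights are left as an optional argument, and the clean/noisy signals are synthetic:

```python
import numpy as np

def fourier_enhanced_layer(c, M, weights=None):
    """Keep the lowest M rFFT modes of a (T,) channel (optionally scaled
    by complex weights, which would be learned in practice) and invert."""
    spec = np.fft.rfft(c)
    keep = np.zeros_like(spec)
    keep[:M] = spec[:M] if weights is None else spec[:M] * weights
    return np.fft.irfft(keep, n=len(c))

rng = np.random.default_rng(0)
T = 256
t = np.arange(T)
clean = np.sin(2 * np.pi * 3 * t / T)          # low-frequency signal
noisy = clean + 0.5 * rng.standard_normal(T)   # spectrally flat noise
filtered = fourier_enhanced_layer(noisy, M=8)

mse_noisy = float(np.mean((noisy - clean) ** 2))
mse_filtered = float(np.mean((filtered - clean) ** 2))
assert mse_filtered < mse_noisy  # low-pass removes most of the white noise
```

Because white noise spreads its energy evenly across all $T$ frequency bins while the signal concentrates in the lowest few, keeping $M \ll T$ modes discards most of the noise energy but little of the signal.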

3. Low-Rank Parameterization and Computational Efficiency

Naively, the FEL would require a learnable tensor $W \in \mathbb{R}^{M \times N \times N}$, which is prohibitive to store for high-dimensional projections. To keep this tractable, FiLM factorizes $W$ with a low-rank decomposition:
$$W_{m,n,p} \approx \sum_{a=1}^{r} W^1_{m,a}\, W^2_{a,n}\, W^3_{a,p}, \qquad r \ll N.$$
This reduces storage from $O(MN^2)$ to $O(Mr + 2rN)$, and empirical studies show negligible degradation in MSE (at most 1–2%) even at extreme compression: with $N = 256$, $M = 32$, $r = 4$, the factorization accounts for about 0.4% of the naive parameter count. These values ($N = 256$, $M = 32$, $r = 4$) are the recommended defaults.
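The factorization and its parameter savings can be sketched as follows. The factor tensors here are random placeholders standing in for learned weights, and the `einsum` applies the factorized weight to a (hypothetical) frequency-domain coefficient array without ever materializing the full $M \times N \times N$ tensor:

```python
import numpy as np

M_modes, N, r = 32, 256, 4
rng = np.random.default_rng(0)

# Factor tensors W^1, W^2, W^3 (random stand-ins for learned weights).
W1 = rng.standard_normal((M_modes, r))
W2 = rng.standard_normal((r, N))
W3 = rng.standard_normal((r, N))

# Apply W[m,n,p] = sum_a W1[m,a] W2[a,n] W3[a,p] to coefficients
# Ctilde[m,p] directly, contracting over a and p in one einsum.
Ctilde = rng.standard_normal((M_modes, N))
out = np.einsum('ma,an,ap,mp->mn', W1, W2, W3, Ctilde)

naive = M_modes * N * N            # parameters of the full tensor
lowrank = M_modes * r + 2 * r * N  # parameters of the factorization
assert lowrank < naive
```

With these defaults the factorization stores 2,176 parameters against 2,097,152 for the full tensor, which is the kind of extreme compression the ablations report as nearly lossless.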

4. Architecture, Integration, and Hyperparameters

In FiLM, a complete forward pass for time series data $X \in \mathbb{R}^{T \times D}$ proceeds as

  • (Optional) RevIN normalization,
  • LPU encoding,
  • FEL for noise removal,
  • LPU_R (inverse of LPU) for reconstruction,
  • (Optional) RevIN denormalization,
  • Forecast output $\hat{Y} \in \mathbb{R}^{T \times D}$.

A multi-scale mixture-of-experts design runs three parallel FiLM blocks on input windows of length $T$, $2T$, and $4T$, each predicting the next $T$ steps, and linearly combines their predictions.
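The multi-scale mixture can be sketched with placeholder experts. Here each "expert" is a trivial window-mean forecaster standing in for a full FiLM block, and the mixture weights are fixed by hand where in practice they would be learned:

```python
import numpy as np

T = 24
rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(4 * T))  # toy history of length 4T

def placeholder_expert(window, horizon):
    """Stand-in for a FiLM block: forecast the window mean for T steps."""
    return np.full(horizon, window.mean())

# Three experts see windows of length T, 2T, 4T; each emits T steps.
forecasts = np.stack(
    [placeholder_expert(series[-s:], T) for s in (T, 2 * T, 4 * T)]
)
mix = np.array([0.5, 0.3, 0.2])                 # learned weights in practice
y_hat = np.einsum('e,et->t', mix, forecasts)    # linear combination
assert y_hat.shape == (T,)
```

The design choice is that longer windows let an expert capture slower trends, while the shortest window reacts quickly; the linear combination arbitrates between them per dataset.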

The recommended hyperparameters are summarized as follows:

| Component | Default Value | Purpose |
| --- | --- | --- |
| Legendre order $N$ | 256 | LPU memory size |
| Fourier modes $M$ | 32 | FEL frequency truncation |
| FEL low-rank $r$ | 4 | Parameter efficiency |
| Mixture scales | $\{T, 2T, 4T\}$ | Multi-scale experts |

5. Training Protocols and Objectives

Training primarily uses the mean-squared-error (MSE) loss:
$$L_{\text{MSE}} = \frac{1}{DT} \sum_{d=1}^{D} \sum_{t=1}^{T} \left(Y_{t,d} - \hat{Y}_{t,d}\right)^2$$
Optionally, $L_2$ weight decay can be applied. Optimization uses Adam with learning rates between $10^{-4}$ and $10^{-3}$ and batch sizes up to 256 (32–64 on 16 GB GPUs), trained for 15 epochs, selecting the model with the best validation MSE. No curriculum or special warm-up scheduling is required.
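The loss is a plain per-element mean over the $T \times D$ forecast, which a minimal sketch makes explicit (the toy targets here are illustrative):

```python
import numpy as np

def mse_loss(Y, Y_hat):
    """L_MSE = (1 / (D*T)) * sum over t and d of (Y[t,d] - Y_hat[t,d])^2."""
    return float(np.mean((Y - Y_hat) ** 2))

Y = np.array([[1.0, 2.0], [3.0, 4.0]])      # T = 2 steps, D = 2 channels
Y_hat = np.array([[1.0, 2.5], [2.0, 4.0]])
# Squared errors: 0, 0.25, 1.0, 0 -> mean = 0.3125
assert mse_loss(Y, Y_hat) == 0.3125
```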

6. Empirical Performance and Ablations

On the multivariate benchmarks ETTm1, ETTm2, Electricity, Traffic, Exchange, Weather, and ILI, across forecast horizons $\{96, 192, 336, 720\}$, FiLM achieves a 20.3% average reduction in MSE versus the FEDformer Transformer baseline; univariate experiments yield a 22.6% reduction.

Ablation studies demonstrate:

  • Substituting the LPU with an equally sized linear layer degrades MSE by 30–100% (Table 3). Across MLP, LSTM, CNN, and self-attention backbones, adding the LPU yields relative improvements of 8% to 119%.
  • Replacing the FEL with an MLP, LSTM, CNN, or vanilla attention layer increases MSE by up to 300%.
  • Compressing the FEL via low-rank factorization down to 0.4% of the original parameters increases MSE by at most 1–2% (Table 5).
  • Selecting the lowest $M$ Fourier modes is the most stable choice; mixing in random high-frequency modes helps marginally on select datasets (Table 6).
  • The multi-scale mixture yields up to a 19% improvement; the benefit of RevIN is dataset-dependent.
  • Computational complexity is linear in input length, and FiLM trains 1–2× faster per epoch than state-of-the-art Transformer baselines on long sequences (V100 32 GB; FiLM batch size 256 vs. FEDformer batch size 32), while also being more accurate.

7. Practical Use and Modular Integration

The LPU can serve as a front-end compression module, paired with its decoder (LPU →\rightarrow LPU_R), for any backbone—MLP, RNN, TCN, or Transformer-based—facilitating long-term historical retention. The FEL can function as a general denoising layer between other modules, employing FFT-based filtering and learnable frequency weighting.

Low-rank factorization of the FEL should be tuned via cross-validation; values of $r$ in the range 4–16 commonly yield effective trade-offs between capacity and efficiency. For architectures that already separate seasonal and trend components (e.g., Autoformer, FEDformer), the FiLM block can substitute for the trend module. Initial experiments should deploy the default FiLM block and subsequently assess the contributions of isolated LPU or FEL layers.

By combining provably accurate Legendre-memory encoding, learned spectral-noise filtering, and efficient parametrization, the Legendre Projection Unit (in the context of FiLM) establishes a lightweight, high-performing, and modular paradigm for long-term forecasting (Zhou et al., 2022).
