Legendre Projection Unit (LPU) in Forecasting
- The Legendre Projection Unit (LPU) is a linear state-space module that uses projections onto Legendre polynomials to encode and compress historical time series data.
- It integrates a Fourier Enhanced Layer (FEL) that filters out noise by retaining low-frequency components, leading to significant reductions in mean-squared error.
- Low-rank factorization of the FEL parameters ensures computational efficiency and enables seamless integration within various deep learning backbones for forecasting.
The Legendre Projection Unit (LPU) is a linear state-space module that provides an efficient, compact encoding of temporal history in deep learning models for time series forecasting. Originating as a component of the FiLM (Frequency improved Legendre Memory model) architecture, the LPU projects historical input windows onto the Legendre polynomial basis, yielding a compressed, structured memory representation. This mechanism is mathematically grounded in continuous-time and discrete-time expansions and integrates seamlessly as a plugin module within common neural backbones, preserving long-term dependencies and mitigating overfitting to noise, as substantiated in empirical benchmarks (Zhou et al., 2022).
1. Mathematical Foundation and State-Space Construction
The LPU is based on a Legendre expansion of observed temporal segments. Given observations of a scalar time series $f$ over a sliding window $[t-\theta, t]$, the model approximates the windowed signal via

$$f(x) \approx \sum_{n=0}^{N-1} c_n(t)\, P_n\!\left(\frac{2(x-t)}{\theta} + 1\right), \qquad x \in [t-\theta,\, t],$$

where $P_n$ denotes the $n$-th Legendre polynomial on $[-1, 1]$.
The vector of coefficients $c(t) \in \mathbb{R}^N$ evolves dynamically as

$$\frac{d}{dt}\, c(t) = -\frac{1}{\theta} A\, c(t) + \frac{1}{\theta} B\, f(t),$$

where the entries of $A$ and $B$ are, respectively,

$$A_{nk} = (2n+1)\begin{cases} (-1)^{n-k} & n \ge k, \\ 1 & n < k, \end{cases} \qquad B_n = (2n+1)(-1)^n.$$
Direct discretization (Tustin or zero-order hold, with step $\Delta$) provides the update equation

$$c_t = \bar{A}\, c_{t-1} + \bar{B}\, x_t,$$

where $c_t \in \mathbb{R}^N$ is the memory state, $x_t$ is the current input, and $(\bar{A}, \bar{B})$ are the discretized transition matrices.
Reconstruction of (compressed) signal segments is achieved via

$$\hat{f}(x) = \sum_{n=0}^{N-1} c_n(t)\, P_n\!\left(\frac{2(x-t)}{\theta} + 1\right),$$

or equivalently $\hat{f} = G\, c_t$ for the precomputed matrix $G$ of Legendre polynomials evaluated at the sampled locations.
The approximation error decays as $O(\theta L/\sqrt{N})$ for $L$-Lipschitz signals and as $O(\theta^k N^{-k+1/2})$ for signals with bounded $k$-th derivatives.
2. Frequency Filtering and the Fourier Enhanced Layer
The LPU by construction retains both trend and noise. FiLM therefore augments the LPU with a data-driven noise-suppression mechanism applied after the memory computation: the Fourier Enhanced Layer (FEL). The FEL applies the discrete Fourier transform to each channel of the memory representation, retains only the $M$ lowest frequencies, optionally weights them with learnable complex weights $W$, and inverts via the inverse DFT:

$$\tilde{C} = \mathcal{F}^{-1}\!\big(W \odot \operatorname{Trunc}_M(\mathcal{F}(C))\big).$$

Retention of low-frequency modes leverages the spectral bias of real signals toward lower frequencies and suppresses spectrally flat white noise. Theoretical analysis establishes that with proper low-frequency selection the energy of the retained subspace is almost preserved, and empirical ablations confirm its importance for forecast accuracy.
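As a minimal illustration of the FEL's filtering step (a sketch, not FiLM's implementation: the identity weights below stand in for the learnable complex weights), the following NumPy snippet keeps the lowest $M$ rFFT modes of a noisy signal and inverts the transform:

```python
import numpy as np

def fourier_enhanced_layer(x, weights, M):
    # Keep the M lowest rFFT modes, scale them by (learnable) complex
    # weights, zero the remaining modes, and invert the transform.
    X = np.fft.rfft(x)
    out = np.zeros_like(X)
    out[:M] = X[:M] * weights[:M]
    return np.fft.irfft(out, n=len(x))

rng = np.random.default_rng(0)
T, M = 128, 8
t = np.linspace(0, 1, T, endpoint=False)
clean = np.sin(2 * np.pi * 3 * t)             # low-frequency content (bin 3 < M)
noisy = clean + 0.3 * rng.standard_normal(T)  # add spectrally flat white noise
w = np.ones(M, dtype=complex)                 # identity weights = pure low-pass
denoised = fourier_enhanced_layer(noisy, w, M)
print(np.mean((denoised - clean) ** 2) < np.mean((noisy - clean) ** 2))
```

Because the signal energy sits below mode $M$ while the noise energy is spread across all modes, truncation removes most of the noise while leaving the signal nearly intact.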
3. Low-Rank Parameterization and Computational Efficiency
Naively, the FEL would require a learnable complex tensor $W \in \mathbb{C}^{N \times N \times M}$, yielding prohibitive storage for high-dimensional projections. To ensure computational tractability, FiLM factorizes $W$ using a low-rank decomposition with rank $r \ll N$, reducing the parameter count from $O(N^2 M)$ to a count linear in $N$. Empirical studies show negligible degradation in MSE (at most 1–2%) even at extreme compression (e.g., $r = 4$, a small fraction of the naive parameter count). The recommended defaults are $N = 256$, $M = 32$, $r = 4$.
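A quick back-of-the-envelope comparison makes the savings concrete. The three-factor shapes below are hypothetical (the paper's exact factorization may differ); the point is that the factorized count grows linearly in $N$ while the naive tensor grows quadratically:

```python
# Parameter counts for the naive FEL weight tensor versus an assumed
# three-factor low-rank decomposition W ≈ W1 · W2 · W3 (shapes hypothetical).
N, M, r = 256, 32, 4

full_params = N * N * M                        # W in C^{N x N x M}
low_rank_params = N * r + r * r * M + r * N    # W1:(N,r), W2:(r,r,M), W3:(r,N)

print(full_params, low_rank_params)
print(low_rank_params / full_params)           # compression ratio
```

With the default settings this comes to roughly a thousandth of the naive parameter count.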
4. Architecture, Integration, and Hyperparameters
In FiLM, a complete forward pass for time series data proceeds as
- (Optional) RevIN normalization,
- LPU encoding,
- FEL for noise removal,
- LPU_R (inverse of LPU) for reconstruction,
- (Optional) RevIN denormalization,
- Forecast output $\hat{x}$.
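The optional RevIN steps can be sketched as reversible per-window normalization (a minimal version, with RevIN's learnable affine parameters omitted):

```python
import numpy as np

def revin_normalize(x, eps=1e-5):
    # Reversible instance normalization: remove per-window mean/std and
    # return the statistics so the forecast can be denormalized later.
    mu, sigma = x.mean(), x.std() + eps
    return (x - mu) / sigma, (mu, sigma)

def revin_denormalize(y, stats):
    # Invert the normalization using the stored statistics.
    mu, sigma = stats
    return y * sigma + mu

x = np.array([10.0, 12.0, 11.0, 13.0])
z, stats = revin_normalize(x)
x_back = revin_denormalize(z, stats)
print(np.allclose(x, x_back))
```

In the full pipeline, the model operates on the normalized window `z`, and the stored statistics are applied to the model's output rather than to the input itself.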
A multi-scale mixture-of-experts design can be realized by running three parallel FiLM blocks on input windows of three different lengths, each predicting the same future steps, then linearly combining the predictions. RevIN normalization is optional.
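The mixture step itself reduces to a learned linear combination of per-scale forecasts. In this toy sketch, the three expert outputs are random stand-ins for FiLM blocks run on windows of different lengths:

```python
import numpy as np

H = 24                                      # forecast horizon
rng = np.random.default_rng(1)
# Stand-ins for H-step forecasts from three FiLM blocks at different scales.
expert_preds = rng.standard_normal((3, H))
mix_weights = np.array([0.5, 0.3, 0.2])     # learnable combination weights
forecast = mix_weights @ expert_preds       # linear combination, shape (H,)
print(forecast.shape)
```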
The recommended hyperparameters are summarized as follows:
| Component | Default Value | Purpose |
|---|---|---|
| Legendre order $N$ | 256 | LPU memory size |
| Fourier modes $M$ | 32 | FEL frequency truncation |
| FEL low-rank $r$ | 4 | Parameter efficiency |
| Mixture scales | 3 | Multi-scale experts |
5. Training Protocols and Objectives
Training objectives primarily employ the mean-squared-error (MSE) loss

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{H} \sum_{i=1}^{H} \big( \hat{x}_{t+i} - x_{t+i} \big)^2,$$

optionally with weight decay. Optimization utilizes Adam with dataset-dependent learning rates, batch sizes up to 256 (32–64 for 16 GB GPUs), across 15 epochs, selecting the model with the best validation MSE. No curriculum or special warm-up scheduling is required.
6. Empirical Performance and Ablations
On the standard multivariate benchmarks (ETTm1, ETTm2, Electricity, Traffic, Exchange, Weather, ILI) across multiple forecast horizons, FiLM achieves a 20.3% average reduction in MSE versus the FEDformer Transformer baseline. Univariate experiments yield a 22.6% MSE reduction.
Ablation studies demonstrate:
- Substituting the LPU with an equally sized linear layer degrades performance (Table 3). Across MLP, LSTM, CNN, and Self-Attention backbones, LPU integration provides consistent relative improvements.
- Replacing the FEL with an MLP, LSTM, CNN, or vanilla attention layer increases MSE.
- Compressing the FEL via low-rank factorization to a small fraction of the original parameters increases MSE only marginally (Table 5).
- Selection of the lowest $M$ Fourier modes is most stable; mixing in random high-frequency modes helps marginally on select datasets (Table 6).
- Multiscale mixture yields up to 19% improvement; RevIN filtering is dataset-dependent.
- Computational complexity is linear in input length, and FiLM trains faster per epoch than state-of-the-art Transformer baselines on long sequences (V100 32 GB; FiLM batch size 256 vs. FEDformer batch size 32), while also being more accurate.
7. Practical Use and Modular Integration
The LPU can serve as a front-end compression module, paired with its decoder LPU_R, for any backbone (MLP, RNN, TCN, or Transformer-based), facilitating long-term historical retention. The FEL can function as a general denoising layer between other modules, employing FFT-based filtering and learnable frequency weighting.
Low-rank factorization of the FEL should be tuned via cross-validation, with ranks $r$ in the range 4–16 commonly yielding effective tradeoffs between capacity and efficiency. For architectures that already partition seasonal and trend components (e.g., Autoformer, FEDformer), the FiLM block can substitute for the trend module. Initial experiments should deploy the default FiLM block and can subsequently assess the contributions of isolated LPU or FEL layers.
By combining provably accurate Legendre-memory encoding, learned spectral-noise filtering, and efficient parametrization, the Legendre Projection Unit (in the context of FiLM) establishes a lightweight, high-performing, and modular paradigm for long-term forecasting (Zhou et al., 2022).