
Frequency Improved Legendre Memory Model (FiLM)

Updated 16 March 2026
  • FiLM integrates Legendre polynomial-based memory with Fourier denoising and low-rank methods, achieving up to 22.6% MSE reduction in forecasts.
  • Its modular design allows easy plug-in with existing models, boosting performance on both multivariate and univariate benchmarks.
  • Empirical evaluations demonstrate significant efficiency gains with 80% fewer parameters and linear scaling in memory usage and training time.

The Frequency Improved Legendre Memory Model (FiLM) is a neural architecture for long-term time series forecasting that integrates Legendre polynomial projections, Fourier-based denoising, and low-rank parameterization. FiLM systematically enhances the representation and utilization of historical information within deep time-series models, delivering accuracy and efficiency gains over contemporary alternatives such as FEDformer, Autoformer, and S4. Its modular design enables direct integration as a plug-in layer for existing deep learning forecasters, and empirical results demonstrate significant improvements in both multivariate and univariate forecasting benchmarks (Zhou et al., 2022).

1. Legendre Memory Model: Theoretical Foundations

FiLM builds upon the Legendre Memory Model (LMM), which encodes the recent history of an input time series $x(t)$ via projection onto a fixed number of shifted-and-scaled Legendre polynomials. For a time window $[t-\theta, t]$, the model compresses the historical segment into a vector of coefficients $c(t) \in \mathbb{R}^N$ as

$$c_n(t) = \left\langle x(s),\; P_n\!\left(\frac{2(s-t)}{\theta} + 1\right) \right\rangle, \quad n = 0, \dots, N-1,$$

where $P_n$ denotes the Legendre polynomial of degree $n$.

The coefficient dynamics follow the ODE

$$\frac{d}{dt}c(t) = -\frac{1}{\theta}A c(t) + \frac{1}{\theta}B x(t),$$

with $A, B$ determined by the Legendre recurrence. Bilinear discretization yields the update $c_t = A_d c_{t-1} + B_d x_t$, where

$$A_d = \left(I+\frac{\Delta t}{2\theta}A\right)^{-1}\left(I-\frac{\Delta t}{2\theta}A\right), \quad B_d = \left(I+\frac{\Delta t}{2\theta}A\right)^{-1} \frac{\Delta t}{\theta} B.$$

Analytic forms for $A$ and $B$ are given by

$$A_{n,k} = (2n+1)\begin{cases} (-1)^{n-k}, & k \leq n \\ 1, & k > n \end{cases}, \quad B_n = (2n+1)(-1)^n.$$

At inference, an approximation of the original signal can be reconstructed as

$$\hat{x}(s) = \sum_{n=0}^{N-1} c_n(t)\, P_n\!\left(\frac{2(s-t)}{\theta} + 1\right).$$
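As a concrete illustration, the matrices above and the discretized recurrence can be sketched in NumPy (function names and the choice of $\Delta t$ are ours, not the paper's):

```python
import numpy as np

def lmm_matrices(N):
    """State matrices from the Legendre recurrence:
    A[n, k] = (2n+1) * (-1)^(n-k) for k <= n, and (2n+1) for k > n;
    B[n] = (2n+1) * (-1)^n."""
    n = np.arange(N)
    A = np.where(n[:, None] >= n[None, :],
                 (-1.0) ** (n[:, None] - n[None, :]), 1.0)
    A = A * (2 * n[:, None] + 1)
    B = (2 * n + 1.0) * (-1.0) ** n
    return A, B

def discretize(A, B, theta, dt):
    """Bilinear discretization of dc/dt = -(1/theta) A c + (1/theta) B x."""
    I = np.eye(A.shape[0])
    Minv = np.linalg.inv(I + (dt / (2 * theta)) * A)
    Ad = Minv @ (I - (dt / (2 * theta)) * A)
    Bd = Minv @ ((dt / theta) * B)
    return Ad, Bd

def lmm_encode(x, Ad, Bd):
    """Run c_t = Ad c_{t-1} + Bd x_t over a 1-D signal, returning the
    final coefficient vector c(t)."""
    c = np.zeros(Ad.shape[0])
    for xt in x:
        c = Ad @ c + Bd * xt
    return c
```

The final state `c` summarizes the most recent window of length $\theta$ and can be decoded with the Legendre reconstruction formula above.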

2. Frequency Improvement via Fourier-Based Denoising

While Legendre projection preserves all frequencies, including noise, FiLM introduces a Fourier-based denoising module, the Frequency Enhanced Layer (FEL). For each feature channel, an FFT is computed along the Legendre-index axis:

$$\mathcal{F}\{C\}[k] = \sum_{n=0}^{N-1} C[n]\, e^{-2\pi i k n/N}, \quad k = 0, \dots, \lfloor N/2 \rfloor.$$

Only the lowest $M$ modes are retained, weighted by learnable parameters $W[k]$:

$$\widetilde{C}_f[k] = W[k]\,\mathcal{F}\{C\}[k], \quad k = 0, \dots, M-1.$$

Higher modes ($k \geq M$) are zeroed, and an inverse FFT reconstructs a denoised memory representation:

$$C'(n) = \sum_{k=0}^{M-1} \widetilde{C}_f[k]\, e^{2\pi i k n/N}.$$

This suppresses high-frequency noise while preserving the salient long-term components.
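A minimal single-channel sketch of this filtering step, using NumPy's real FFT (in FiLM the weights $W$ are learned; here they are simply passed in):

```python
import numpy as np

def frequency_enhanced_layer(C, W, M):
    """FEL sketch for one channel: FFT along the Legendre-index axis,
    keep the lowest M modes scaled by weights W (shape (M,)), zero the
    rest, and inverse-FFT back to a denoised representation C'."""
    N = C.shape[0]
    Cf = np.fft.rfft(C)           # modes k = 0 .. N//2
    Cf[:M] *= W                   # weight the retained low-frequency modes
    Cf[M:] = 0.0                  # discard high-frequency (noise) modes
    return np.fft.irfft(Cf, n=N)  # denoised memory representation C'
```

With `W` set to all ones this reduces to an ideal low-pass filter over the coefficient sequence; the learned weights additionally reshape the retained spectrum.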

3. Low-Rank Parameterization for Efficiency

Naïvely, the learnable weight tensor $W \in \mathbb{R}^{D \times M \times D}$ grows prohibitively large for high-dimensional problems. FiLM addresses this via tensor factorization, $W \approx W_2 W_1 W_0$, with $W_0 \in \mathbb{R}^{D \times r}$, $W_1 \in \mathbb{R}^{r \times r \times M}$, $W_2 \in \mathbb{R}^{r \times D}$, and rank $r \ll D$. For each mode $k$:

$$\widetilde{C}_f[k] = W_2^\top\left(W_1[k]\left(W_0^\top \mathcal{F}\{C\}[k]\right)\right).$$

This reduces the parameter count from $O(D^2 M)$ to $O(rD + r^2 M + rD)$. Empirically, $r=4$ (0.41% of the full size) yields negligible loss in MSE; even $r=1$ provides strong compression with only a minor performance reduction.
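A toy NumPy illustration of the factorization and the resulting parameter savings (sizes and names are illustrative; the weights are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
D, M, r = 8, 4, 2  # channels, retained Fourier modes, rank (toy sizes)

# A full weight tensor would hold D * M * D entries; the factorization
# stores two shared projections plus one small r x r core per mode.
W0 = rng.standard_normal((D, r))     # shared input projection
W1 = rng.standard_normal((M, r, r))  # per-mode core
W2 = rng.standard_normal((r, D))     # shared output projection

def apply_mode(Fk, k):
    """Apply the factorized weight to one Fourier mode:
    C~_f[k] = W2^T (W1[k] (W0^T F{C}[k]))."""
    return W2.T @ (W1[k] @ (W0.T @ Fk))

full_params = D * M * D                        # dense equivalent
factored_params = W0.size + W1.size + W2.size  # low-rank storage
```

Because $W_0$ and $W_2$ are shared across all $M$ modes, the per-mode cost is only the $r \times r$ core, which is where the $O(r^2 M)$ term comes from.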

4. Model Architecture and Training Protocols

4.1 Single-Layer Block

A one-layer FiLM block consists of:

  • Legendre Projection Unit (LPU): Produces Legendre coefficient sequence CC.
  • Frequency Enhanced Layer (FEL): Applies the Fourier mask described above, yielding denoised C′C'.
  • LPU_R: Reconstructs the forecast using the inverse Legendre-basis mapping.
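Putting the three units together, a toy end-to-end pass over one series might look like the following (all learnable weights fixed to 1, $\Delta t = 1$; a sketch under these assumptions, not the reference implementation):

```python
import numpy as np

def film_block(x, N=16, M=4, theta=None):
    """Toy one-layer FiLM block on a 1-D series x: LPU -> FEL -> LPU_R."""
    T = len(x)
    theta = theta or T
    # LPU: build the Legendre state matrices and run the recurrence.
    n = np.arange(N)
    A = np.where(n[:, None] >= n[None, :],
                 (-1.0) ** (n[:, None] - n[None, :]), 1.0) * (2 * n[:, None] + 1)
    B = (2 * n + 1.0) * (-1.0) ** n
    I = np.eye(N)
    Minv = np.linalg.inv(I + A / (2 * theta))
    Ad, Bd = Minv @ (I - A / (2 * theta)), Minv @ (B / theta)
    c = np.zeros(N)
    for xt in x:
        c = Ad @ c + Bd * xt
    # FEL: low-pass along the Legendre-index axis (unit weights).
    Cf = np.fft.rfft(c)
    Cf[M:] = 0.0
    c_denoised = np.fft.irfft(Cf, n=N)
    # LPU_R: evaluate the Legendre basis on the window and reconstruct.
    s = np.linspace(-1.0, 1.0, T)  # rescaled time 2(s-t)/theta + 1
    P = np.polynomial.legendre.legvander(s, N - 1)
    return P @ c_denoised
```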

4.2 Multiscale Mixture-of-Experts

FiLM processes histories at several time resolutions (e.g., $T, 2T, 4T$), with each block forecasting separate future windows; outputs are combined via a learned gating mechanism, capturing information from both medium- and long-range dependencies.
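A schematic of the gating step, with the experts abstracted as callables over different history windows (the gate weights would be learned jointly in practice; names are illustrative):

```python
import numpy as np

def multiscale_mixture(history, experts, gate_logits):
    """Combine expert forecasts over multiple history scales.
    `experts` is a list of (forecaster, window_length) pairs; each
    forecaster sees only the last `window_length` points.  The gate
    softmax-combines the per-expert forecasts."""
    forecasts = np.stack([f(history[-w:]) for f, w in experts])
    g = np.exp(gate_logits - gate_logits.max())  # stable softmax
    g /= g.sum()
    return g @ forecasts
```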

4.3 Optional Pre/Post-Processing

Per-series Instance Normalization (RevIN) can be applied before and after FiLM to enhance robustness to distribution shift. Its use is dataset-dependent.
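A minimal sketch of such a RevIN-style wrapper (omitting the learnable affine transform of full RevIN):

```python
import numpy as np

def revin_wrap(forecaster, x, eps=1e-5):
    """Instance-normalization wrapper: normalize the input series with
    its own mean and std, forecast, then de-normalize the output with
    the same statistics."""
    mu, sigma = x.mean(), x.std() + eps
    y = forecaster((x - mu) / sigma)
    return y * sigma + mu
```

Because the statistics are computed per instance, the wrapper shifts the forecaster's burden away from tracking level changes, which is why its benefit depends on how much distribution shift a dataset exhibits.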

4.4 Default Hyperparameters

Component | Default value | Notes
Legendre dim. $N$ | 256 | Number of polynomial bases
Fourier modes $M$ | 32 | Number of frequencies retained
Low-rank $r$ | 4 | Compression/accuracy tradeoff
Scales | 3 | $T, 2T, 4T$
Batch size | 32–256 | Task dependent
Optimizer | Adam | Learning rate $10^{-3} \rightarrow 10^{-4}$ over 15 epochs

4.5 Training Objective

The model is trained using mean squared error (MSE) over the training samples:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^N \| \hat{y}_i - y_i \|^2$$

No curriculum schedules or special warm-up phases are used. MSE is the primary loss; mean absolute error (MAE) is reported but not optimized.

5. Empirical Evaluation and Ablation Analysis

5.1 Comparative Benchmarks

Across six real-world datasets (Traffic, Electricity, Exchange, Weather, ILI, ETTm/ETTh) and a range of forecast horizons, FiLM demonstrates substantial error reductions relative to prior SOTA models:

Task | MSE reduction vs. best prior
Multivariate | 20.3% (vs. FEDformer)
Univariate | 22.6%

Seven competitive baselines are evaluated, including FEDformer, Autoformer, Informer, S4, LogTrans, and Reformer.

5.2 Module Drop-In and Substitution

  • Replacing the LPU with a plain linear layer degrades performance in every architecture tested.
  • Augmenting existing MLP, LSTM, CNN, or Transformer networks with LPU and FEL offers consistent and large MSE improvements (8–120% relative gain).

5.3 Component Ablations

  • Replacing the FEL with a standard MLP, LSTM, CNN, or vanilla attention module yields 5–300% worse performance.
  • Reducing the rank $r$ from 256 to 4 compresses the weights to 0.41% of baseline with <1% MSE increase; $r=1$ achieves within 5% of full performance.
  • Limiting to the lowest $M$ Fourier modes is robust; some datasets benefit from including a small fraction of higher modes.

5.4 Efficiency

  • Parameter count: FiLM $(r=4)$ uses 80% fewer trainable weights than FEDformer.
  • Memory usage and training time scale linearly in input length, with per-epoch training time roughly 50% lower than deeper competitors.

6. Integration with Existing Time Series Forecasters

FiLM's memory and denoising modules can be embedded into arbitrary forecasting architectures:

  1. Prepend LPU: Replace the raw series $X(t)$ with the Legendre state $c(t)$.
  2. Apply FEL: Perform the Fourier-based denoising as described.
  3. Decode: Use the reconstructed features or pass $c'(t)$ to the backbone forecaster.
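The three steps can be sketched as a front-end around any backbone (toy NumPy version with unit FEL weights and $\Delta t = 1$; function names are illustrative):

```python
import numpy as np

def legendre_memory(x, N, theta):
    """Step 1 (LPU): final Legendre coefficient state for series x."""
    n = np.arange(N)
    A = np.where(n[:, None] >= n[None, :],
                 (-1.0) ** (n[:, None] - n[None, :]), 1.0) * (2 * n[:, None] + 1)
    B = (2 * n + 1.0) * (-1.0) ** n
    I = np.eye(N)
    Minv = np.linalg.inv(I + A / (2 * theta))
    Ad, Bd = Minv @ (I - A / (2 * theta)), Minv @ (B / theta)
    c = np.zeros(N)
    for xt in x:
        c = Ad @ c + Bd * xt
    return c

def film_frontend(backbone, x, N=16, M=4):
    """Steps 1-3: encode with the LPU, denoise with the FEL, then hand
    c'(t) to the backbone (any callable mapping R^N to a forecast)."""
    c = legendre_memory(x, N, theta=len(x))  # 1. prepend LPU
    Cf = np.fft.rfft(c)                      # 2. apply FEL (weights = 1)
    Cf[M:] = 0.0
    c_denoised = np.fft.irfft(Cf, n=N)
    return backbone(c_denoised)              # 3. decode via backbone
```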

Empirical evidence shows up to 120% relative MSE improvement when used as a plug-in to existing MLP, LSTM, CNN, and vanilla attention models, with negligible parameter overhead (ca. 0.5% of the full model).

7. Significance and Implications

FiLM demonstrates that Legendre polynomial-based memory, augmented with frequency selection and low-rank adaptation, offers a principled and practical approach for long-term sequence modeling. It balances expressiveness and regularization, efficiently attenuates overfitting to noise, and is broadly applicable as a module across network architectures. FiLM's empirical performance on real-world datasets and its ablation support the centrality of structured memory and frequency-aware denoising in advancing time-series forecasting (Zhou et al., 2022).
