FEATHer: Adaptive Temporal Forecasting Model

Updated 20 January 2026
  • The paper presents FEATHer as a state-of-the-art model delivering accurate long-range forecasts on edge devices using as few as 400 parameters.
  • It employs a multiscale decomposition with a shared Dense Temporal Kernel and frequency-aware gating to efficiently process different signal frequencies.
  • Experimental results show FEATHer outperforms benchmarks across diverse time-series datasets while maintaining ultra-low latency and memory usage.

The Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) is a time-series forecasting architecture designed for accurate long-term predictions on severely resource-constrained edge devices, such as programmable logic controllers and microcontrollers. FEATHer achieves state-of-the-art long-range forecasting accuracy with as few as 400 parameters by combining hand-crafted yet learnable multiscale decomposition, a shared depthwise convolutional temporal mixer, adaptively fused frequency-aware pathway gating, and a sparse, parameter-minimal period extrapolation kernel. Its design is motivated by the need to meet millisecond-level latency and minimal memory usage requirements in industrial and cyber-physical systems, where conventional deep sequence models are often infeasible due to hardware limitations (Lee et al., 16 Jan 2026).

1. Multiscale Input Decomposition with Frequency Pathways

FEATHer decomposes the input sequence $X \in \mathbb{R}^{L \times D}$ into four parallel, time-aligned branches, each focused on a distinct frequency band:

  • Point branch ($b=p$): $k_p = 1$
    • $X^{(p)} = \mathrm{DWConv}_{k=1}(\mathrm{InstanceNorm}(X))$
  • High-frequency branch ($b=h$): $k_h = 3$
    • $X^{(h)} = \mathrm{DWConv}_{k=3}(\mathrm{InstanceNorm}(X))$
  • Mid-frequency branch ($b=m$): $k_m = 5$
    • $X^{(m)} = \mathrm{DWConv}_{k=5}(\mathrm{InstanceNorm}(X))$
  • Low-frequency branch ($b=l$): downsample, then upsample
    • $X^{(l)} = \mathrm{Upsample}_L\left(\mathrm{AvgPool}_r(\mathrm{InstanceNorm}(X))\right)$, $r = 4$

Each branch’s kernel is implemented as a depthwise 1D convolution. The low-frequency branch employs stride-based average pooling followed by linear upsampling to filter out high-frequency content with no learnable parameters, resulting in a near-orthogonal filter bank that reduces cross-frequency interference under extreme parameter constraints. All pathway outputs maintain sequence length $L$, easing later fusion.
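
The four-branch decomposition can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation: the point branch is simplified to the normalized input (a size-1 kernel), the small convolution kernels shown are arbitrary placeholder weights, and `decompose` is a hypothetical helper name.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    # Normalize each channel over the time axis.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def dwconv(x, kernel):
    # Depthwise 1D convolution with one shared 1D kernel, "same"-length output.
    L, _ = x.shape
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([np.tensordot(kernel, xp[t:t + k], axes=(0, 0)) for t in range(L)])

def decompose(x, r=4):
    xn = instance_norm(x)
    L, D = x.shape
    x_p = xn                                         # point branch (k=1, simplified to identity)
    x_h = dwconv(xn, np.array([0.2, 0.6, 0.2]))      # high-frequency branch, k=3 (placeholder weights)
    x_m = dwconv(xn, np.full(5, 0.2))                # mid-frequency branch, k=5 (placeholder weights)
    # Low-frequency branch: average-pool by r, then linearly upsample back to length L.
    pooled = xn[: (L // r) * r].reshape(L // r, r, D).mean(axis=1)
    t_lo = np.arange(L // r) * r + (r - 1) / 2
    t_hi = np.arange(L)
    x_l = np.stack([np.interp(t_hi, t_lo, pooled[:, d]) for d in range(D)], axis=1)
    return x_p, x_h, x_m, x_l
```

All four outputs keep the input's $(L, D)$ shape, which is what makes the later gated fusion a simple weighted sum.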

2. Shared Dense Temporal Kernel for Lightweight Temporal Modeling

After multiscale decomposition, each branch is processed by a shared Dense Temporal Kernel (DTK), which performs “projection–depthwise convolution–reverse projection” to maintain parameter efficiency:

  • Linear projection: $Z^{(b)} = X^{(b)} W_{in}$, with $W_{in} \in \mathbb{R}^{D \times S}$
  • Depthwise temporal convolution: $U^{(b)} = \mathrm{DWConv}_{k_t}(Z^{(b)})$, which operates independently on each channel; $k_t$ is typically 3 or 5
  • Reverse (output) projection: $H^{(b)} = U^{(b)} W_{out}$, with $W_{out} \in \mathbb{R}^{S \times D}$

Weights $\{W_{in}, W_{out}, \mathrm{DWConv}\}$ are shared across all branches, so the total parameter count does not scale with the number of frequency bands. The construction is proven globally Lipschitz with constant at most $\|W_{in}\| \cdot \|K\| \cdot \|W_{out}\|$, where $K$ denotes the depthwise convolution kernel.
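
A minimal numpy sketch of the projection–depthwise convolution–reverse projection pattern, assuming a per-latent-channel kernel so the parameter count matches the $2DS + k_t S$ figure given later; `dtk` is a hypothetical helper name, not the paper's code.

```python
import numpy as np

def dtk(x, w_in, kernel, w_out):
    """Shared Dense Temporal Kernel: project D -> S, depthwise conv over time, project S -> D."""
    z = x @ w_in                                  # (L, S) linear projection
    k = kernel.shape[0]
    pad = k // 2
    zp = np.pad(z, ((pad, pad), (0, 0)), mode="edge")
    # Depthwise temporal convolution: each latent channel has its own k_t-tap kernel.
    u = np.stack([(kernel * zp[t:t + k]).sum(axis=0) for t in range(z.shape[0])])
    return u @ w_out                              # (L, D) reverse projection

# Ultra-compact configuration from the text: D=1, S=8, k_t=3.
rng = np.random.default_rng(0)
D, S, k_t = 1, 8, 3
w_in = rng.normal(size=(D, S))
w_out = rng.normal(size=(S, D))
kernel = rng.normal(size=(k_t, S))
n_params = w_in.size + w_out.size + kernel.size   # 2*D*S + k_t*S
```

Because the same `w_in`, `kernel`, and `w_out` are applied to every branch, adding frequency bands adds no parameters.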

3. Frequency-Aware Branch Gating

FEATHer adapts the fusion of branch outputs via instance-dependent, spectrum-driven gating. The gating process is as follows:

  1. Normalize $X$ across time.
  2. Compute the real FFT: $F = \mathrm{FFT}(X_{\text{norm}}) \in \mathbb{C}^{L_f \times D}$
  3. Compute the magnitude: $A = |F|$
  4. Collapse channels: $\vec{a} = \frac{1}{D} \sum_{d=1}^{D} A_{:,d} \in \mathbb{R}^{L_f}$
  5. Produce gating logits with a 1D ConvNet: $\vec{u} = v(\vec{a}) \in \mathbb{R}^{B}$
  6. Compute softmax weights: $g_b = \exp(u_b) / \sum_{i=1}^{B} \exp(u_i)$

The fused representation is $H = \sum_{b=1}^{B} g_b H^{(b)}$, $H \in \mathbb{R}^{L \times D}$, where $B$ is the number of branches. The energy-based gating is derived by entropy-regularized cross-entropy minimization, yielding a softmax policy $g_b \propto \exp(E_b/\tau)$, where $E_b$ is the relevance ("energy") of branch $b$ and $\tau$ is a temperature.
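
The gating pipeline above can be sketched with numpy's real FFT. The paper's logit network $v$ is a small 1D ConvNet; here `band_energy_logits` is a hypothetical stand-in that simply sums spectral energy in $B$ contiguous bands, to show the data flow rather than the learned function.

```python
import numpy as np

def band_energy_logits(a, B=4):
    # Hypothetical stand-in for the paper's 1D ConvNet v: log energy per frequency band.
    bands = np.array_split(a, B)
    return np.log(np.array([b.sum() for b in bands]) + 1e-8)

def branch_gates(x, logit_fn):
    """Spectrum-driven gating: FFT magnitude -> channel-averaged spectrum -> softmax weights."""
    xn = (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-5)   # 1. normalize across time
    spec = np.abs(np.fft.rfft(xn, axis=0))               # 2-3. real FFT magnitude, (L_f, D)
    a = spec.mean(axis=1)                                # 4. collapse channels, (L_f,)
    u = logit_fn(a)                                      # 5. gating logits, (B,)
    e = np.exp(u - u.max())                              # 6. numerically stable softmax
    return e / e.sum()
```

A slowly varying input concentrates spectral energy in the lowest band, so the first gate dominates; a noisy input spreads the weights out.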

4. Sparse Period Kernel for Forecast Projection

To efficiently map latent states to multistep output forecasts ($H \gg L$), the Sparse Period Kernel (SPK) is used:

  • Sliding residual aggregation: $H_{agg} = H + \mathrm{DWConv}_{k_{slide}}(H)$, of length $L$
  • Phase alignment: choose a period $P$ such that $P$ divides both $L$ and $H$. For each channel, reshape $H_{agg}$ to $(P, n)$, with $n = L/P$ and $m = H/P$
  • Shared linear mapping: for each phase $p = 1, \dots, P$, $Y_{p,d} = H_{agg,p,d} W \in \mathbb{R}^{m}$, with $W \in \mathbb{R}^{n \times m}$ shared across channels
  • Recombination: interleave all $P$ outputs to form $\hat{Y} \in \mathbb{R}^{H \times D}$

SPK is shown to be parameter-minimal: for phase-aligned period mappings, at least $nm$ parameters are necessary, and SPK uses exactly $nm$.
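
The SPK steps can be sketched as follows, assuming the shared-$W$ form described above; a moving average stands in for the depthwise sliding convolution, and `spk` is a hypothetical helper name.

```python
import numpy as np

def spk(h, P, H_out, W, k_slide=3):
    """Sparse Period Kernel sketch: residual aggregation, phase reshape, shared linear map."""
    L, D = h.shape
    assert L % P == 0 and H_out % P == 0
    n, m = L // P, H_out // P
    assert W.shape == (n, m)
    # Sliding residual aggregation (moving average as a stand-in for the depthwise conv).
    pad = k_slide // 2
    hp = np.pad(h, ((pad, pad), (0, 0)), mode="edge")
    h_agg = h + np.stack([hp[t:t + k_slide].mean(axis=0) for t in range(L)])
    y = np.zeros((H_out, D))
    for d in range(D):
        phased = h_agg[:, d].reshape(n, P).T       # (P, n): the n past values at each phase
        y[:, d] = (phased @ W).T.reshape(H_out)    # (P, m) -> interleave phases back in time order
    return y
```

Each phase $p$ reuses the same $n \times m$ matrix $W$, which is why the kernel needs only $nm$ weights regardless of $P$ or $D$.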

5. Parameter Efficiency, Memory, and Computational Characteristics

The FEATHer model can be instantiated with roughly $400$ trainable parameters:

  • Decomposition filters: negligible (fixed depthwise convolution and pooling)
  • Dense Temporal Kernel: $2DS + k_t S$ (projection matrices and convolution)
  • Sparse Period Kernel: $nm$

For a typical “ultra-compact” configuration ($D = 1$ univariate, $S \approx 8$, $k_t = 3$, $L = 96$, $P = 24$), the SPK uses $24 \times (96/24) = 96$ parameters, and the total is $\lesssim 419$. Peak RAM usage is under 16 KB for $D = 1$, $L = 96$, and under 32 KB for moderate $D$. Inference cost is $O(DS + S k_t + Pm)$ per step, with all terms linear in $L$ and $D$. On an ARM Cortex-M3, latency per $96 \rightarrow \{96, 192, 336, 720\}$ forecast ranges from $0.08$ ms to $1.4$ ms.

6. Experimental Results and Comparative Evaluation

FEATHer was evaluated on eight long-range multivariate time-series benchmarks:

| Dataset | Granularity | Application Theme |
| --- | --- | --- |
| ETTh1, ETTh2 | Hourly | Electricity demand |
| AirQuality | Hourly | Pollution monitoring |
| SML | 15 minutes | Indoor sensing |
| Weather | 10 minutes | Meteorology |
| Solar-Energy | 1 hour | Solar generation |
| Traffic | 1 hour | Urban traffic |
| Electricity | 1 hour | Household usage |

Metrics include mean squared error (MSE), mean absolute error (MAE), and Pearson correlation (COR). FEATHer attained the best ranking in 60 cases (across 8 datasets and 4 horizons), with an average rank of $2.05$ versus $3.6$–$4.7$ for the next-best baselines. On Solar-Energy at the $H = 720$ horizon, FEATHer’s MSE was $0.212$ compared to PatchTST’s $0.194$, but FEATHer used fewer than 5K parameters whereas PatchTST used more than 1M. Ablation studies showed that each core FEATHer component contributed a $2$–$6\%$ improvement in MSE.

7. Deployment Implications for Edge Inference

FEATHer’s architectural minimalism enables real-time deployment on microcontrollers and PLCs. On an ARM Cortex-M3, inference latency (for $L = 96$, $H = 96$) is approximately $0.08$ ms, rising to $0.6$–$1.4$ ms for $H = 720$. Peak RAM usage is under 16 KB for single-channel input and under 32 KB for eight-channel cases. For comparison, transformer models require $50$–$500$ KB of activations and over $100$ ms of latency. This resource profile satisfies stringent requirements in industrial automation, embedded control, and energy-constrained settings, where on-chip SRAM and sub-millisecond latency are indispensable (Lee et al., 16 Jan 2026). A plausible implication is that FEATHer’s approach provides a template for scalable, data-efficient forecasting in safety-critical cyber-physical systems operating entirely on on-chip resources.
