FEATHer: Adaptive Temporal Forecasting Model
- The paper presents FEATHer as a state-of-the-art model delivering accurate long-range forecasts on edge devices using as few as 400 parameters.
- It employs a multiscale decomposition with a shared Dense Temporal Kernel and frequency-aware gating to efficiently process different signal frequencies.
- Experimental results show FEATHer outperforms benchmarks across diverse time-series datasets while maintaining ultra-low latency and memory usage.
The Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) is a time-series forecasting architecture designed for accurate long-term predictions on severely resource-constrained edge devices, such as programmable logic controllers and microcontrollers. FEATHer achieves state-of-the-art long-range forecasting accuracy with as few as 400 parameters by combining hand-crafted yet learnable multiscale decomposition, a shared depthwise convolutional temporal mixer, adaptively fused frequency-aware pathway gating, and a sparse, parameter-minimal period extrapolation kernel. Its design is motivated by the need to meet millisecond-level latency and minimal memory usage requirements in industrial and cyber-physical systems, where conventional deep sequence models are often infeasible due to hardware limitations (Lee et al., 16 Jan 2026).
1. Multiscale Input Decomposition with Frequency Pathways
FEATHer decomposes the input sequence into four parallel, time-aligned branches, each focused on a distinct frequency band:
- Point branch (b = p)
- High-frequency branch (b = h)
- Mid-frequency branch (b = m)
- Low-frequency branch (b = l): downsample, then upsample

Each branch’s kernel is implemented as a depthwise 1D convolution. The low-frequency branch instead uses stride-based average pooling followed by linear upsampling, filtering out high-frequency content with no learnable parameters. The result is a near-orthogonal filter bank that reduces cross-frequency interference under extreme parameter constraints. All pathway outputs retain the original sequence length, which eases later fusion.
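The four-branch decomposition can be sketched with fixed box filters standing in for the learnable depthwise convolutions; the kernel sizes and pooling stride below are illustrative assumptions, not the paper’s values:

```python
import numpy as np

def decompose(x, k_high=3, k_mid=7, stride=4):
    """Split a 1-D signal into four time-aligned frequency branches.

    Illustrative sketch: fixed box filters stand in for FEATHer's
    learnable depthwise convolutions.
    """
    L = len(x)
    # Point branch: the raw signal, passed through unchanged.
    b_p = x.copy()
    # Mid-frequency branch: moderate smoothing ('same' keeps length L).
    b_m = np.convolve(x, np.ones(k_mid) / k_mid, mode="same")
    # High-frequency branch: light smoothing minus heavier smoothing
    # keeps only the rapid variations.
    b_h = np.convolve(x, np.ones(k_high) / k_high, mode="same") - b_m
    # Low-frequency branch: stride-based average pooling, then linear
    # upsampling back to length L (parameter-free, as in the paper).
    pooled = x[: L - L % stride].reshape(-1, stride).mean(axis=1)
    b_l = np.interp(np.arange(L), np.arange(len(pooled)) * stride, pooled)
    return b_p, b_h, b_m, b_l
```

All four outputs share the input length, so they can be fused elementwise downstream.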
2. Shared Dense Temporal Kernel for Lightweight Temporal Modeling
After multiscale decomposition, each branch is processed by a shared Dense Temporal Kernel (DTK), which performs “projection–depthwise convolution–reverse projection” to maintain parameter efficiency:
- Linear projection: maps each time step’s channel vector into a low-dimensional latent space
- Depthwise temporal convolution: a small kernel (typically of size 3 or 5) that operates independently on each latent channel
- Reverse (output) projection: maps the latent representation back to the original channel dimension
Weights are shared across all branches, so the total parameter count does not scale with the number of frequency bands. The construction is proven globally Lipschitz with a bounded constant.
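The projection–depthwise convolution–reverse projection pipeline can be sketched as follows; the latent width `d`, kernel size `k`, and random weights are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C input channels, d latent channels, kernel size k.
C, d, k, L = 8, 4, 3, 64
W_in = rng.normal(size=(d, C)) / np.sqrt(C)   # linear projection
W_dw = rng.normal(size=(d, k)) / k            # one kernel per latent channel
W_out = rng.normal(size=(C, d)) / np.sqrt(d)  # reverse projection

def dtk(z):
    """Shared Dense Temporal Kernel: project -> depthwise conv -> project back.

    z: (C, L) branch output. The same three weight tensors are reused for
    every frequency branch, so parameters do not grow with branch count.
    """
    h = W_in @ z                                   # (d, L)
    h = np.stack([np.convolve(h[c], W_dw[c], mode="same") for c in range(d)])
    return W_out @ h                               # (C, L)
```

Because `dtk` is called once per branch with the same weights, its parameter cost is paid exactly once regardless of how many frequency pathways exist.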
3. Frequency-Aware Branch Gating
FEATHer adapts the fusion of branch outputs via instance-dependent, spectrum-driven gating. The gating process is as follows:
- Normalize each branch output $z_b$ across time to obtain $\tilde{z}_b$
- Compute the real FFT: $Z_b = \mathrm{rFFT}(\tilde{z}_b)$
- Compute the magnitude spectrum $|Z_b|$
- Collapse channels by aggregating the per-channel magnitudes
- Produce gating logits from the collapsed spectrum with a small 1D ConvNet
- Compute softmax weights $\alpha_b$ over the branches

The fused representation is $\hat{z} = \sum_{b=1}^{B} \alpha_b z_b$, where $B$ is the number of branches. The energy-based gating is derived by entropy-regularized cross-entropy minimization, resulting in a softmax policy $\alpha_b \propto \exp(E_b)$, where $E_b$ is the relevance (“energy”) of branch $b$.
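A minimal sketch of the gating computation, substituting total spectral energy for the paper’s small 1D ConvNet (an assumption made here so the softmax fusion is self-contained):

```python
import numpy as np

def gate_and_fuse(branches):
    """Frequency-aware branch gating sketch.

    branches: list of (C, L) arrays. A per-branch scalar "energy" is read
    from the FFT magnitude spectrum (total energy here, in place of the
    paper's learned ConvNet), then the branches are fused with softmax
    weights over that energy.
    """
    energies = []
    for z in branches:
        # Normalize across time, then take the real-FFT magnitude spectrum.
        zn = (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + 1e-8)
        mag = np.abs(np.fft.rfft(zn, axis=-1))
        energies.append(mag.sum())          # collapse channels and bins
    e = np.array(energies)
    alpha = np.exp(e - e.max())             # numerically stable softmax
    alpha /= alpha.sum()
    fused = sum(a * z for a, z in zip(alpha, branches))
    return fused, alpha
```

The weights are instance-dependent: a spiky input shifts mass toward the high-frequency branch, a smooth one toward the low-frequency branch.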
4. Sparse Period Kernel for Forecast Projection
To efficiently map latent states to multistep output forecasts, the Sparse Period Kernel (SPK) is used:
- Sliding residual aggregation: aggregate residuals over a sliding window of the latent sequence
- Phase alignment: choose a period $P$ compatible with the input and forecast lengths; for each channel, reshape the sequence into a cycles-by-phases matrix
- Shared linear mapping: for each phase, apply a small linear map from past cycles to future cycles, with weights shared across channels
- Recombine: interleave the per-phase outputs to form the final multistep forecast

SPK is shown to be parameter-minimal: for phase-aligned period mappings, it uses exactly the proven lower bound on the number of parameters.
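A single-channel sketch of the phase-aligned mapping, assuming the latent length and horizon are both divisible by the period, and that one linear map from past cycles to future cycles is shared across phases and channels (a simplification of the paper’s construction):

```python
import numpy as np

def spk_forecast(z, period, horizon, W):
    """Sparse Period Kernel sketch for one channel.

    z: latent sequence whose length L is divisible by `period`.
    W: (L // period, horizon // period) linear map from past cycles to
       future cycles, shared across phases and channels.
    """
    cycles = z.reshape(-1, period)     # rows = cycles, columns = phases
    future = W.T @ cycles              # (horizon // period, period)
    # Flattening row-major interleaves the per-phase outputs back into
    # chronological order.
    return future.reshape(horizon)
```

With a mean-over-cycles map (`W = np.ones((8, 2)) / 8` for `L = 32`, `period = 4`, `horizon = 8`), each forecast step repeats its phase’s historical mean, illustrating how periodic structure is extrapolated with very few weights.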
5. Parameter Efficiency, Memory, and Computational Characteristics
The FEATHer model can be instantiated with as few as 400 trainable parameters:
- Decomposition filters: negligible (fixed depthwise convolution and pooling)
- Dense Temporal Kernel: projection matrices and a small depthwise convolution
- Sparse Period Kernel: shared per-phase linear maps

In a typical “ultra-compact” univariate configuration, the total parameter count is on the order of 400, dominated by the SPK. Peak RAM usage is in the low-kilobyte range, and the per-step inference cost is linear in both sequence length and channel count. On ARM Cortex-M3, latency per forecast ranges from $0.08$ ms to $1.4$ ms.
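A back-of-the-envelope parameter tally under hypothetical dimensions (none of these values are taken from the paper) illustrates why the total stays in the hundreds:

```python
# Hypothetical ultra-compact configuration; all sizes are stand-ins.
C, d, k = 1, 4, 3               # channels, DTK latent width, conv kernel size
cycles_in, cycles_out = 24, 8   # SPK past cycles mapped to future cycles

dtk_params = d * C + d * k + C * d   # in-projection + depthwise conv + out-projection
spk_params = cycles_in * cycles_out  # shared phase-aligned linear map
total = dtk_params + spk_params      # decomposition filters are fixed (0 params)
print(dtk_params, spk_params, total)
```

Every term is a product of two small integers, so even generous choices for the latent width or cycle counts keep the budget orders of magnitude below conventional deep sequence models.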
6. Experimental Results and Comparative Evaluation
FEATHer was evaluated on eight long-range multivariate time-series benchmarks:
| Dataset | Granularity | Application Theme |
|---|---|---|
| ETTh1, ETTh2 | Hourly | Electricity demand |
| AirQuality | Hourly | Pollution monitoring |
| SML | 15 minutes | Indoor sensing |
| Weather | 10 minutes | Meteorology |
| Solar-Energy | Hourly | Solar generation |
| Traffic | Hourly | Urban traffic |
| Electricity | Hourly | Household usage |
Metrics include mean squared error (MSE), mean absolute error (MAE), and Pearson correlation (COR). FEATHer attained the best ranking in 60 cases across the 8 datasets and 4 forecast horizons, with an average rank of $2.05$ versus $3.6$–$4.7$ for the next best baselines. On Solar-Energy, FEATHer’s MSE of $0.212$ was close to PatchTST’s $0.194$, yet FEATHer used orders of magnitude fewer parameters (thousands versus millions). Ablation studies demonstrated that each core FEATHer component contributed a measurable improvement in MSE.
7. Deployment Implications for Edge Inference
FEATHer’s architectural minimalism enables real-time deployment on microcontrollers and PLCs. On ARM Cortex-M3, inference latency is sub-millisecond for short horizons, rising to $0.6$–$1.4$ ms for longer ones. Peak RAM usage is in the low-kilobyte range for single-channel input and remains modest for eight-channel cases. For comparison, transformer models require $50$–$500$ KB of activations and latencies of several milliseconds or more. This resource profile satisfies stringent requirements in industrial automation, embedded control, and energy-constrained settings, where tight on-chip SRAM budgets and sub-millisecond latency are indispensable (Lee et al., 16 Jan 2026). A plausible implication is that FEATHer’s approach provides a template for scalable, data-efficient forecasting in safety-critical cyber-physical systems operating entirely with on-chip resources.