FEATHer: Adaptive Temporal Forecasting Model
- The paper presents FEATHer as a state-of-the-art model delivering accurate long-range forecasts on edge devices using as few as 400 parameters.
- It employs a multiscale decomposition with a shared Dense Temporal Kernel and frequency-aware gating to efficiently process different signal frequencies.
- Experimental results show FEATHer outperforms benchmarks across diverse time-series datasets while maintaining ultra-low latency and memory usage.
The Fourier-Efficient Adaptive Temporal Hierarchy Forecaster (FEATHer) is a time-series forecasting architecture designed for accurate long-term predictions on severely resource-constrained edge devices, such as programmable logic controllers and microcontrollers. FEATHer achieves state-of-the-art long-range forecasting accuracy with as few as 400 parameters by combining hand-crafted yet learnable multiscale decomposition, a shared depthwise convolutional temporal mixer, adaptively fused frequency-aware pathway gating, and a sparse, parameter-minimal period extrapolation kernel. Its design is motivated by the need to meet millisecond-level latency and minimal memory usage requirements in industrial and cyber-physical systems, where conventional deep sequence models are often infeasible due to hardware limitations (Lee et al., 16 Jan 2026).
1. Multiscale Input Decomposition with Frequency Pathways
FEATHer decomposes the input sequence into four parallel, time-aligned branches, each focused on a distinct frequency band:
- Point branch (b = p)
- High-frequency branch (b = h)
- Mid-frequency branch (b = m)
- Low-frequency branch (b = l): downsample, then upsample

Each branch’s kernel is implemented as a depthwise 1D convolution. The low-frequency branch instead uses stride-based average pooling followed by linear upsampling, filtering out high-frequency content with no learnable parameters. The result is a near-orthogonal filter bank that reduces cross-frequency interference under extreme parameter constraints. All pathway outputs retain the original sequence length, which eases later fusion.
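The four-branch decomposition can be sketched with fixed box filters standing in for the learnable depthwise convolutions; the kernel sizes and pooling stride below are illustrative assumptions, not the paper’s values:

```python
import numpy as np

def decompose(x, k_high=3, k_mid=7, stride=4):
    """Split a 1-D signal into four time-aligned frequency branches.

    Illustrative sketch: fixed box filters stand in for FEATHer's
    learnable depthwise convolutions.
    """
    L = len(x)
    # Point branch: the raw signal, passed through unchanged.
    b_p = x.copy()
    # Mid-frequency branch: moderate smoothing ('same' keeps length L).
    b_m = np.convolve(x, np.ones(k_mid) / k_mid, mode="same")
    # High-frequency branch: light smoothing minus heavier smoothing
    # keeps only the rapid variations.
    b_h = np.convolve(x, np.ones(k_high) / k_high, mode="same") - b_m
    # Low-frequency branch: stride-based average pooling, then linear
    # upsampling back to length L (parameter-free, as in the paper).
    pooled = x[: L - L % stride].reshape(-1, stride).mean(axis=1)
    b_l = np.interp(np.arange(L), np.arange(len(pooled)) * stride, pooled)
    return b_p, b_h, b_m, b_l
```

All four outputs share the input length, so they can be fused elementwise downstream.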
2. Shared Dense Temporal Kernel for Lightweight Temporal Modeling
After multiscale decomposition, each branch is processed by a shared Dense Temporal Kernel (DTK), which performs “projection–depthwise convolution–reverse projection” to maintain parameter efficiency:
- Linear projection: maps each time step’s channel vector into a low-dimensional latent space
- Depthwise temporal convolution: a small kernel (typically of size 3 or 5) that operates independently on each latent channel
- Reverse (output) projection: maps the latent representation back to the original channel dimension
Weights are shared across all branches, so the total parameter count does not scale with the number of frequency bands. The construction is proven globally Lipschitz with a bounded constant.
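The projection–depthwise convolution–reverse projection pipeline can be sketched as follows; the latent width `d`, kernel size `k`, and random weights are hypothetical stand-ins for the learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: C input channels, d latent channels, kernel size k.
C, d, k, L = 8, 4, 3, 64
W_in = rng.normal(size=(d, C)) / np.sqrt(C)   # linear projection
W_dw = rng.normal(size=(d, k)) / k            # one kernel per latent channel
W_out = rng.normal(size=(C, d)) / np.sqrt(d)  # reverse projection

def dtk(z):
    """Shared Dense Temporal Kernel: project -> depthwise conv -> project back.

    z: (C, L) branch output. The same three weight tensors are reused for
    every frequency branch, so parameters do not grow with branch count.
    """
    h = W_in @ z                                   # (d, L)
    h = np.stack([np.convolve(h[c], W_dw[c], mode="same") for c in range(d)])
    return W_out @ h                               # (C, L)
```

Because `dtk` is called once per branch with the same weights, its parameter cost is paid exactly once regardless of how many frequency pathways exist.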
3. Frequency-Aware Branch Gating
FEATHer adapts the fusion of branch outputs via instance-dependent, spectrum-driven gating. The gating process is as follows:
- Normalize each branch output $z_b$ across time to obtain $\tilde{z}_b$
- Compute the real FFT: $Z_b = \mathrm{rFFT}(\tilde{z}_b)$
- Compute the magnitude spectrum $|Z_b|$
- Collapse channels by aggregating the per-channel magnitudes
- Produce gating logits from the collapsed spectrum with a small 1D ConvNet
- Compute softmax weights $\alpha_b$ over the branches

The fused representation is $\hat{z} = \sum_{b=1}^{B} \alpha_b z_b$, where $B$ is the number of branches. The energy-based gating is derived by entropy-regularized cross-entropy minimization, resulting in a softmax policy $\alpha_b \propto \exp(E_b)$, where $E_b$ is the relevance (“energy”) of branch $b$.
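A minimal sketch of the gating computation, substituting total spectral energy for the paper’s small 1D ConvNet (an assumption made here so the softmax fusion is self-contained):

```python
import numpy as np

def gate_and_fuse(branches):
    """Frequency-aware branch gating sketch.

    branches: list of (C, L) arrays. A per-branch scalar "energy" is read
    from the FFT magnitude spectrum (total energy here, in place of the
    paper's learned ConvNet), then the branches are fused with softmax
    weights over that energy.
    """
    energies = []
    for z in branches:
        # Normalize across time, then take the real-FFT magnitude spectrum.
        zn = (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + 1e-8)
        mag = np.abs(np.fft.rfft(zn, axis=-1))
        energies.append(mag.sum())          # collapse channels and bins
    e = np.array(energies)
    alpha = np.exp(e - e.max())             # numerically stable softmax
    alpha /= alpha.sum()
    fused = sum(a * z for a, z in zip(alpha, branches))
    return fused, alpha
```

The weights are instance-dependent: a spiky input shifts mass toward the high-frequency branch, a smooth one toward the low-frequency branch.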
4. Sparse Period Kernel for Forecast Projection
To efficiently map latent states to multistep output forecasts, the Sparse Period Kernel (SPK) is used:
- Sliding residual aggregation: aggregate residuals over a sliding window of the latent sequence
- Phase alignment: choose a period $P$ compatible with the input and forecast lengths; for each channel, reshape the sequence into a cycles-by-phases matrix
- Shared linear mapping: for each phase, apply a small linear map from past cycles to future cycles, with weights shared across channels
- Recombine: interleave the per-phase outputs to form the final multistep forecast

SPK is shown to be parameter-minimal: for phase-aligned period mappings, it uses exactly the proven lower bound on the number of parameters.
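A single-channel sketch of the phase-aligned mapping, assuming the latent length and horizon are both divisible by the period, and that one linear map from past cycles to future cycles is shared across phases and channels (a simplification of the paper’s construction):

```python
import numpy as np

def spk_forecast(z, period, horizon, W):
    """Sparse Period Kernel sketch for one channel.

    z: latent sequence whose length L is divisible by `period`.
    W: (L // period, horizon // period) linear map from past cycles to
       future cycles, shared across phases and channels.
    """
    cycles = z.reshape(-1, period)     # rows = cycles, columns = phases
    future = W.T @ cycles              # (horizon // period, period)
    # Flattening row-major interleaves the per-phase outputs back into
    # chronological order.
    return future.reshape(horizon)
```

With a mean-over-cycles map (`W = np.ones((8, 2)) / 8` for `L = 32`, `period = 4`, `horizon = 8`), each forecast step repeats its phase’s historical mean, illustrating how periodic structure is extrapolated with very few weights.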
5. Parameter Efficiency, Memory, and Computational Characteristics
The FEATHer model can be instantiated with as few as 400 trainable parameters:
- Decomposition filters: negligible (fixed depthwise convolution and pooling)
- Dense Temporal Kernel: projection matrices and a small depthwise convolution
- Sparse Period Kernel: shared per-phase linear maps

In a typical “ultra-compact” univariate configuration, the total parameter count is on the order of 400, dominated by the SPK. Peak RAM usage is in the low-kilobyte range, and the per-step inference cost is linear in both sequence length and channel count. On ARM Cortex-M3, latency per forecast ranges from $0.08$ ms to $1.4$ ms.
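A back-of-the-envelope parameter tally under hypothetical dimensions (none of these values are taken from the paper) illustrates why the total stays in the hundreds:

```python
# Hypothetical ultra-compact configuration; all sizes are stand-ins.
C, d, k = 1, 4, 3               # channels, DTK latent width, conv kernel size
cycles_in, cycles_out = 24, 8   # SPK past cycles mapped to future cycles

dtk_params = d * C + d * k + C * d   # in-projection + depthwise conv + out-projection
spk_params = cycles_in * cycles_out  # shared phase-aligned linear map
total = dtk_params + spk_params      # decomposition filters are fixed (0 params)
print(dtk_params, spk_params, total)
```

Every term is a product of two small integers, so even generous choices for the latent width or cycle counts keep the budget orders of magnitude below conventional deep sequence models.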
6. Experimental Results and Comparative Evaluation
FEATHer was evaluated on eight long-range multivariate time-series benchmarks:
| Dataset | Granularity | Application Theme |
|---|---|---|
| ETTh1, ETTh2 | Hourly | Electricity demand |
| AirQuality | Hourly | Pollution monitoring |
| SML | 15 minutes | Indoor sensing |
| Weather | 10 minutes | Meteorology |
| Solar-Energy | Hourly | Solar generation |
| Traffic | Hourly | Urban traffic |
| Electricity | Hourly | Household usage |
Metrics include mean squared error (MSE), mean absolute error (MAE), and Pearson correlation (COR). FEATHer attained the best ranking in 60 cases across the 8 datasets and 4 forecast horizons, with an average rank of $2.05$ versus $3.6$–$4.7$ for the next best baselines. On Solar-Energy, FEATHer’s MSE of $0.212$ was close to PatchTST’s $0.194$, yet FEATHer used orders of magnitude fewer parameters (thousands versus millions). Ablation studies demonstrated that each core FEATHer component contributed a measurable improvement in MSE.
7. Deployment Implications for Edge Inference
FEATHer’s architectural minimalism enables real-time deployment on microcontrollers and PLCs. On ARM Cortex-M3, inference latency is sub-millisecond for short horizons, rising to $0.6$–$1.4$ ms for longer ones. Peak RAM usage is in the low-kilobyte range for single-channel input and remains modest for eight-channel cases. For comparison, transformer models require $50$–$500$ KB of activations and latencies of several milliseconds or more. This resource profile satisfies stringent requirements in industrial automation, embedded control, and energy-constrained settings, where tight on-chip SRAM budgets and sub-millisecond latency are indispensable (Lee et al., 16 Jan 2026). A plausible implication is that FEATHer’s approach provides a template for scalable, data-efficient forecasting in safety-critical cyber-physical systems operating entirely with on-chip resources.