
Fourier-Attentive Representation Learning

Updated 7 December 2025
  • FARL is a framework that embeds Fourier feature mappings and spectral kernels into attention models to enhance expressivity and disentangle data variations.
  • It combines learnable Fourier mappings, spatially-varying masks, and kernelized attention mechanisms to adapt similarity structures and improve inductive biases.
  • Empirical results demonstrate improved accuracy, faster convergence, and enhanced generalization across vision, language, and implicit neural representation tasks.

Fourier-Attentive Representation Learning (FARL) refers to a family of neural architectures and representational frameworks that leverage frequency-domain decomposition, Fourier integral kernels, or learned Fourier feature mappings within attention-based, neural, or implicit representation models. These models span applications from spatial positional encoding for transformers and implicit neural representations to nonparametric kernel score-parameterization of event histories and few-shot vision–language adaptation. Across its main instantiations, FARL aims to enhance expressivity, disentangle modes of variation, or adapt similarity structures by integrating data-driven or learnable frequency-domain mechanisms into attentional or self-attentive pipelines.

1. Core Principles and Mathematical Foundations

FARL systems are unified by the principle of embedding Fourier-theoretic structures—either features, kernels, or attention masks—into a representation learning pipeline, typically modulated by neural mechanisms. Key mathematical elements include:

  • Learnable Fourier Feature Mappings: A multidimensional input $p \in \mathbb{R}^M$ is encoded as

$$\phi(p) = [\sin(2\pi Bp);\ \cos(2\pi Bp)] \in \mathbb{R}^D,$$

with a learnable frequency matrix $B \in \mathbb{R}^{(D/2)\times M}$, generalizing random Fourier features to trainable bases (Li et al., 2021).

  • Modulation and Nonlinear Mixing: Raw Fourier features are passed through learnable mechanisms (e.g., MLPs) to yield final encodings $h(p) = \mathrm{MLP}_\theta(\phi(p))$ or to parameterize spatially-varying masks $M_k(x)$ (Li et al., 2021, Li et al., 2023); a minimal code sketch of this mapping appears at the end of this section.
  • Spectral/Kernelized Attention: FARL-based attention kernels take the form of generalized Fourier integrals,

$$K_R(q,k) = \prod_{d=1}^{D} \varphi\!\left(\frac{\sin\bigl(R(q_d - k_d)\bigr)}{R(q_d - k_d)}\right),$$

where $\varphi$ is a bounded polynomial weighting function, giving rise to theoretically well-grounded nonparametric regression operators for attention (Nguyen et al., 2022).

  • Fourier Decomposition in Vision–Language Models: In recent extensions, FARL employs explicit Fourier analysis to decompose image representations into structural (phase) and stylistic (amplitude) components, which are separately queried by learnable concept tokens using dual cross-attention (Pham et al., 4 Dec 2025).

These mathematical innovations allow FARL systems to encode inductive biases regarding translation invariance, scale, spatial structure, or spectral locality, and to disentangle different factors of variation in complex data.
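
To make the first two elements concrete, the following is a minimal PyTorch sketch of a learnable Fourier feature encoder with MLP modulation, following the definitions above; the module name, hidden width, and initialization are illustrative assumptions, not details taken from Li et al. (2021).

```python
import math
import torch
import torch.nn as nn

class LearnableFourierFeatures(nn.Module):
    """Encodes positions p in R^M as phi(p) = [sin(2*pi*B p); cos(2*pi*B p)]
    with a trainable frequency matrix B, then modulates with a small MLP."""

    def __init__(self, m_dim: int, d_dim: int, hidden: int = 64):
        super().__init__()
        assert d_dim % 2 == 0, "D must be even so the sin/cos halves match"
        # B in R^{(D/2) x M}, initialized like random Fourier features
        self.B = nn.Parameter(torch.randn(d_dim // 2, m_dim))
        # Learnable modulation h(p) = MLP(phi(p))
        self.mlp = nn.Sequential(
            nn.Linear(d_dim, hidden), nn.GELU(), nn.Linear(hidden, d_dim)
        )

    def forward(self, p: torch.Tensor) -> torch.Tensor:
        # p: (..., M) positions; proj: (..., D/2)
        proj = 2.0 * math.pi * p @ self.B.t()
        phi = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)  # (..., D)
        return self.mlp(phi)

# Usage: encode 2D positions into 128-dim embeddings added to content tokens.
enc = LearnableFourierFeatures(m_dim=2, d_dim=128)
positions = torch.rand(16, 2)   # 16 spatial positions in [0, 1]^2
pos_embed = enc(positions)      # (16, 128)
```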

2. Architectures and Mechanisms

The architectural implementations of FARL are diverse and tailored to the intended task:

  • Positional Encoding in Transformers: FARL replaces fixed or tabular positional encodings with learnable Fourier-based mappings, modulated by compact MLPs. These encodings are integrated with content embeddings and fed through the transformer’s query/key/value pipeline, providing inductive bias for spatially-structured inputs (Li et al., 2021).
  • Spatially Masked Fourier Bases in INRs: For implicit neural representations (INRs), FARL introduces per-basis, soft spatial masks $M_k(x)$ applied to each Fourier basis $\phi_k(x)$, yielding outputs

$$f(x) = \sum_{k=1}^{K} M_k(x)\, \phi_k(x)\, \theta_k,$$

and “collaging” frequency patches as dictated by data-dependent attention (Li et al., 2023); a minimal sketch of this readout appears after this list.

  • Fourier Integral-Kernel Attention: In FourierFormer, standard attention coefficients are replaced with generalized Fourier kernels, providing a nonparametric and theoretically universal kernel regression within each attention head (Nguyen et al., 2022).
  • Adaptive Fourier Similarity in Event Models: In point process models, FARL replaces the dot-product attention with a flexible shift-invariant Fourier kernel, whose spectral density is parameterized via a neural generator. Random Fourier features are sampled from this learned spectrum, enabling data-adaptive similarity (Zhu et al., 2020).
  • Frequency-Disentangled Vision–Language Adaptation: FARL for few-shot vision–language models employs a 2D FFT to separate images into amplitude (style) and phase (structure) components. Dual cross-attention streams augment learnable tokens, which are then injected via asymmetric adapters into deep model layers, supported by disentanglement losses and class- and representation-level regularization (Pham et al., 4 Dec 2025).
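
As referenced in the INR item above, the masked Fourier basis readout $f(x) = \sum_k M_k(x)\,\phi_k(x)\,\theta_k$ can be sketched compactly. The PyTorch snippet below is a simplified illustration of that formula; the sigmoid mask network, frequency initialization, and dimensions are placeholder assumptions rather than the architecture of Li et al. (2023).

```python
import math
import torch
import torch.nn as nn

class MaskedFourierINR(nn.Module):
    """Implicit representation f(x) = sum_k M_k(x) * phi_k(x) * theta_k,
    where the M_k are soft, spatially varying masks predicted from x."""

    def __init__(self, in_dim: int = 2, num_bases: int = 256, out_dim: int = 3):
        super().__init__()
        # Fourier bases phi_k(x) = sin(2*pi*w_k.x + b_k) with learnable frequencies
        self.W = nn.Parameter(torch.randn(num_bases, in_dim) * 10.0)
        self.b = nn.Parameter(torch.rand(num_bases) * 2.0 * math.pi)
        # Per-basis weights theta_k mapping each masked basis to the output
        self.theta = nn.Parameter(torch.randn(num_bases, out_dim) * 0.01)
        # Small network producing soft masks M_k(x) in [0, 1]
        self.mask_net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, num_bases), nn.Sigmoid()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) coordinates, e.g. pixel locations in [0, 1]^2
        phi = torch.sin(2.0 * math.pi * x @ self.W.t() + self.b)  # (N, K)
        masks = self.mask_net(x)                                  # (N, K)
        return (masks * phi) @ self.theta                         # (N, out_dim)

# Usage: fit RGB values at 2D pixel coordinates.
model = MaskedFourierINR()
coords = torch.rand(1024, 2)
rgb = model(coords)  # (1024, 3)
```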

3. Experimental Results and Empirical Properties

FARL-based approaches consistently report improved accuracy, robustness, and convergence speed across diverse domains:

  • Vision Models: In ViT-B/16 on ImageNet, learnable Fourier feature encodings improve top-1 accuracy from 73.6% to 74.5% (Li et al., 2021).
  • Image Generation: Reformer with FARL on ImageNet 64×64 achieves lower bits-per-dimension (bpd) and twice the convergence speed of 2D embedding baselines (e.g., 3.91 vs 3.97 bpd at 100k steps) (Li et al., 2021).
  • Object Detection: In DETR on COCO, learnable Fourier + MLP encodings yield higher AP and increased robustness to unseen image sizes (Li et al., 2021).
  • Implicit Neural Representation: On Kodak image-fitting, FARL’s collaged spatial basis achieves over +3 dB PSNR improvement (approx. 40 dB vs. 37 dB with SIREN), as well as superior IoU and Chamfer distance in 3D shape tasks (Li et al., 2023).
| Method & Domain | Metrics Improved | Magnitude of Gain |
|---|---|---|
| Reformer + FARL (ImageNet 64×64) | Bits per dimension, convergence speed | −0.06 bpd, ×2 speed |
| FARL (SCONE, Kodak images) | PSNR | +3 dB |
| FourierFormer (LM, WikiText-103) | Perplexity | ~−1.5 |
| Vision–Language FARL (16-shot) | Harmonic mean (11 datasets) | 81.6% vs 80.7% (MMRL) |

Notably, in few-shot vision–language adaptation, FARL achieves the highest harmonic mean across base and novel class accuracies (81.6%), a +0.9% improvement over MMRL, and outperforms entanglement-based baselines by +10% on EuroSAT novel classes (Pham et al., 4 Dec 2025).

Ablation studies demonstrate that phase-spectrum streams chiefly drive generalization, while amplitude (style) cues complement base-class performance (Pham et al., 4 Dec 2025).
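
The amplitude/phase disentanglement underlying these ablations can be illustrated with torch.fft. The reconstruction conventions below (unit amplitude for the structure stream, zero phase for the style stream) are a generic illustration, not the exact pipeline of Pham et al. (4 Dec 2025).

```python
import torch

def fourier_disentangle(images: torch.Tensor):
    """Split images (B, C, H, W) into a structure (phase) component and a
    style (amplitude) component via a 2D FFT."""
    spectrum = torch.fft.fft2(images, norm="ortho")   # complex (B, C, H, W)
    amplitude = torch.abs(spectrum)                   # style cues
    phase = torch.angle(spectrum)                     # structural cues

    # Structure-only image: keep the phase, replace the amplitude with ones.
    structure = torch.fft.ifft2(
        torch.polar(torch.ones_like(amplitude), phase), norm="ortho"
    ).real
    # Style-only image: keep the amplitude, discard the phase.
    style = torch.fft.ifft2(
        torch.polar(amplitude, torch.zeros_like(phase)), norm="ortho"
    ).real
    return structure, style

# Usage: feed the two streams to separate cross-attention branches.
imgs = torch.rand(4, 3, 224, 224)
structure_stream, style_stream = fourier_disentangle(imgs)
```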

4. Theoretical Guarantees and Properties

FARL’s use of Fourier-based elements introduces new theoretical guarantees:

  • Kernel Universality: Generalized Fourier integral kernels provide universal approximators for attention, capturing any key/set distribution under smoothness conditions and offering explicit convergence rates for density and regression estimation (Nguyen et al., 2022).
  • Shift-Invariance: FARL kernels and encoding maps with learned spectra or frequency matrices allow inductive biases over $\ell_2$ (Euclidean) distances, generalizing the benefits of RBF kernel approximations to spatial and temporal sequence modeling (Li et al., 2021, Zhu et al., 2020).
  • Parameter Efficiency and Extrapolation: By employing parameter-tying and trainable frequency bases rather than large embedding tables, FARL architectures realize $\mathcal{O}(MD)$ parameter costs and continue to generalize to unseen or out-of-sample positional or frequency queries (Li et al., 2021, Li et al., 2023).
  • Positive Definite Score Functions: In event modeling, neural-parameterized spectra guarantee that the Fourier attention kernel remains positive definite—by Bochner’s theorem—supporting stable, interpretable, and nonparametric similarity modeling over histories (Zhu et al., 2020).
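
To see how Bochner's theorem is exploited in practice, the sketch below samples random Fourier features from a learnable spectral distribution; the Gaussian reparameterization is an illustrative stand-in for the neural spectral generator of Zhu et al. (2020).

```python
import math
import torch
import torch.nn as nn

class LearnedSpectrumKernel(nn.Module):
    """Shift-invariant kernel k(x, y) ~ z(x)^T z(y) built from random Fourier
    features whose frequencies are drawn from a learnable spectral density;
    by Bochner's theorem the resulting kernel is positive (semi-)definite."""

    def __init__(self, dim: int, num_samples: int = 20):
        super().__init__()
        # Reparameterized Gaussian spectral density (a stand-in generator):
        # omega = mu + exp(log_sigma) * eps, with eps ~ N(0, I)
        self.mu = nn.Parameter(torch.zeros(num_samples, dim))
        self.log_sigma = nn.Parameter(torch.zeros(num_samples, dim))

    @staticmethod
    def _features(x: torch.Tensor, omega: torch.Tensor) -> torch.Tensor:
        proj = x @ omega.t()                                   # (N, D)
        # Real random Fourier features: [cos, sin] / sqrt(D)
        return torch.cat([proj.cos(), proj.sin()], dim=-1) / math.sqrt(omega.shape[0])

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(self.mu)
        omega = self.mu + self.log_sigma.exp() * eps           # sampled frequencies
        return self._features(x, omega) @ self._features(y, omega).t()  # (N, M)

# Usage: data-adaptive similarity between event-history embeddings.
kernel = LearnedSpectrumKernel(dim=16, num_samples=20)
scores = kernel(torch.randn(8, 16), torch.randn(5, 16))  # (8, 5) kernel matrix
```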

5. Computational Cost and Implementation Considerations

  • Runtime and Memory: Modest computational overhead is incurred for the sine, cosine, and small matrix multiplications in position encodings or Fourier mask computation; the documented slowdown is minor relative to the benefits (e.g., training throughput dropping from 1.8 to 1.2 steps/s) (Li et al., 2021).
  • Parameter Count: Typical FARL front-ends remain lightweight (e.g., ≈66K parameters for the full position-encoding front-end, compared to 60–100M in the overall transformer) (Li et al., 2021).
  • Fourier Attention Kernels in Transformers: Generalized Fourier integral kernels are implemented as fused CUDA/C++ operators, with evaluation cost comparable to standard dot-product attention (both $\mathcal{O}(N^2 D)$); a simplified version is sketched below (Nguyen et al., 2022).
  • Spectral Sampling in Point Processes: Neural generators allow efficient sampling of spectral frequencies for random feature embeddings, with sample counts (e.g., $D = 20$ per mini-batch) sufficient for near-uniform approximation (Zhu et al., 2020).
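
A dense reference version of such a generalized Fourier integral attention kernel is sketched below; the weighting $\varphi(t) = t^2$ and the bandwidth $R$ are illustrative choices, and the plain $\mathcal{O}(N^2 D)$ evaluation mirrors standard attention rather than the fused CUDA/C++ operators mentioned above.

```python
import math
import torch

def fourier_kernel_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                             R: float = 4.0) -> torch.Tensor:
    """Attention whose scores use a generalized Fourier integral kernel
    K_R(q, k) = prod_d phi(sin(R(q_d - k_d)) / (R(q_d - k_d))),
    with phi(t) = t^2 as a simple bounded, non-negative weighting."""
    # Pairwise coordinate differences: (N, N, D)
    diffs = q.unsqueeze(1) - k.unsqueeze(0)
    # sin(R*u) / (R*u) expressed via torch.sinc, which computes sin(pi*x)/(pi*x)
    sinc = torch.sinc(R * diffs / math.pi)
    # Bounded polynomial weighting phi, then product over feature dimensions
    scores = (sinc ** 2).prod(dim=-1)                                  # (N, N), >= 0
    weights = scores / scores.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    return weights @ v                                                 # (N, D_v)

# Usage inside a single attention head.
N, D = 32, 8
q, k, v = torch.rand(N, D), torch.rand(N, D), torch.rand(N, D)
out = fourier_kernel_attention(q, k, v)  # (32, 8)
```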

6. Extensions, Limitations, and Future Directions

Potential extensions and open directions for FARL frameworks include:

  • Learnable Frequencies and Masking: Joint optimization of both basis functions and spatial masks, possibly adding regularization (sparsity, total-variation) on mask fields to improve locality and sharpness (Li et al., 2023).
  • Hybrid and Deep Combination: Mixing absolute and relative Fourier encodings, deploying additional MLP or convolutional front-ends over frequency features, or extending to 3D and spatio-temporal domains (Li et al., 2021).
  • Richer Spectrum Parameterizations: Use of normalizing flows or other neural generators to model complex, multimodal, or nonstationary spectral density functions (Zhu et al., 2020).
  • Applicability to Diverse Domains: Extension to marked point processes with auxiliary modalities (e.g., images/text), or to non-Euclidean geometries and data structures, such as manifolds or graphs (Li et al., 2023, Zhu et al., 2020).
  • Attention Head Diversification: Empirically observed reduction in redundancy among attention heads with Fourier kernels points to possible architectural synergies for diversity and expressivity (Nguyen et al., 2022).
  • Vision–Language Integration: Asymmetric and late-stage token injection of disentangled Fourier representations into multiple encoder pathways for robust few-shot adaptation and domain generalization (Pham et al., 4 Dec 2025).

A remaining limitation is reliance on local MLP features for mask or feature extraction, prompting investigation into fully convolutional or hierarchical transformer-based mask generators for additional efficiency in large-scale or spatially-complex settings (Li et al., 2023).


FARL extends the functionality and flexibility of neural architectures by integrating frequency-domain reasoning and adaptive spectral parameterization deeply into attention-based and implicit models. Its empirical efficacy and theoretical grounding render it a central paradigm in current representation learning research across vision, sequence modeling, generative modeling, and beyond (Li et al., 2021, Li et al., 2023, Nguyen et al., 2022, Zhu et al., 2020, Pham et al., 4 Dec 2025).
