- The paper introduces FLaG, a domain-agnostic pooling mechanism that transforms token representations into the frequency domain and applies latent attention gating.
- It reports the lowest RMSE in antimicrobial peptide prediction with an ESM2-8M backbone and the highest mean accuracy on the CIFAR-10/100 image benchmarks, while remaining competitive on text classification.
- It establishes frequency-domain processing with channel gating as a competitive alternative to traditional time-domain pooling methods for robust cross-domain token aggregation.
Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation
Motivation and Background
Token aggregation remains a critical bottleneck in neural architectures for protein analysis, computer vision, and natural language processing, where sample-level predictions require a fixed-length representation derived from variable-length token sequences. Traditional pooling operations—mean, max, last-token, and attention-based—operate exclusively in the time domain and broadly ignore spectral information. This oversight leads to homogenization and collapse in token representations, particularly problematic in sequence-level tasks such as antimicrobial peptide (AMP) prediction, where capturing localized and sequence-scale structure is essential. While protein-specific approaches employing evolutionary anchors offer improvements, they are domain-restricted and demand external context unavailable in many settings.
FLaG Architecture and Methodology
The proposed Frequency-Domain Latent Attention Gating (FLaG) module introduces a domain-agnostic pooling mechanism leveraging spectral signal decomposition and latent attention. The pipeline consists of four sequential stages:
- Frequency Transformation: Token representations are transformed via the real Fast Fourier Transform (rFFT) along the sequence axis, yielding frequency-domain tokens that concatenate real and imaginary components.
- Latent Attention in Frequency Space: Learnable latent queries attend to the frequency tokens using multi-head attention, resulting in a latent summary that acts as a frequency detector across spectral coordinates.
- Frequency Gating: The latent summary is mapped to a channel-wise gate through an MLP and sigmoid activation, modulating spectral amplitudes with a residual connection to allow for amplification or preservation of spectral components.
- Time-Domain Reconstruction and Pooling: Gated frequency tokens are inverse transformed back to the time domain (irFFT), followed by standard pooling (max by default) and linear projection for final aggregation.
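The four stages above can be sketched for a single sample in NumPy. This is a minimal illustration, not the authors' implementation: the parameter dictionary, all weight shapes, the ReLU in the gate MLP, averaging the latent outputs into one summary vector, and the residual form `spec * (1 + gate)` are assumptions chosen to match the description (a gate near 0 preserves the spectrum, a gate near 1 amplifies it).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(D, n_latents=4, hidden=8, d_out=3, seed=0):
    """Random stand-ins for learned weights (names and shapes are assumptions)."""
    rng = np.random.default_rng(seed)
    w = lambda *shape: 0.1 * rng.normal(size=shape)
    return {
        "latents": w(n_latents, hidden),
        "Wq": w(hidden, hidden), "Wk": w(2 * D, hidden), "Wv": w(2 * D, hidden),
        "W1": w(hidden, hidden), "b1": np.zeros(hidden),
        "W2": w(hidden, D), "b2": np.zeros(D),
        "Wo": w(D, d_out), "bo": np.zeros(d_out),
    }

def flag_pool(x, p, n_heads=2):
    """x: (T, D) token representations of one sample; returns a (d_out,) vector."""
    T, D = x.shape
    # 1) Frequency transformation: rFFT along the sequence axis, then
    #    concatenate real and imaginary parts into real-valued frequency tokens.
    spec = np.fft.rfft(x, axis=0)                                   # (T//2 + 1, D), complex
    freq_tokens = np.concatenate([spec.real, spec.imag], axis=-1)   # (T//2 + 1, 2D)

    # 2) Latent attention: learnable queries attend to the frequency tokens.
    q = p["latents"] @ p["Wq"]
    k = freq_tokens @ p["Wk"]
    v = freq_tokens @ p["Wv"]
    L, H = q.shape
    dh = H // n_heads
    heads = lambda m: m.reshape(m.shape[0], n_heads, dh).transpose(1, 0, 2)
    attn = softmax(heads(q) @ heads(k).transpose(0, 2, 1) / np.sqrt(dh))
    latent_out = (attn @ heads(v)).transpose(1, 0, 2).reshape(L, H)
    summary = latent_out.mean(axis=0)          # assumption: average over latents

    # 3) Frequency gating: MLP + sigmoid gives one gate value per channel;
    #    the residual form keeps (gate ~ 0) or amplifies (gate ~ 1) each channel.
    hidden = np.maximum(0.0, summary @ p["W1"] + p["b1"])
    gate = 1.0 / (1.0 + np.exp(-(hidden @ p["W2"] + p["b2"])))      # (D,)
    gated = spec * (1.0 + gate)

    # 4) Reconstruction and pooling: irFFT back to length T, max pool, project.
    y = np.fft.irfft(gated, n=T, axis=0)                            # (T, D)
    return y.max(axis=0) @ p["Wo"] + p["bo"]

x = np.random.default_rng(1).normal(size=(7, 5))   # T=7 tokens, D=5 channels
print(flag_pool(x, init_params(D=5)).shape)        # (3,)
```

Passing `n=T` to `irfft` matters for odd sequence lengths, since the output length cannot be inferred from the number of frequency bins alone.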
Complexity analysis gives a practical arithmetic cost of O(TD log T + TLD), with the FFT step as the main source of overhead relative to lightweight time-domain pooling.
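To get a feel for the two terms, the short sketch below evaluates them for one configuration. The summary does not define the symbols, so reading T as sequence length, D as channel dimension, and L as the number of latent queries is an assumption, as is dropping all constant factors and comparing against a TD count for mean pooling.

```python
import math

def flag_cost_terms(T, D, L):
    """Return the (FFT, attention) terms of T*D*log2(T) + T*L*D, constants dropped."""
    fft_term = T * D * math.log2(T)
    attn_term = T * L * D
    return fft_term, attn_term

def mean_pool_cost(T, D):
    """Assumed baseline: one pass over all T*D entries."""
    return T * D

fft_term, attn_term = flag_cost_terms(T=1024, D=512, L=8)
ratio = (fft_term + attn_term) / mean_pool_cost(1024, 512)
print(fft_term, attn_term, ratio)  # 5242880.0 4194304 18.0
```

Under these assumptions the FFT term exceeds the attention term whenever log2(T) is larger than L.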
Empirical Evaluation
Antimicrobial Peptide Activity Prediction
FLaG is evaluated on AMP activity prediction with ESM2 backbones (8M and 35M parameter variants). On ESM2-8M, FLaG achieves the lowest RMSE on both E. coli (0.562) and S. aureus (0.545), outperforming mean pooling and the other baselines; the advantage is clearest under constrained model capacity. On Recall@50, FLaG either leads or is closely competitive, indicating improved candidate retrieval. On the larger ESM2-35M, FLaG remains competitive: it has a marginally lower RMSE on E. coli but slightly trails max pooling on S. aureus, which suggests that backbone capacity interacts with pooling efficacy.
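For readers unfamiliar with the two metrics, a minimal sketch follows. The summary does not give the paper's exact Recall@50 definition, so the version here (fraction of the true top-k items recovered in the predicted top-k, with higher values treated as more desirable) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def recall_at_k(y_true, y_pred, k):
    """Assumed definition: share of the true top-k that is also in the predicted top-k."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    top_true = set(np.argsort(-y_true)[:k].tolist())
    top_pred = set(np.argsort(-y_pred)[:k].tolist())
    return len(top_true & top_pred) / k

y_true = [3.0, 1.0, 2.0, 0.0]
y_pred = [2.5, 0.5, 3.0, 1.0]
print(recall_at_k(y_true, y_pred, k=2))  # 1.0
```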
Image Classification
ResNet18 evaluations on CIFAR-10/100 show that FLaG attains the highest mean accuracy, 96.01% on CIFAR-10 and 77.20% on CIFAR-100, with lower variance across random seeds. The margin over strong baselines is most pronounced on CIFAR-100 (0.42% over the best alternative), where the larger label space makes the task harder.
Language Classification
On IMDB sentiment and GLUE tasks with RoBERTa-base, FLaG matches or narrowly outperforms mean pooling in clean accuracy (94.08%), with reduced variance. Against RoBERTa-native sentence-level pooling, FLaG remains competitive, indicating frequency-domain pooling serves as a viable high-floor aggregation in text. On GLUE, FLaG's performance is within 0.32 points of the strongest method, with no evidence of broad superiority over established pooling strategies.
Comparative Analysis
Relative to time-domain latent attention, its closest counterpart, FLaG consistently delivers lower RMSE in peptide prediction and higher accuracy in image tasks, with competitive results on text. The frequency transform and channel-wise gating therefore contribute benefits that time-domain attention alone does not, indicating architectural value beyond the learnable queries.
Mechanistic Probes and Analysis
Five AMP-side probes elucidate the operational dynamics underpinning FLaG's gains:
- Sequence-Frequency Band Knockout: DCT-based analysis shows low-frequency bands dominate predictive contribution, with higher-frequency residuals being sample-specific.
- Gate Spectral Effect: Gate operation broadly amplifies the frequency spectrum while preserving low-frequency dominance—energy is concentrated in the lowest bands both pre- and post-gate.
- Single-Residue Knockout: FLaG and attention-based pooling enable differentiated position-wise response profiles, retaining localized residue information absent in mean pooling.
- Latent-Query Readout: Cross-attention patterns are sample-specific, with mild query-wise differentiation; latent queries somewhat downweight the purely DC component, avoiding collapse onto the highest-energy bin.
- Structure-Proxy Stratification: High helix-propensity peptides show stronger average band sensitivity, supporting spectral stratification of peptide structure without elucidating underlying biological mechanisms.
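The band-knockout probe can be illustrated with a small NumPy sketch: transform the token sequence with an orthonormal DCT-II along the sequence axis, zero one band of coefficients, invert, and measure how much a prediction changes. The band boundaries and the `predict` callable are hypothetical stand-ins, and the paper's exact probing protocol may differ.

```python
import numpy as np

def dct_matrix(T):
    """Orthonormal DCT-II matrix C of shape (T, T), so that C @ C.T is the identity."""
    n = np.arange(T)
    C = np.sqrt(2.0 / T) * np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / T)
    C[0] /= np.sqrt(2.0)
    return C

def knock_out_band(x, lo, hi):
    """Zero DCT coefficients lo..hi-1 along the sequence axis of x with shape (T, D)."""
    C = dct_matrix(x.shape[0])
    coeffs = C @ x
    coeffs[lo:hi] = 0.0
    return C.T @ coeffs          # inverse transform (C is orthonormal)

def band_sensitivity(x, predict, bands):
    """Absolute change in a scalar prediction after removing each band."""
    base = predict(x)
    return [abs(predict(knock_out_band(x, lo, hi)) - base) for lo, hi in bands]

x = np.random.default_rng(0).normal(size=(8, 3))
toy_predict = lambda z: float(z.max(axis=0).sum())   # hypothetical stand-in model
print(band_sensitivity(x, toy_predict, [(0, 2), (2, 5), (5, 8)]))
```

Removing only coefficient 0 subtracts each channel's mean over the sequence, which is the DCT analogue of removing the DC component discussed in the latent-query probe.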
Ablation and Limitations
Ablation establishes FLaG's full block as the optimal variant in cross-domain pooling, particularly on S. aureus and CIFAR-100, with FFT+gate accounting for a sizeable fraction of the gains. Limitations include dependence on regular sampling and zero-padding, FFT-induced computational overhead, and variance in corruption-side robustness.
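One way to see why zero-padding matters is that padding changes both the number of rFFT bins and how energy spreads across them. The sketch below uses a toy sinusoid, not data from the paper, and the helper name is hypothetical.

```python
import numpy as np

def active_bins(signal, tol=1e-6):
    """Count rFFT bins whose magnitude exceeds tol."""
    return int(np.sum(np.abs(np.fft.rfft(signal)) > tol))

t = np.arange(16)
x = np.sin(2 * np.pi * 3 * t / 16)            # exactly 3 cycles in 16 samples
x_padded = np.concatenate([x, np.zeros(8)])   # zero-padded to length 24

print(np.fft.rfft(x).shape, np.fft.rfft(x_padded).shape)  # (9,) (13,)
print(active_bins(x))         # 1: all energy sits in bin 3
print(active_bins(x_padded))  # more than one bin: energy leaks across bins
```

Sequences of different true lengths that are padded to a common length therefore present differently shaped spectra to the latent queries, even when their underlying periodic content is the same.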
Implications and Future Directions
Practically, FLaG sets a precedent for integrating frequency-domain re-expression and gating in neural sequence aggregation, achieving tangible improvements in structure-sensitive tasks (AMP prediction, CIFAR-100). Theoretical implications extend to the role of spectral information as an internal reference frame, diverging from external anchor-based methodologies. Future work should investigate algorithmic efficiency (FFT alternatives for variable and irregular inputs), corruption-side robustness in vision/language domains, and deeper biological interpretation for protein/peptide settings. Expansion to non-sequential modalities or heterogeneous token structures may further generalize FLaG's aggregation bias.
Conclusion
FLaG advances the token aggregation paradigm by embedding learnable frequency-domain selection and gating, yielding domain-transferable performance benefits most clearly in structure-sensitive protein and vision tasks. While its strongest evidence is peptide-side, the robustness across domains establishes FLaG as a competitive aggregation module deserving further exploration in sequence-centric machine learning architectures (2606.08191).