Focus Your Attention (with Adaptive IIR Filters) (2305.14952v2)

Published 24 May 2023 in cs.LG and eess.SP

Abstract: We present a new layer in which dynamic (i.e., input-dependent) Infinite Impulse Response (IIR) filters of order two are used to process the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The new layer is grounded in control theory, and is shown to generalize diagonal state-space layers. The layer performs on par with state-of-the-art networks, with a fraction of their parameters and with time complexity that is sub-quadratic with input size. The obtained layer is favorable to layers such as Hyena, GPT2, and Mega, both with respect to the number of parameters and the obtained level of performance on multiple long-range sequence problems.

Summary

  • The paper introduces a novel Focus layer that uses adaptive order-two IIR filters, dynamically generated by a hypernetwork, to pre-process inputs before local attention.
  • The methodology employs efficient frequency-domain filtering combined with chunked causal local attention, achieving near-linear complexity for long sequences.
  • Performance tests show that Focus matches or surpasses models like Hyena, MEGA, and GPT2 while using fewer parameters and lower peak memory.

This paper introduces Focus, a novel layer for sequence modeling that combines local attention with data-dependent Infinite Impulse Response (IIR) filters to efficiently handle long-range dependencies. The core idea is to pre-process the input sequence using adaptive filters before applying a standard, chunked local attention mechanism. This approach aims to overcome the limitations of traditional Transformers, such as their quadratic complexity with respect to sequence length and their often poor performance on tasks requiring very long contexts.

The Focus layer takes an input sequence $x \in \mathbb{R}^{L \times D}$, where $L$ is the sequence length and $D$ is the feature dimension. It consists of several key components:

  1. Adaptive IIR Filters: The central element is the use of dynamic (input-dependent) IIR filters of order two. These filters are applied globally along the sequence dimension. Unlike Finite Impulse Response (FIR) filters, IIR filters utilize feedback, allowing them to model longer dependencies with fewer parameters and achieve sharper frequency responses.
  2. Hypernetwork: The coefficients of the IIR filters are not fixed parameters but are dynamically generated for each input sequence by a hypernetwork $H$. This hypernetwork processes the input sequence $x$ to produce the filter coefficients $\Theta$. The hypernetwork itself uses a global convolution layer followed by adaptive max pooling to generate a sequence embedding, which is then mapped to the filter coefficients by a small Multi-Layer Perceptron (MLP); a minimal sketch of this module appears after this list.
  3. Causality: To maintain causality, which is important for auto-regressive tasks, the input sequence is split into non-overlapping "time bins" or chunks. The hypernetwork computes the filter coefficients for time bin $i$ based on information from time bin $i-1$ (by shifting the coefficients).
  4. Filtering in Frequency Domain: The IIR filtering operation is performed efficiently in the frequency domain. The input sequence, chunked into time bins, is transformed using the Fast Fourier Transform (FFT) for each bin. The adaptive IIR filter's frequency response is calculated based on the dynamically generated coefficients. Filtering is then achieved by element-wise multiplication of the input's frequency representation with the conjugated filter response. After filtering, an Inverse FFT (IFFT) transforms the signal back to the time domain.
  5. Chunking and Local Attention: The filtered sequence is split into non-overlapping chunks. Standard local self-attention with causal masking is applied independently to each chunk. This maintains computational efficiency by avoiding the quadratic cost of global attention.
  6. Gating Mechanism: Following the attention, the filtered input and the attention output are combined using a gating mechanism similar to the one in the MEGA model, which uses sigmoid-weighted linear units (SiLU) and sigmoid gates.
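
To make the hypernetwork concrete, here is a minimal PyTorch-style sketch of a coefficient generator, assuming a depthwise convolution as a stand-in for the paper's global convolution; the module names, kernel size, hidden width, and first-bin default are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class IIRHyperNet(nn.Module):
    """Hedged sketch: maps x of shape (B, L, D) to order-2 IIR coefficients
    of shape (B, n_bins, D, n_filters, 2), each restricted to (0, 1)."""

    def __init__(self, d_model: int, n_bins: int, n_filters: int, hidden: int = 64):
        super().__init__()
        self.n_bins, self.n_filters = n_bins, n_filters
        # Stand-in for the paper's global convolution: a depthwise Conv1d.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=7, padding=3, groups=d_model)
        # Adaptive max pooling reduces the sequence to one embedding per time bin.
        self.pool = nn.AdaptiveMaxPool1d(n_bins)
        # A small MLP maps each bin embedding to two coefficients per feature and filter.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.SiLU(),
            nn.Linear(hidden, d_model * n_filters * 2),
            nn.Sigmoid(),  # keeps coefficients in (0, 1), as described for stability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        h = self.conv(x.transpose(1, 2))            # (B, D, L)
        h = self.pool(h).transpose(1, 2)            # (B, n_bins, D)
        theta = self.mlp(h).view(B, self.n_bins, D, self.n_filters, 2)
        # Shift by one bin so that bin i uses coefficients computed from bin i-1.
        first = torch.full_like(theta[:, :1], 0.5)  # hypothetical default for the first bin
        return torch.cat([first, theta[:, :-1]], dim=1)
```

For an input `x` of shape `(B, L, 512)`, `IIRHyperNet(d_model=512, n_bins=64, n_filters=4)(x)` would return a `(B, 64, 512, 4, 2)` coefficient tensor.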

Implementation Details:

  • The order of the IIR filters is specifically chosen as two because higher orders can introduce stability issues, while order two filters, with positive coefficients restricted between 0 and 1 (achieved via a sigmoid activation in the hypernetwork's final layer), are guaranteed to be stable. The paper analyzes the frequency response and stability of these filters.
  • The filtering process involves the following steps (code sketches of the filtering, attention, and gating steps follow this list):
    • Splitting the input $x$ into time bins $x_r$.
    • Computing the FFT of each bin: $X[\omega, r] = \text{FFT}(x_r)$.
    • Generating, via the hypernetwork $H(x)$, coefficients $\Theta$ for $N_{bins}$ time bins, $D$ features, and $F$ filters, each of size 2.
    • Shifting $\Theta$ by one bin to preserve causality.
    • For each time bin $r$ and feature $d$, applying the filter: $X_f[\omega, r, d, f] = X[\omega, r, d] \cdot IIR_{imp}^*(\omega, \Theta[r-1, d, f])$.
    • Summing over the filter-bank dimension $F$: $X_c = \sum_f X_f$.
    • Computing the IFFT: $x_f = \text{IFFT}(X_c)$.
  • The filtered sequence $x_f$ is then chunked into $x_f^i$ and fed to local attention: $y^i = \text{Atten}(Qx^i, Kx_f^i, Vx_f^i)$, where $Q, K, V$ are learned projection matrices.
  • The final output $o$ is computed from the filtered input $x_f$ and the attention output $y$ using update and reset gates.
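
As a rough, self-contained illustration of the filtering steps above, the sketch below evaluates per-bin adaptive order-two IIR responses in the frequency domain. The factored all-pole transfer function $1/((1 - a z^{-1})(1 - b z^{-1}))$ with $a, b \in (0, 1)$ is an assumption made here so the sketch is stable by construction; the paper's exact parameterization of $IIR_{imp}$ may differ.

```python
import torch

def adaptive_iir_filter(x: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Frequency-domain application of per-bin adaptive order-2 IIR filters.

    x:     (B, L, D) input sequence.
    theta: (B, n_bins, D, F, 2) coefficients in (0, 1), already shifted so
           that bin r uses coefficients computed from bin r-1.
    Assumes the factored all-pole response 1 / ((1 - a z^-1)(1 - b z^-1));
    this is an illustrative choice, not necessarily the paper's exact form.
    """
    B, L, D = x.shape
    _, n_bins, _, F, _ = theta.shape
    M = L // n_bins
    xr = x.view(B, n_bins, M, D)                         # split into time bins

    X = torch.fft.rfft(xr, dim=2)                        # (B, n_bins, M//2+1, D)
    omega = torch.arange(X.shape[2], device=x.device) * (2 * torch.pi / M)
    z_inv = torch.exp(-1j * omega).view(1, 1, -1, 1, 1)  # e^{-i*omega}

    a = theta[..., 0].unsqueeze(2)                       # (B, n_bins, 1, D, F)
    b = theta[..., 1].unsqueeze(2)
    # Frequency response per (bin, frequency, feature, filter).
    H = 1.0 / ((1 - a * z_inv) * (1 - b * z_inv))
    # Multiply by the conjugated response and sum over the filter bank F.
    Xc = (X.unsqueeze(-1) * H.conj()).sum(dim=-1)        # (B, n_bins, M//2+1, D)

    xf = torch.fft.irfft(Xc, n=M, dim=2)                 # back to the time domain
    return xf.reshape(B, L, D)
```

Multiplying in the frequency domain of each bin corresponds to circular convolution within that bin, which is how the element-wise multiplication described above realizes the filtering efficiently. Continuing the sketch, the chunked causal local attention and the gated combination could look roughly as follows; the projection modules, chunk size, and the exact gate wiring (loosely modeled on MEGA-style gating) are illustrative assumptions rather than the paper's precise equations.

```python
import torch
import torch.nn as nn

def chunked_local_attention(x, xf, q_proj, k_proj, v_proj, chunk=128):
    """Causal self-attention applied independently within non-overlapping chunks.

    Queries come from the raw input x and keys/values from the filtered
    sequence xf, following y^i = Atten(Q x^i, K x_f^i, V x_f^i).
    Assumes L is divisible by the chunk size.
    """
    B, L, D = x.shape
    xc = x.view(B, L // chunk, chunk, D)
    xfc = xf.view(B, L // chunk, chunk, D)
    q, k, v = q_proj(xc), k_proj(xfc), v_proj(xfc)
    scores = q @ k.transpose(-2, -1) / D ** 0.5           # (B, C, M, M)
    mask = torch.triu(torch.ones(chunk, chunk, dtype=torch.bool, device=x.device), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))       # causal mask within each chunk
    y = torch.softmax(scores, dim=-1) @ v
    return y.reshape(B, L, D)

def gated_output(xf, y, update_gate, reset_gate, out_proj):
    """Hedged MEGA-style gating; the paper's exact formulation may differ."""
    u = torch.sigmoid(update_gate(xf))          # update gate from the filtered input
    r = torch.sigmoid(reset_gate(xf))           # reset gate from the filtered input
    h = nn.functional.silu(out_proj(r * y))     # SiLU-activated candidate output
    return u * h + (1 - u) * xf                 # blend candidate with the filtered input
```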

Analysis and Practical Implications:

  • Stability: The paper shows that with positive coefficients, order two IIR filters are stable, ensuring that the impulse response decays over time, a property identified as crucial for long-range modeling.
  • Expressiveness: The analysis demonstrates that diagonal State Space Models (SSMs) or diagonal linear RNNs, which have shown strong performance on long sequences, can be seen as constrained versions of order 1 or 2 IIR filters. This provides a theoretical link and suggests the potential expressiveness of IIR filters.
  • Time Complexity: The Focus layer achieves sub-quadratic time complexity $O(DL(\log L + M))$, where $M$ is the chunk size for local attention. This is dominated by the global convolution in the hypernetwork, at $O(DL\log L)$, and the local attention, at $O(CM^2D)$, which becomes $O(LMD)$ since $C = L/M$. If $M$ is small, this is close to linear in $L$ (a back-of-the-envelope instance follows this list).
  • Efficiency and Performance: Experiments demonstrate that Focus performs on par with or better than state-of-the-art models such as Hyena, MEGA, and GPT2 on long-range associative recall, language modeling (enwik8, Text8), and 1D image classification (sMNIST, pMNIST). Crucially, it achieves this with significantly fewer parameters than models like GPT2 and outperforms an ablation without the adaptive hypernetwork ("Focus-H"), highlighting the value of dynamic filtering. It also shows lower peak memory usage than the baselines.
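
To make the complexity claim concrete, here is a back-of-the-envelope instance with illustrative values that are not taken from the paper, counted per feature channel, for $L = 8192$ and chunk size $M = 128$:

```latex
% Illustrative operation counts per feature channel (D = 1).
\begin{align*}
\text{global convolution (FFT-based)}     &:\quad L \log_2 L = 8192 \times 13  \approx 1.1 \times 10^{5} \\
\text{chunked local attention}            &:\quad L M        = 8192 \times 128 \approx 1.0 \times 10^{6} \\
\text{full global attention (reference)}  &:\quad L^{2}      = 8192^{2}        \approx 6.7 \times 10^{7}
\end{align*}
```

Even with this comparatively large chunk size, the dominant $LM$ term is roughly $64\times$ smaller than the quadratic cost of full attention at this length.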

Limitations:

  • The paper focuses on integrating IIR filters into a hybrid architecture and does not explore models based solely on IIR filters.
  • Different types of global convolutions within the hypernetwork were not extensively studied.
  • While the recurrent form of IIR filters can offer efficiency for auto-regressive inference, the presented implementation focuses on the convolutional form for training and does not detail the recurrent computation for inference speed-up.

In summary, Focus offers a practical and effective approach to building efficient long-sequence models by introducing a novel layer that employs data-dependent, stable IIR filters guided by a hypernetwork to process inputs before local attention. Its competitive performance and efficiency across various long-range tasks make it a promising alternative to existing architectures.