Focus Your Attention (with Adaptive IIR Filters) (2305.14952v2)
Abstract: We present a new layer in which dynamic (i.e., input-dependent) Infinite Impulse Response (IIR) filters of order two are used to process the input sequence prior to applying conventional attention. The input is split into chunks, and the coefficients of these filters are determined based on previous chunks to maintain causality. Despite their relatively low order, the causal adaptive filters are shown to focus attention on the relevant sequence elements. The new layer is grounded in control theory and is shown to generalize diagonal state-space layers. The layer performs on par with state-of-the-art networks, with a fraction of their parameters and with time complexity that is sub-quadratic in the input size. The obtained layer compares favorably to layers such as Hyena, GPT2, and Mega, both with respect to the number of parameters and the level of performance obtained on multiple long-range sequence problems.
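The sketch below illustrates the mechanism the abstract describes: the sequence is split into chunks, a small hypernetwork predicts second-order IIR coefficients from a summary of the *previous* chunk (preserving causality), the filter is applied to the current chunk, and conventional attention follows. This is a minimal illustration, not the authors' implementation; the module names, the mean-pooled chunk summary, and the coefficient-bounding heuristic are assumptions made here for clarity.

```python
# Minimal sketch (not the paper's code) of chunked, input-dependent
# second-order IIR filtering followed by standard causal attention.
import torch
import torch.nn as nn


def iir2(x, a1, a2, b0):
    # Order-two IIR filter along the time axis of one chunk:
    #   y[t] = b0 * x[t] - a1 * y[t-1] - a2 * y[t-2]
    # x: (batch, time, dim); a1, a2, b0: (batch, 1, dim)
    y1 = torch.zeros_like(x[:, 0])
    y2 = torch.zeros_like(x[:, 0])
    out = []
    for t in range(x.shape[1]):
        y = b0[:, 0] * x[:, t] - a1[:, 0] * y1 - a2[:, 0] * y2
        out.append(y)
        y2, y1 = y1, y
    return torch.stack(out, dim=1)


class AdaptiveIIRAttention(nn.Module):
    """Hypothetical layer: adaptive IIR pre-filtering + multi-head attention."""

    def __init__(self, dim, chunk_len, num_heads=4):
        super().__init__()
        self.chunk_len = chunk_len
        # Hypernetwork: previous-chunk summary -> 3 coefficients per channel.
        self.coeff_net = nn.Sequential(
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 3 * dim)
        )
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, dim); seq_len assumed divisible by chunk_len.
        b, n, d = x.shape
        chunks = x.view(b, n // self.chunk_len, self.chunk_len, d)
        filtered = []
        prev_summary = torch.zeros(b, d, device=x.device, dtype=x.dtype)
        for i in range(chunks.shape[1]):
            a1, a2, b0 = self.coeff_net(prev_summary).chunk(3, dim=-1)
            # Rough stability heuristic (not the paper's parameterization):
            # loosely bound the feedback coefficients.
            a1, a2 = 2 * torch.tanh(a1), torch.tanh(a2)
            filtered.append(
                iir2(chunks[:, i], a1.unsqueeze(1), a2.unsqueeze(1), b0.unsqueeze(1))
            )
            # Causality: the summary is built only from chunks already seen.
            prev_summary = chunks[:, i].mean(dim=1)
        y = torch.cat(filtered, dim=1)
        mask = torch.triu(torch.ones(n, n, device=x.device, dtype=torch.bool), 1)
        out, _ = self.attn(y, y, y, attn_mask=mask)
        return out


# Usage example
layer = AdaptiveIIRAttention(dim=64, chunk_len=16)
print(layer(torch.randn(2, 128, 64)).shape)  # torch.Size([2, 128, 64])
```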
- Arij Al Adel. 2022. Global memory transformer for processing long documents. In Advances in Neural Computation, Machine Learning, and Cognitive Research VI: Selected Papers from the XXIV International Conference on Neuroinformatics, October 17-21, 2022, Moscow, Russia, pages 343–352. Springer.
- Nuha A. S. Alwan and Zahir M. Hussain. 2022. Deep learning for robust adaptive inverse control of nonlinear dynamic systems: Improved settling time with an autoencoder. Sensors, 22(16).
- Learning feed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages 523–531.
- Scaling transformer to 1M tokens and beyond with RMT. arXiv preprint arXiv:2304.11062.
- Mikhail S Burtsev. Memory transformer with hierarchical attention for long document processing.
- Principled weight initialization for hypernetworks. In Int. Conf. on Learning Representations.
- Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
- Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
- Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052.
- Decision s4: Efficient sequence-based rl via state spaces layers. In The Eleventh International Conference on Learning Representations.
- Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11.
- A mathematical framework for transformer circuits. Transformer Circuits Thread.
- A practical survey on faster and lighter transformers. ACM Computing Surveys.
- Simple hardware-efficient long convolutions for sequence modeling. arXiv preprint arXiv:2302.06646.
- Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.
- It’s raw! audio generation with state-space models. In International Conference on Machine Learning, pages 7616–7633. PMLR.
- On the parameterization and initialization of diagonal state space models. Advances in Neural Information Processing Systems, 35:35971–35983.
- Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
- Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585.
- Ankit Gupta and Jonathan Berant. 2020. GMAT: Global memory augmentation for transformers. arXiv preprint arXiv:2006.03274.
- Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems, 35:22982–22994.
- Simplifying and understanding state space models with diagonal linear rnns. arXiv preprint arXiv:2212.00768.
- Hypernetworks. arXiv preprint arXiv:1609.09106.
- Thomas Haubner and Walter Kellermann. 2022. Deep learning-based joint control of acoustic echo cancellation, beamforming and postfiltering.
- Hyperprompt: Prompt-based task-conditioning of transformers. In International Conference on Machine Learning, pages 8678–8690. PMLR.
- Block-recurrent transformers. arXiv preprint arXiv:2203.07852.
- Efficient movie scene detection using state-space transformers. arXiv preprint arXiv:2212.14427.
- Efficient long-text understanding with short-text models. Transactions of the Association for Computational Linguistics, 11:284–299.
- Rudolph Emil Kalman. 1960. A new approach to linear filtering and prediction problems.
- Differentiable IIR filters for machine learning applications. In Proc. Int. Conf. Digital Audio Effects (eDAFx-20), pages 297–303.
- What makes convolutional models great on long sequence modeling? arXiv preprint arXiv:2210.09298.
- Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in Adam.
- Shahar Shlomo Lutati and Lior Wolf. 2023. Ocd: Learning to overfit with conditional diffusion models. In ICML.
- Mega: moving average equipped gated attention. arXiv preprint arXiv:2209.10655.
- Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947.
- Resurrecting recurrent neural networks for long sequences. arXiv preprint arXiv:2303.06349.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866.
- PyTorch documentation. AdaptiveMaxPool2d.
- Language models are unsupervised multitask learners.
- KalmanNet: Neural network aided kalman filtering for partially known dynamics. IEEE Transactions on Signal Processing, 70:1532–1547.
- Ckconv: Continuous kernel convolution for sequential data. arXiv preprint arXiv:2102.02611.
- Diagonal state space augmented transformers for speech recognition. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE.
- Hypersound: Generating implicit neural representations of audio signals with hypernetworks. arXiv preprint arXiv:2211.01839.
- Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.
- Efficient transformers: A survey. ACM Computing Surveys, 55(6):1–28.
- State spaces aren’t enough: Machine translation needs attention. ArXiv, abs/2304.12776.
- Attention is all you need. Advances in neural information processing systems, 30.
- Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695.
- Selective structured state-spaces for long-form video understanding. arXiv preprint arXiv:2303.14526.
- Pretraining without attention. arXiv preprint arXiv:2212.10544.
- Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- Lightweight and efficient end-to-end speech recognition using low-rank transformer. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6144–6148. IEEE.
- Simple local attentions remain competitive for long-context tasks. arXiv preprint arXiv:2112.07210.
- Megabyte: Predicting million-byte sequences with multiscale transformers. arXiv preprint arXiv:2305.07185.
- Graph hypernetworks for neural architecture search. In 7th International Conference on Learning Representations, ICLR 2019.
- Hao Zhang and DeLiang Wang. 2021. Deep anc: A deep learning approach to active noise control. Neural Networks, 141:1–10.
- Effectively modeling time series with simple discrete state spaces. arXiv preprint arXiv:2303.09489.
- AdaNN: Adaptive neural network-based equalizer via online semi-supervised learning. Journal of Lightwave Technology, 38(16):4315–4324.
- Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In International Conference on Machine Learning, pages 27268–27286. PMLR.
- Efficient long sequence modeling via state space augmented transformer. arXiv preprint arXiv:2212.08136.