How Smooth Is Attention? (2312.14820v2)
Abstract: Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robustness and expressive power - is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
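To make the abstract's scaling claim concrete, below is a minimal numerical sketch (not the authors' code or experimental protocol) that estimates the local Lipschitz constant of a single unmasked self-attention head by computing the spectral norm of its Jacobian at random inputs, for several sequence lengths n. The head dimension, weight initialization, and Gaussian input distribution are illustrative assumptions; the paper's bounds are stated over compact input sets, so the printed ratios only illustrate what the sqrt(n) bound refers to and are not a verification of it.

```python
# Minimal sketch, assuming a single-head, unmasked self-attention layer with
# random weights. Estimates the local Lipschitz constant as the largest
# spectral norm of the Jacobian over a handful of random inputs, for
# increasing sequence length n.
import torch

torch.manual_seed(0)
d = 8  # embedding / head dimension (illustrative assumption)

# Random query, key, and value maps for one attention head.
Wq, Wk, Wv = (torch.randn(d, d) / d**0.5 for _ in range(3))

def self_attention(X):
    """Unmasked single-head self-attention applied to an (n, d) sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = torch.softmax(Q @ K.T / d**0.5, dim=-1)  # (n, n) attention weights
    return A @ V

def local_lipschitz_estimate(n, trials=20):
    """Max Jacobian spectral norm over random length-n inputs."""
    best = 0.0
    for _ in range(trials):
        # Standard Gaussian inputs (the paper's bounds hold on compact sets).
        X = torch.randn(n, d)
        J = torch.autograd.functional.jacobian(self_attention, X)
        J = J.reshape(n * d, n * d)
        best = max(best, torch.linalg.matrix_norm(J, ord=2).item())
    return best

for n in [2, 4, 8, 16, 32, 64]:
    est = local_lipschitz_estimate(n)
    print(f"n={n:3d}  local Lipschitz estimate ~ {est:.3f}"
          f"  (estimate / sqrt(n) = {est / n**0.5:.3f})")
```

The observed growth depends on the weight scale, the input distribution, and the number of random trials, so this sketch should be read as a qualitative illustration of the n-dependence discussed in the abstract.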