How Smooth Is Attention? (2312.14820v2)

Published 22 Dec 2023 in cs.LG

Abstract: Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties - which are key when it comes to analyzing robustness and expressive power - is incomplete. We provide a detailed study of the Lipschitz constant of self-attention in several practical scenarios, discussing the impact of the sequence length $n$ and layer normalization on the local Lipschitz constant of both unmasked and masked self-attention. In particular, we show that for inputs of length $n$ in any compact set, the Lipschitz constant of self-attention is bounded by $\sqrt{n}$ up to a constant factor and that this bound is tight for reasonable sequence lengths. When the sequence length $n$ is too large for the previous bound to be tight, which we refer to as the mean-field regime, we provide an upper bound and a matching lower bound which are independent of $n$. Our mean-field framework for masked self-attention is novel and of independent interest. Our experiments on pretrained and randomly initialized BERT and GPT-2 support our theoretical findings.
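The $\sqrt{n}$ statement concerns the local Lipschitz constant of the self-attention map on compact input sets, i.e. the spectral norm of its Jacobian at a given input. As a purely illustrative sketch (not the authors' experimental setup), the JAX snippet below estimates this quantity at random inputs for growing sequence length $n$; the single-head attention with identity query/key/value projections, the head dimension `d`, and the Gaussian inputs are assumptions of this sketch only.

```python
# Illustrative sketch only: estimate the local Lipschitz constant of plain
# single-head self-attention (identity W_Q, W_K, W_V -- an assumption of this
# sketch) as the spectral norm of its Jacobian at random inputs.
import jax
import jax.numpy as jnp

def self_attention(X):
    # Rows of X are tokens; scaled dot-product attention without projections.
    scores = X @ X.T / jnp.sqrt(X.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ X

def local_lipschitz_estimate(X):
    # Spectral norm of the Jacobian of the flattened map, i.e. the local
    # Lipschitz constant of self-attention at X w.r.t. the Euclidean norm
    # on flattened inputs (the paper's bounds may use a different metric).
    f = lambda x: self_attention(x.reshape(X.shape)).ravel()
    J = jax.jacfwd(f)(X.ravel())
    return jnp.linalg.norm(J, ord=2)

key = jax.random.PRNGKey(0)
d = 8  # head dimension, assumed for this sketch
for n in (4, 16, 64, 256):
    key, sub = jax.random.split(key)
    X = jax.random.normal(sub, (n, d))  # random tokens in a bounded region
    print(n, float(local_lipschitz_estimate(X)))
```

Comparing the printed values against $\sqrt{n}$, and against an $n$-independent constant once $n$ is large, is one way to visualize the two regimes the abstract describes.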
