Implicit Regularization of Gradient Flow on One-Layer Softmax Attention (2403.08699v1)

Published 13 Mar 2024 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: We study gradient flow on the exponential loss for a classification problem with a one-layer softmax attention model, where the key and query weight matrices are trained separately. Under a separability assumption on the data, we show that when gradient flow achieves the minimal loss value, it further implicitly minimizes the nuclear norm of the product of the key and query weight matrices. Such implicit regularization can be described by a Support Vector Machine (SVM) problem with respect to the attention weights. This finding contrasts with prior results showing that gradient descent induces an implicit regularization of the Frobenius norm of the product weight matrix when the key and query matrices are combined into a single weight matrix for training. For diagonal key and query matrices, our analysis builds on a reparameterization technique and exploits approximate KKT conditions of the SVM associated with the classification data. Moreover, the results are extended to general weight configurations, given proper alignment of the weight matrices' singular spaces with the data features at initialization.
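
To make the abstract's claim concrete, the following is a minimal sketch of the objects involved. The notation is illustrative, following attention-SVM formulations from prior work on this topic; the prediction head $v$, the token/query symbols $X_i$, $z_i$, $x_{i,t}$, and the exact parameterization are assumptions for exposition, not the paper's verbatim definitions.

A one-layer softmax attention head with separately trained key and query matrices $K, Q$, trained on the exponential loss over labeled data $\{(X_i, z_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{\pm 1\}$, can be written as
\[
  f_{K,Q}(X, z) \;=\; v^\top X^\top \operatorname{softmax}\!\bigl(X K Q^\top z\bigr),
  \qquad
  \mathcal{L}(K, Q) \;=\; \frac{1}{n} \sum_{i=1}^{n} \exp\bigl(-y_i\, f_{K,Q}(X_i, z_i)\bigr).
\]
The implicit regularization described above says that, once the separable data lets gradient flow drive $\mathcal{L}$ to its infimum, the combined attention weights $W = K Q^\top$ (suitably normalized) are characterized by a nuclear-norm attention SVM of the schematic form
\[
  \min_{W}\; \|W\|_{*}
  \quad \text{s.t.} \quad
  \bigl(x_{i,\mathrm{opt}} - x_{i,t}\bigr)^\top W\, z_i \;\ge\; 1
  \quad \text{for every example } i \text{ and every non-selected token } t,
\]
whereas training the combined matrix $W$ directly yields the analogous program with the Frobenius norm $\|W\|_F$ in place of the nuclear norm $\|W\|_{*}$.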
