Transformers as Support Vector Machines
The paper "Transformers as Support Vector Machines" explores the theoretical underpinnings of the transformer architecture, particularly focusing on the self-attention mechanism. It establishes a formal correspondence between the optimization dynamics of transformers, specifically the self-attention layer, and the classical framework of Support Vector Machines (SVMs), particularly the hard-margin SVM problem.
Main Contributions
- SVM Equivalence: The authors show that the optimization geometry of the self-attention layer closely mirrors a hard-margin SVM. Just as an SVM separates data points with a maximum-margin hyperplane, attention separates the token it should attend to from the remaining tokens of each sequence, and this separation can be written as linear constraints on the combined key-query weights applied to pairs of tokens.
- Implicit Bias and Gradient Descent: A central result concerns the implicit bias of training. When the attention weights are parameterized as a product of key and query matrices and trained with gradient descent under vanishing regularization, they converge toward the SVM solution of minimum nuclear norm; when the combined attention weights are optimized directly, the limit instead minimizes the Frobenius norm.
- Convergence Characteristics: The paper analyzes the convergence of gradient descent on the attention weights. Under suitable assumptions, the iterates converge in direction to an SVM solution, though not necessarily the globally optimal one: initialization, over-parameterization, and the geometry of the input tokens determine whether gradient descent reaches the global max-margin solution or a locally optimal one.
- General SVM Framework for Nonlinear Heads: Moving beyond linear prediction heads to multilayer perceptrons (MLPs), the paper characterizes a more general SVM-like problem. In this setting, attention is predicted to compose several tokens per sequence rather than selecting a single optimal one, with the composition shaped by the nonlinear head.
- Experimental Validation: Comprehensive numerical experiments confirm that the SVM equivalence predicts the behavior of trained attention weights, with both linear and nonlinear prediction heads. A toy version of this kind of comparison is sketched after this list.
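As a rough illustration of that comparison (not the paper's code or experimental setup), the sketch below trains a single attention layer with plain gradient descent on random data and checks how closely the learned weight matrix aligns in direction with the corresponding hard-margin SVM solution, in the spirit of directional convergence W(t)/||W(t)|| toward W_svm/||W_svm||. The dataset, the logistic loss, the token-score rule, and the use of cvxpy to solve the SVM are all choices made for this sketch.

```python
# Toy check of the attention/SVM correspondence described above.
# A minimal sketch under assumed notation, not the paper's experiments.
import numpy as np
import cvxpy as cp  # used only to solve the hard-margin QP

rng = np.random.default_rng(0)
n, T, d = 6, 4, 5                     # sequences, tokens per sequence, embedding dim
X = rng.standard_normal((n, T, d))    # key/value tokens x_{i,t}
Z = rng.standard_normal((n, d))       # query tokens z_i
v = rng.standard_normal(d)            # fixed linear prediction head
Y = rng.choice([-1.0, 1.0], size=n)   # binary labels

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def grad(W):
    """Gradient in W of the average logistic loss of f_i = v^T X_i^T softmax(X_i W z_i)."""
    g = np.zeros_like(W)
    for i in range(n):
        s = softmax(X[i] @ W @ Z[i])              # attention weights over the T tokens
        gamma = X[i] @ v                          # per-token scores v^T x_{i,t}
        f = s @ gamma                             # scalar prediction
        dloss_df = -Y[i] / (1.0 + np.exp(Y[i] * f))
        ds = dloss_df * s * (gamma - s @ gamma)   # backprop through the softmax
        g += X[i].T @ np.outer(ds, Z[i])          # d(X W z)/dW contributes x_t z^T per token
    return g / n

# Plain gradient descent on the combined attention weights W.
W = np.zeros((d, d))
for _ in range(20000):
    W -= 0.5 * grad(W)

# Hard-margin SVM-style problem: for each sequence, separate the
# highest-scoring token from the rest with a margin of one.
opt = [int(np.argmax(Y[i] * (X[i] @ v))) for i in range(n)]
W_svm = cp.Variable((d, d))
cons = [(X[i, opt[i]] - X[i, t]) @ W_svm @ Z[i] >= 1
        for i in range(n) for t in range(T) if t != opt[i]]
cp.Problem(cp.Minimize(cp.sum_squares(W_svm)), cons).solve()

# Directional alignment of the gradient-descent iterate with the SVM solution.
a, b = W.ravel(), W_svm.value.ravel()
print("cosine(W_gd, W_svm) =", a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

On a separable toy problem like this, the cosine similarity should creep toward one as training continues, although with unfavorable token geometry the iterates may instead align with a locally optimal SVM direction, matching the convergence caveats above.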
Implications and Future Directions
- Improved Understanding of Attention Dynamics: This theoretical framework sharpens the understanding of self-attention, suggesting that attention layers inherently act as token-selecting classifiers whose training dynamics parallel max-margin classification.
- Optimization and Training Efficiency: The convergence and implicit-bias results could inform training strategies for large transformer models, for example in how attention weights are parameterized, initialized, and regularized, since the analysis shows these choices shape the solution attention converges to.
- Generalization to Complex Architectures: Future research could extend these findings to multilayer and multi-head attention architectures, probing how hierarchical SVM-like processes emerge within complex transformers.
- Cross-domain Applications: Given the parallels with SVM, these insights might generalize beyond NLP applications to computer vision and other domains where attention mechanisms are increasingly utilized.
In conclusion, the paper "Transformers as Support Vector Machines" draws a precise theoretical connection between transformer attention and the classical max-margin framework, one that is likely to guide future work on the optimization and theory of attention-based models.