This paper investigates how a one-layer Transformer architecture, trained with gradient descent, can learn to perform optimal variable selection in a classification task known as "group-sparse linear classification".
Problem Setting: Group-Sparse Classification
- Data Generation: The input features $\Xb \in \mathbb{R}^{d \times D}$ consist of $D$ groups, each with $d$ features. Each group $\xb_j \in \mathbb{R}^d$ is drawn independently from a Gaussian distribution.
- Labeling: The true label depends only on the features from a single, predefined "label-relevant" group $j^*$. Specifically, $y = \text{sign}(\langle \xb_{j^*}, \vb^* \rangle)$, where $\vb^* \in \mathbb{R}^d$ is a fixed ground-truth direction vector (normalized to $\|\vb^*\|_2=1$). All other groups are irrelevant to the label.
- Input Representation: Each feature group $\xb_j$ is concatenated with a unique positional encoding $\pb_j \in \mathbb{R}^D$ (using orthogonal sine functions) to form the input token $\zb_j = [\xb_j^\top, \pb_j^\top]^\top \in \mathbb{R}^{d+D}$. The full input sequence is $\Zb = [\zb_1, \dots, \zb_D]$.
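Below is a minimal sketch of this data-generating process. It assumes standard Gaussian features with identity covariance and uses a DST-I sine basis as one concrete orthonormal sinusoidal encoding; the paper's exact covariance and encoding constants may differ, and the dimensions and index `j_star` are arbitrary placeholders.

```python
import numpy as np

def orthogonal_sine_encodings(D: int) -> np.ndarray:
    """Columns are mutually orthonormal sinusoidal encodings p_1, ..., p_D (a DST-I basis);
    this is one concrete orthogonal choice, not necessarily the paper's exact construction."""
    k = np.arange(1, D + 1)
    return np.sqrt(2.0 / (D + 1)) * np.sin(np.pi * np.outer(k, k) / (D + 1))

def sample_example(d, D, v_star, j_star, P, rng):
    """One (Z, y) pair: Z has shape (d + D, D) with columns z_j = [x_j; p_j]."""
    X = rng.standard_normal((d, D))        # x_j ~ N(0, I_d)  (assumed covariance)
    y = np.sign(X[:, j_star] @ v_star)     # label depends only on group j*
    Z = np.vstack([X, P])                  # stack positional encodings under the features
    return Z, y

rng = np.random.default_rng(0)
d, D = 16, 8                                                         # illustrative sizes
P = orthogonal_sine_encodings(D)
v_star = rng.standard_normal(d); v_star /= np.linalg.norm(v_star)    # ||v*||_2 = 1
j_star = 3                                                           # label-relevant group
Z, y = sample_example(d, D, v_star, j_star, P, rng)
```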
Model and Training
- Architecture: A simplified single-head, one-layer self-attention Transformer is used:
$f(\Zb, \Wb, \vb) = \sum_{j=1}^{D} \vb^\top \Zb \mathcal{S}(\Zb^\top \Wb \zb_j) = \vb^\top \Zb \mathbf{S} \mathbf{1}_D$.
- $\Wb \in \mathbb{R}^{(d+D) \times (d+D)}$ is a trainable matrix combining query and key transformations.
- $\vb \in \mathbb{R}^{d+D}$ is a trainable value vector.
- $\mathcal{S}(\cdot)$ is the softmax function applied column-wise, and $\mathbf{S}$ is the resulting attention score matrix, whose $(j', j)$ entry is the attention paid by token $j$ to token $j'$.
- Training: The model parameters $(\Wb, \vb)$ are trained jointly using gradient descent on the population cross-entropy loss $\mathcal{L}(\vb, \Wb) = \mathbb{E}_{(\Xb, y)\sim \mathcal{D}}[\log(1+\exp(-y \cdot f(\Zb, \Wb, \vb)))]$. Training starts from zero initialization ($\Wb^{(0)}=\mathbf{0}, \vb^{(0)}=\mathbf{0}$) with a shared learning rate for both parameters.
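A hedged PyTorch sketch of this model and training loop is below. It uses full-batch gradient descent on one large i.i.d. sample as a stand-in for the population loss (the theory works with exact population gradients), and the dimensions, learning rate, and iteration count are illustrative placeholders rather than the paper's choices.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, D, B = 16, 8, 4096                                  # dimensions and batch size: placeholders
k = torch.arange(1, D + 1, dtype=torch.float32)
P = (2.0 / (D + 1)) ** 0.5 * torch.sin(torch.pi * torch.outer(k, k) / (D + 1))  # orthonormal sine basis
v_star = F.normalize(torch.randn(d), dim=0)            # ground truth with ||v*||_2 = 1
j_star = 0                                             # label-relevant group

X = torch.randn(B, d, D)                               # x_j ~ N(0, I_d)
y = torch.sign(X[:, :, j_star] @ v_star)               # y = sign(<x_{j*}, v*>)
Z = torch.cat([X, P.expand(B, D, D)], dim=1)           # tokens z_j = [x_j; p_j], shape (B, d+D, D)

W = torch.zeros(d + D, d + D, requires_grad=True)      # combined query-key matrix, W^(0) = 0
v = torch.zeros(d + D, requires_grad=True)             # value vector, v^(0) = 0

def f(Z, W, v):
    """f(Z; W, v) = v^T Z S 1_D with S the column-wise softmax of Z^T W Z."""
    S = torch.softmax(Z.transpose(-2, -1) @ W @ Z, dim=-2)   # column j = softmax(Z^T W z_j)
    pooled = (Z @ S.sum(dim=-1, keepdim=True)).squeeze(-1)   # Z S 1_D, shape (B, d + D)
    return pooled @ v

eta, T = 0.1, 500                                      # learning rate / iterations: placeholders
for t in range(T):
    loss = F.softplus(-y * f(Z, W, v)).mean()          # log(1 + exp(-y f)), averaged over the batch
    gW, gv = torch.autograd.grad(loss, (W, v))
    with torch.no_grad():
        W -= eta * gW
        v -= eta * gv
```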
Key Theoretical Findings (Theorem 3.1 & Section 5)
Under mild conditions (e.g., problem parameters that are sufficiently large relative to the desired loss tolerance), the paper proves that gradient descent successfully trains the Transformer to learn the group-sparse structure:
- Optimal Variable Selection via Attention: The attention mechanism learns to isolate the relevant group $j^*$. After a sufficient number of training iterations $T^*$, the attention paid to token $j^*$ is close to 1 while the attention paid to every other token $j \neq j^*$ is close to 0, with high probability for any input $\Zb$. This means the model effectively "attends" only to the features from the correct group $\xb_{j^*}$ (see the diagnostic sketch after this list).
- Value Vector Alignment: The trainable value vector $\vb$ aligns correctly:
- The first block $\vb_1 \in \mathbb{R}^d$ (corresponding to features) aligns its direction with the ground-truth vector $\vb^*$, i.e., $\vb_1^{(T^*)} / \|\vb_1^{(T^*)}\|_2 \approx \vb^*$.
- The second block $\vb_2 \in \mathbb{R}^D$ (corresponding to positional encodings) remains approximately zero, $\vb_2^{(T^*)} \approx \mathbf{0}$. This ensures positional information is used for attention calculation but not directly included in the final output prediction.
- Loss Convergence: The population cross-entropy loss $\mathcal{L}(\vb^{(T^*)}, \Wb^{(T^*)})$ can be driven below any prescribed tolerance, with tight upper and lower bounds provided on the convergence rate.
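These predictions are straightforward to measure numerically. The diagnostic sketch below is written against the training sketch that follows the "Model and Training" list and reuses its variable names (`W`, `v`, `Z`, `v_star`, `j_star`, `d`); it illustrates what to check and is not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def check_learned_structure(W, v, Z, v_star, j_star, d):
    """Measure attention concentration, value alignment, and the positional value block."""
    S = torch.softmax(Z.transpose(-2, -1) @ W @ Z, dim=-2)      # (B, D, D), columns sum to 1
    attn_on_jstar = S[:, j_star, :].mean().item()               # should approach 1
    v1, v2 = v[:d], v[d:]                                       # feature block / positional block
    alignment = F.cosine_similarity(v1, v_star, dim=0).item()   # should approach 1
    v2_norm = v2.norm().item()                                  # should stay near 0
    return attn_on_jstar, alignment, v2_norm

# After the training loop above:
# print(check_learned_structure(W.detach(), v.detach(), Z, v_star, j_star, d))
```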
Mechanism Explained (Proof Sketch - Section 5)
The paper provides insights into how this learning happens by analyzing the structure of the learned weight matrix $\Wb^{(T^*)}$:
- Learned $\Wb$ Structure: Gradient descent drives $\Wb$ towards a specific block structure:
- $\Wb_{1,2}^{(T^*)} \approx \mathbf{0}$ and $\Wb_{2,1}^{(T^*)} \approx \mathbf{0}$: No direct interaction between feature vectors and positional encodings in the attention score calculation.
- $\Wb_{1,1}^{(T^*)}$ (feature-feature interaction) aligns primarily with $\vb^* \vb^{*\top}$.
- $\Wb_{2,2}^{(T^*)}$ (position-position interaction) develops a specific low-rank structure related to $(\pb_{j^*} - \pb_j)$ terms.
- Attention Focus: Due to the orthogonality of the chosen positional encodings $\pb_j$ and the learned structure of $\Wb_{2,2}^{(T^*)}$, the position-position interaction term $\pb_{j'}^\top \Wb_{2,2}^{(T^*)} \pb_j$ becomes significantly larger for $j' = j^*$ than for any other $j' \neq j^*$. This term dominates the softmax calculation, causing the attention weight on token $j^*$ to approach 1. The feature-feature interaction term $\xb_{j'}^\top \Wb_{1,1}^{(T^*)} \xb_j$ is shown to be smaller in magnitude.
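To make the dominance step concrete, the underlying softmax calculation is the following generic bound (a standard computation, not a quote of the paper's lemma): if, in the column of attention logits $a_{j'} = \zb_{j'}^\top \Wb^{(T^*)} \zb_j$ for query token $j$, the logit of token $j^*$ exceeds every other logit by at least $\Delta > 0$, then

$$
\big[\mathcal{S}(\Zb^\top \Wb^{(T^*)} \zb_j)\big]_{j^*} = \frac{e^{a_{j^*}}}{\sum_{j'=1}^{D} e^{a_{j'}}} = \frac{1}{1 + \sum_{j' \neq j^*} e^{a_{j'} - a_{j^*}}} \ge \frac{1}{1 + (D-1)e^{-\Delta}} \;\to\; 1 \quad \text{as } \Delta \to \infty,
$$

so a growing gap in the position-position term forces the attention weight on group $j^*$ toward 1.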
Transfer Learning Application (Section 4, Theorem 4.1)
The paper demonstrates the practical benefit of this learned structure for transfer learning:
- Downstream Task: Consider a new classification task with the same group-sparsity pattern (the label-relevant group $j^*$ is the same) but a potentially different data distribution (sub-Gaussian features, linearly separable with a positive margin).
- Fine-tuning: Initialize a new model with the pre-trained $\Wb^{(T^*)}$ (denoted $\tilde{\Wb}^{(0)}$) and $\tilde{\vb}^{(0)}=\mathbf{0}$. Fine-tune using online SGD on samples from the downstream task.
- Improved Sample Complexity: The average prediction error on the downstream task shrinks with the number of fine-tuning samples at a rate that is significantly better than the sample complexity required by a standard linear classifier trained on the full vectorized features, demonstrating the efficiency gained by reusing the learned variable selection mechanism.
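A minimal sketch of this fine-tuning protocol is below: warm-start from the pre-trained attention matrix, re-initialize the value vector to zero, and run one-pass online SGD. Here both $\tilde{\Wb}$ and $\tilde{\vb}$ are updated with one fresh sample per step; the helper `sample_downstream`, the learning rate, and the choice of which parameters to update are illustrative assumptions rather than the paper's exact protocol.

```python
import torch
import torch.nn.functional as F

def finetune(W_pretrained, sample_downstream, n_steps, d, D, lr=0.1):
    """One-pass online SGD warm-started from the pre-trained attention matrix.
    `sample_downstream()` is a hypothetical callable returning one (Z, y) pair
    with Z of shape (d + D, D) and y in {-1, +1}; `lr` is a placeholder."""
    W = W_pretrained.detach().clone().requires_grad_(True)   # \tilde{W}^(0) = W^(T*)
    v = torch.zeros(d + D, requires_grad=True)               # \tilde{v}^(0) = 0
    for _ in range(n_steps):
        Z, y = sample_downstream()                           # one fresh downstream sample per step
        S = torch.softmax(Z.T @ W @ Z, dim=0)                # attention reuses the learned structure
        loss = F.softplus(-y * ((Z @ S.sum(dim=1)) @ v))     # log(1 + exp(-y f(Z)))
        gW, gv = torch.autograd.grad(loss, (W, v))
        with torch.no_grad():
            W -= lr * gW
            v -= lr * gv
    return W, v
```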
Implementation Considerations & Practical Implications
- Architecture Choice: The simplified one-layer architecture with a combined QK matrix and a value vector is amenable to theoretical analysis. Real-world applications might use standard Transformer blocks, but the core principle of attention learning structure could still apply.
- Initialization: Zero initialization is crucial for the theoretical analysis. Practical Transformers often use specific initialization schemes (like Xavier/He), but the paper shows learning is possible from zero.
- Optimization: Gradient descent is shown to work. The analysis uses population loss (infinite data), but experiments show similar behavior with SGD on finite datasets.
- Positional Encoding: The specific orthogonal sinusoidal encoding facilitates the analysis. Other encodings might work but could change the learned structure of $\Wb_{2,2}$.
- Benefit of Pre-training: The transfer learning result highlights a key benefit: pre-training on tasks with inherent structure (like group sparsity) allows the Transformer to learn efficient representations (focusing attention) that significantly accelerate learning on similar downstream tasks, requiring fewer samples.
- Variable Selection: This work provides theoretical backing for the intuition that attention mechanisms can perform feature/variable selection, identifying and focusing on the most relevant parts of the input sequence.
Experiments
Numerical experiments on synthetic data confirm the theoretical predictions: the loss converges, the value vector aligns with $\vb^*$, and the attention matrix correctly focuses on the $j^*$-th group. Experiments on a modified CIFAR-10 task (embedding real images into noisy patches) further validate that the mechanism works on more complex, real-world-like data and that the model can identify the correct patch containing the true image, achieving good classification accuracy. Higher-dimensional experiments also show successful learning and attention focusing.