
SA Proj. Tuning in Transformers

Updated 13 October 2025
  • Self-Attention Projections are learnable linear transformations that map inputs into query, key, and value subspaces, enabling effective attention routing.
  • SA Proj. tuning updates only the self-attention matrices, minimizing output distribution shifts and catastrophic forgetting while adapting to new tasks.
  • Empirical outcomes demonstrate that this approach yields high target accuracy with minimal forgetting, making it well suited for lifelong learning in multimodal models.

Self-Attention Projections (SA Proj.) designate the set of learnable linear transformations in each self-attention block of a transformer architecture: the parameter matrices that project input representations into query, key, and value subspaces, together with the output projection. In the context of large multimodal models (LMMs), updating only these self-attention projection layers has been shown, empirically and mechanistically, to support sequential skill acquisition while strongly retaining previously learned, general-purpose competencies (Zhu et al., 9 Oct 2025). When adapting LMMs to new tasks, this class of parameter updates contrasts with broader strategies (full-model or full-MLP tuning) by emphasizing re-routing (“who attends to what and in what combination”) rather than re-writing (altering output distributions or “memory contents”).

1. Structure and Role of Self-Attention Projection Layers

In every standard transformer block, the multi-head self-attention (MHA) mechanism computes

$$a^{(\ell)} = W_O \cdot \mathrm{softmax}\!\left( \frac{W_Q\,\mathrm{LN}(r^{(\ell-1)}) \left(W_K\,\mathrm{LN}(r^{(\ell-1)})\right)^{\top}}{\sqrt{d_k}} \right) \cdot \left(W_V\,\mathrm{LN}(r^{(\ell-1)})\right),$$

where $W_Q, W_K, W_V, W_O$ are the query, key, value, and output projection matrices and $\mathrm{LN}$ denotes layer normalization. “Self-attention projection” (SA Proj.) tuning refers specifically to updating only these matrices (across all transformer blocks) during finetuning; the MLP parameters and all other weights (embeddings, layer norms, etc.) are held fixed.
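
For concreteness, the following is a minimal single-head PyTorch sketch of the block above; the class and variable names are illustrative rather than taken from any particular LMM codebase, and real models apply multi-head variants of the same four projections.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention; the four Linear layers are the 'SA Proj.' set."""

    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)      # LN in the formula above (frozen during SA Proj. tuning)
        self.w_q = nn.Linear(d_model, d_k)   # W_Q  } the matrices
        self.w_k = nn.Linear(d_model, d_k)   # W_K  } updated by
        self.w_v = nn.Linear(d_model, d_k)   # W_V  } SA Proj.
        self.w_o = nn.Linear(d_k, d_model)   # W_O  } tuning
        self.d_k = d_k

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        # r: residual stream from the previous block, shape (batch, seq, d_model)
        x = self.ln(r)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return self.w_o(attn @ v)            # a^(l), added back to the residual stream
```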

The effect is to allow the model to re-weight and recombine information from the input sequence—essentially “rewiring” routing and pattern-recognition within the model's memory—without overwriting the stored “knowledge” that is predominantly maintained in the feed-forward/MLP sublayers.

2. Mechanistic Foundations of Preservation and Transfer

Updating only SA Proj. produces an adaptation that is largely algorithmic and dynamic in nature. Because these projections govern how tokens or modality-specific embeddings are related and prioritized through attention routing (without modifying the content-generation or “writing” pathways), the risk of shifting the output token distribution is greatly reduced. By analogy, SA Proj. tuning modifies the “read” heads of a memory system, while leaving the “write” mechanism—implemented by the MLP Down projection—untouched.

Mathematically, this selective tuning can be formalized as

$$\theta_{\mathrm{SA}} \leftarrow \theta_{\mathrm{SA}} - \eta\, \nabla_{\theta_{\mathrm{SA}}} \mathcal{L}_{\mathrm{task}},$$

with all other parameter gradients zeroed. Empirical metrics for output shift, such as the Numeric Token Bias probe ($\mathrm{NTB}_s$), confirm that the output distribution over tokens (especially for non-targeted skills) remains nearly stationary, in contrast to the substantial drift observed when MLP “Down” layers are also updated.
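
A minimal PyTorch sketch of this selective update is given below, assuming Hugging Face-style parameter names containing `q_proj`, `k_proj`, `v_proj`, and `o_proj`; the name substrings and the `task_loss_fn`/`loader` interfaces are placeholders that would need to match the actual LMM and training setup.

```python
import torch

# Assumed name substrings for the SA Proj. matrices; actual module names vary by model.
SA_PROJ_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

def restrict_to_sa_proj(model: torch.nn.Module):
    """Freeze all parameters except the self-attention projection matrices."""
    for name, p in model.named_parameters():
        p.requires_grad = any(key in name for key in SA_PROJ_KEYS)
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch: only theta_SA receives gradient updates; every other parameter
# (MLP, embeddings, layer norms) keeps its pretrained value.
def finetune_sa_proj(model, task_loss_fn, loader, lr=1e-5, steps=1000):
    optimizer = torch.optim.AdamW(restrict_to_sa_proj(model), lr=lr)
    for step, batch in zip(range(steps), loader):
        loss = task_loss_fn(model, batch)   # placeholder task loss on the new skill
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```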

3. Empirical Outcomes: Learning and Retention

Experimental results on sequential skill tuning in LMMs demonstrate that updating only SA Proj. delivers strong improvement on new, narrow tasks while limiting catastrophic forgetting:

  • Targeted skill learning: Improvement of +24.9 points (target task accuracy).
  • Retention: Minimal drop (−0.6 points) on eight held-out benchmarks, compared to up to −23 points with full-model finetuning.

An associated approach, tuning only the MLP “Gate&Up” sublayers while freezing the “Down” (output) projections, achieves similar target performance (+30.5) with slightly higher forgetting (−2.1). However, SA Proj. tuning is empirically more robust in preventing output distribution shift, as measured by the counting-bias probe

$$\mathrm{NTB}_s = \mathbb{E}_{(I,y)\in B} \left[ \frac{1}{|y|} \sum_{j=1}^{|y|} \max_{c \in C}\, p\!\left(c \mid y_{<j}, I_{\mathrm{vis}}\right) \right].$$

For SA Proj.-only updates, $\mathrm{NTB}_s$ remains near baseline, whereas MLP tuning without Down freezing produces pronounced shifts.
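
The probe can be approximated from next-token probabilities as in the sketch below; the batch format, the candidate token set $C$, and the Hugging Face-style `model(**inputs).logits` call are assumptions made for illustration, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ntb_s(model, batch, candidate_token_ids):
    """Mean max-probability mass placed on candidate tokens C at each answer position.

    batch: list of (inputs, answer_ids) pairs, where `inputs` already contains the
    visual context plus prompt and answer tokens, and `answer_ids` is a 1-D tensor of
    the target tokens y. This mirrors the NTB_s definition above under those
    (assumed) conventions.
    """
    scores = []
    for inputs, answer_ids in batch:
        logits = model(**inputs).logits                        # (1, seq_len, vocab)
        # Logits at position i predict token i+1, so these rows predict each y_j.
        answer_logits = logits[0, -len(answer_ids) - 1:-1, :]  # (|y|, vocab)
        probs = F.softmax(answer_logits, dim=-1)               # p(. | y_<j, I_vis)
        max_candidate_prob = probs[:, candidate_token_ids].max(dim=-1).values
        scores.append(max_candidate_prob.mean())
    return torch.stack(scores).mean().item()
```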

| Finetuning Method | Target Gain (points) | Forgetting (Held-out, points) | NTB_s Shift |
|---|---|---|---|
| Full Model | +25 | −23 | High |
| SA Proj. Only | +23.1 | −0.6 | Low |
| MLP (Gate&Up Only) | +30.5 | −2.1 | Low–Medium |

4. Underlying Causes of Output Bias and Forgetting

Full-model or full-MLP updates disrupt stored representations by shifting the model’s output token distribution (“counting bias” in numeracy tasks), a form of catastrophic forgetting. This arises because MLP Down projections implement the “write” operation that persists task-specific concepts or output preferences. Once shifted, the generic ability to recall or execute unrelated skills degrades, as measured independently across held-out benchmarks.

SA Proj. tuning, by constraining updates to the attention projections, focuses solely on information routing without modifying or overwriting this written memory, providing a mechanism for skill acquisition that is “read/write separated.”

5. Practical Implementation and Diagnostic Strategies

Tuning only SA Proj. is straightforward to integrate: during task-specific finetuning, gradient updates are restricted to $W_Q, W_K, W_V, W_O$ (optionally for both cross-modality and self-modality layers in LMMs). Modern deep learning frameworks (e.g., PyTorch) can filter parameter groups accordingly, as in the sketch after the list below. Evaluating performance involves:

  • Monitoring target task accuracy.
  • Tracking forgetting via held-out multi-domain benchmarks.
  • Probing output distribution shift with auxiliary diagnostics (such as NTB_s for numeracy or probabilistic bias measures).
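
These three diagnostics can be wrapped in a simple before/after harness, sketched here; `target_eval`, `held_out_evals`, and `probe_fn` are placeholder callables (for example, the NTB_s probe sketched earlier), not a published evaluation API.

```python
def track_adaptation(model, finetune_fn, target_eval, held_out_evals, probe_fn=None):
    """Report target-task gain, held-out forgetting, and (optionally) output-shift probes.

    target_eval: callable returning accuracy on the new target task.
    held_out_evals: dict of name -> callable returning accuracy on a held-out benchmark.
    probe_fn: optional callable such as an NTB_s probe (see above).
    """
    before_target = target_eval(model)
    before_held = {name: ev(model) for name, ev in held_out_evals.items()}
    before_probe = probe_fn(model) if probe_fn else None

    finetune_fn(model)   # e.g., finetune_sa_proj from the earlier sketch

    report = {
        "target_gain": target_eval(model) - before_target,
        "forgetting": {name: ev(model) - before_held[name]
                       for name, ev in held_out_evals.items()},
    }
    if probe_fn:
        report["probe_shift"] = probe_fn(model) - before_probe
    return report
```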

A plausible implication is that models requiring robust lifelong learning or continual adaptation, such as digital personal assistants or foundation vision-and-language models, may benefit from this routing-centered tuning paradigm to preserve core abilities while learning new skills.

6. Comparison to MLP and Other Selective Tuning Variants

While MLP (Gate&Up) tuning with Down frozen provides similar mitigation of forgetting, it operates further downstream and carries a slightly higher risk of output bias from residual adjustment of the “write” pathway. SA Proj. tuning is more closely tied to the model’s data-dependent algorithmic pathways and less to explicit “memory storage,” offering a favorable trade-off when retaining prior knowledge is paramount. Full-model tuning achieves only slightly better learning of the new skill at the expense of substantial output bias and knowledge loss on unrelated tasks.

7. Implications for Continual and Lifelong Learning in Multimodal Models

The selective updating of self-attention projections establishes an effective method for teaching large-scale multimodal models new skills in a robust, minimally destructive manner (Zhu et al., 9 Oct 2025). This has direct consequences for real-world deployment in environments with non-stationary objectives. Models can be taught to solve new, highly targeted tasks (such as image counting, spatial reasoning, or visual QA) without erasing capabilities such as language understanding, visual dialog, or arithmetic. Analysis of counting bias, performance drift, and retained accuracy after sequential skill updates provides a robust quantitative framework for tracking the preservation of general-purpose competencies.

The evidence indicates that, for the class of transformer-based LMMs, self-attention projections serve as an efficient and effective locus of adaptation for multi-task, lifelong learning scenarios.
