Transformer-Squared: Self-adaptive LLMs (2501.06252v3)
Abstract: Self-adaptive LLMs aim to solve the challenges posed by traditional fine-tuning methods, which are often computationally intensive and static in their ability to handle diverse tasks. We introduce Transformer-Squared, a novel self-adaptation framework that adapts LLMs for unseen tasks in real-time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer-Squared employs a two-pass mechanism: first, a dispatch system identifies the task properties, and then task-specific 'expert' vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. Our method consistently outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Furthermore, Transformer-Squared demonstrates versatility across different LLM architectures and modalities, including vision-language tasks. Transformer-Squared represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs, paving the way for truly dynamic, self-organizing AI systems.
Summary
- The paper introduces a self-adaptive LLM framework that uses Singular Value Fine-tuning (SVF) to adjust pre-trained weights with minimal parameters.
- It details an SVF method that scales singular values for regularization and compositional skill integration, outperforming traditional approaches like LoRA.
- Experimental results demonstrate improved performance on training tasks and robust generalization to unseen tasks through dynamic expert selection.
The paper "Transformer-Squared: Self-adaptive LLMs" (2501.06252) introduces a framework for enabling LLMs to adapt to unseen tasks during inference without conventional fine-tuning. Termed Transformer-Squared (Transformer2), this approach aims to mitigate the computational costs, static nature, and potential task interference associated with methods like full fine-tuning or even parameter-efficient fine-tuning (PEFT) techniques when applied repetitively or naively generalized. The core idea is to pre-train task-specific "expert" components using a novel PEFT method called Singular Value Fine-tuning (SVF) and then dynamically combine or select these experts at inference time based on the incoming prompt.
Singular Value Fine-tuning (SVF) Methodology
SVF is presented as a parameter-efficient alternative to methods like Low-Rank Adaptation (LoRA). Instead of adding trainable low-rank matrices, SVF modifies the behavior of a pre-trained weight matrix $W \in \mathbb{R}^{m \times n}$ by directly scaling its singular values. Given the Singular Value Decomposition (SVD) of the weight matrix, $W = U \Sigma V^{\top}$, where $U \in \mathbb{R}^{m \times r}$, $\Sigma \in \mathbb{R}^{r \times r}$ is a diagonal matrix of singular values $\sigma_i$, $V \in \mathbb{R}^{n \times r}$, and $r = \min(m, n)$ is the rank, SVF introduces a trainable vector $z \in \mathbb{R}^{r}$. This vector $z$ acts as a scaling factor for the singular values. The adapted weight matrix $W'$ is then computed as:
$$W' = U \Sigma' V^{\top}$$
where $\Sigma' = \Sigma \otimes \operatorname{diag}(z)$. Here, $\otimes$ denotes element-wise multiplication. This modification adjusts the contribution of each singular component $u_i v_i^{\top}$ to the layer's output.
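To make this concrete, below is a minimal PyTorch sketch of SVF applied to a single weight matrix. It is an illustration under assumed shapes, not the authors' implementation: the SVD is computed once, the frozen matrix is never updated, and the only trainable parameters are the $r$ entries of $z$.

```python
import torch

def svf_adapt(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Return W' = U (Sigma * diag(z)) V^T, i.e. W with each singular value scaled by z_i."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)  # W = U diag(S) Vh
    return U @ torch.diag(S * z) @ Vh                    # element-wise scaling of singular values

m, n = 64, 32
W = torch.randn(m, n)                    # stands in for a frozen pre-trained weight
r = min(m, n)
z = torch.ones(r, requires_grad=True)    # trainable SVF vector; z = 1 recovers the original W
W_adapted = svf_adapt(W, z)              # used in place of W in the forward pass
```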
Key properties attributed to SVF include:
- Parameter Efficiency: For each matrix, SVF requires training only $r$ parameters (the elements of $z$), which is claimed to be significantly fewer than LoRA's $(m+n)r'$ parameters, where $r'$ is the LoRA rank. The paper reports SVF using less than 10% of the parameters used by LoRA in their experiments (see the parameter-count sketch after this list).
- Compositionality: The authors argue that the learned $z$ vectors possess inherent compositionality. Because they operate independently on the orthogonal singular components of the weight matrix, linear combinations or other algebraic operations on $z$ vectors corresponding to different tasks are hypothesized to result in meaningful combinations of skills. This contrasts with LoRA, where the non-uniqueness of the low-rank decomposition ($W + BA = W + (BK)(K^{-1}A)$ for any invertible $K$) complicates simple algebraic composition.
- Regularization: By only scaling existing singular components rather than introducing entirely new parameters or directions (as in LoRA), SVF is framed as imposing a structural prior that regularizes training. This is suggested to reduce overfitting, especially when fine-tuning on small datasets, and improve stability.
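For intuition on the parameter counts, a back-of-the-envelope comparison follows; the matrix shape and LoRA rank are assumptions chosen only for illustration, not figures from the paper.

```python
# One adapted matrix: SVF trains r values, LoRA trains (m + n) * r' values.
m, n = 4096, 4096               # assumed square projection in a 4k-dim model
r = min(m, n)                   # SVF: one scale per singular value
r_lora = 16                     # assumed LoRA rank
svf_params = r                  # 4,096
lora_params = (m + n) * r_lora  # 131,072
print(f"SVF uses {svf_params / lora_params:.1%} of LoRA's trainable parameters")  # ~3.1%
```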
SVF experts (the $z$ vectors) are trained using Reinforcement Learning (RL), specifically the REINFORCE algorithm, to directly optimize task-specific rewards. The objective function maximizes the expected reward $r(\hat{y}_i, y_i)$ for a generated output $\hat{y}_i$ given input $x_i$ and target $y_i$, while regularizing against divergence from the original model's distribution using a KL divergence term:
$$J(\theta_z) = \mathbb{E}_{\hat{y}_i \sim \pi_{\theta_{W'}}(\cdot \mid x_i)}\left[\log\big(\pi_{\theta_{W'}}(\hat{y}_i \mid x_i)\big)\, r(\hat{y}_i, y_i)\right] - \lambda D_{\mathrm{KL}}\big(\pi_{\theta_{W'}} \,\|\, \pi_{\theta_W}\big)$$
Here, $\theta_z$ represents the parameters of the SVF vector $z$, $\pi_{\theta_{W'}}$ is the policy (LLM) using the SVF-modified weights, and $\pi_{\theta_W}$ is the policy using the original weights. The paper notes that this RL approach, combined with SVF's regularization, allows stable training even on datasets providing only correctness feedback, without needing detailed step-by-step reasoning data often used in supervised fine-tuning.
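A minimal sketch of this objective as a per-sample training loss is shown below. The interface (token-level log-probabilities from the adapted and frozen policies, a scalar correctness reward) and the single-sample KL estimate are assumptions for illustration, not the authors' code.

```python
import torch

def svf_rl_loss(logprobs_adapted: torch.Tensor,  # log pi_{W'}(y_hat | x), per generated token
                logprobs_base: torch.Tensor,     # log pi_{W}(y_hat | x), per token, detached
                reward: float,                   # r(y_hat, y), e.g. 1.0 if correct else 0.0
                kl_coeff: float = 0.1) -> torch.Tensor:
    reinforce = -reward * logprobs_adapted.sum()       # REINFORCE surrogate: maximize reward-weighted log-likelihood
    kl_est = (logprobs_adapted - logprobs_base).sum()  # Monte-Carlo estimate of KL(pi_{W'} || pi_W) on this sample
    return reinforce + kl_coeff * kl_est               # minimize the negative of the objective J(theta_z)
```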
Transformer-Squared Self-Adaptation Framework
The Transformer2 framework utilizes pre-trained SVF expert vectors $(z_1, \dots, z_K)$, each specialized for a particular domain or task type (e.g., mathematics, coding, reasoning). The goal is to dynamically deploy these skills for an incoming prompt, potentially from a task distribution not seen during expert training. This is implemented via a two-pass inference mechanism (sketched in code after the two passes below):
- First Pass (Adaptation): The system analyzes the input prompt x and potentially some additional test-time context (e.g., few-shot examples) to determine an appropriate adaptation vector z′. This involves selecting or constructing z′ based on the perceived task demands.
- Second Pass (Generation): The LLM uses the adapted weights $W'$, derived from $z'$ via the SVF formula $W' = U(\Sigma \otimes \operatorname{diag}(z'))V^{\top}$, to generate the final response $\hat{y}$ for the original prompt $x$.
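Schematically, the two passes can be read as the loop below; `choose_expert` and `model.adapted` are hypothetical helpers standing in for whichever adaptation strategy and weight-patching machinery is used.

```python
def transformer2_inference(model, prompt, experts, choose_expert):
    # First pass: run the unadapted model only to pick or construct z'
    # (prompt engineering, a classification expert, or a CEM-mixed vector).
    z_prime = choose_expert(model, prompt, experts)

    # Second pass: answer the prompt with SVF-adapted weights
    # W' = U (Sigma * diag(z')) V^T applied to the targeted layers.
    with model.adapted(z_prime):   # hypothetical context manager applying the scaling
        return model.generate(prompt)
```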
Three specific adaptation strategies are proposed and evaluated for determining z′ in the first pass:
- Prompt Engineering: An "adaptation prompt" is constructed and prepended to the original prompt $x$, asking the base LLM (using its original weights $W$) to classify $x$ into one of the $K$ expert domains or an "others" category. The pre-trained SVF expert $z_k$ corresponding to the predicted domain is selected as $z'$. This strategy requires no additional training beyond the initial SVF experts.
- Classification Expert: An additional SVF expert, $z_c$, is trained specifically for the task of classifying input prompts into the $K$ expert domains. During the first pass, the model uses weights adapted by $z_c$ to classify the prompt $x$. The corresponding expert $z_k$ is then selected as $z'$. This aims to improve classification accuracy over simple prompt engineering.
- Few-shot Adaptation: This strategy assumes access to a small number (e.g., 3-10) of example prompt-completion pairs from the target task distribution at test time. It seeks an optimal linear combination of the $K$ pre-trained experts, $z' = \sum_{k=1}^{K} \alpha_k z_k$ with $\sum_k \alpha_k = 1$ and $\alpha_k \geq 0$. The coefficients $\alpha_k$ are optimized using the Cross-Entropy Method (CEM), a derivative-free optimization technique; the CEM objective is to maximize the average reward (e.g., generation likelihood or task-specific score) on the provided few-shot examples when using the combined expert $z'$ (see the sketch after this list). This allows a tailored adaptation vector $z'$ to be created for nuanced or mixed-domain unseen tasks. The CEM optimization is performed once per task, amortizing its cost over subsequent inferences for that task.
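A rough sketch of the CEM search over mixing coefficients is given below; the hyperparameters and the `evaluate` callback (which scores a candidate mixture $z' = \sum_k \alpha_k z_k$ on the few-shot examples) are illustrative assumptions, not details from the paper.

```python
import numpy as np

def cem_mix_experts(z_experts, evaluate, iters=20, pop=32, elite_frac=0.25, seed=0):
    """Search for mixing coefficients alpha over K expert vectors with the Cross-Entropy Method."""
    rng = np.random.default_rng(seed)
    K = len(z_experts)
    mu, sigma = np.full(K, 1.0 / K), np.full(K, 0.3)        # Gaussian search distribution over alpha
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        alphas = rng.normal(mu, sigma, size=(pop, K)).clip(min=0.0)
        alphas /= alphas.sum(axis=1, keepdims=True) + 1e-8   # enforce alpha_k >= 0 and sum to 1
        scores = np.array([evaluate(sum(a_k * z_k for a_k, z_k in zip(a, z_experts)))
                           for a in alphas])                 # few-shot reward of each candidate z'
        elites = alphas[np.argsort(scores)[-n_elite:]]       # keep the best-scoring mixtures
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-3
    return mu / mu.sum()                                     # final coefficients alpha_k
```

Because the search is gradient-free, it needs only the few-shot reward signal, and it runs once per task rather than once per prompt.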
Experimental Results
The paper presents experiments primarily using Llama3-8B/70B and Mistral-7B models on tasks like GSM8K (math), MBPP-Pro (coding), and ARC-Easy (reasoning) for training SVF experts. Unseen-task evaluation covers MATH, HumanEval, ARC-Challenge, and the vision-language tasks TextVQA and OKVQA (using LLaVA-1.5).
- SVF vs. LoRA (Training Tasks): SVF reportedly outperformed both the base models and LoRA (trained via conventional next-token prediction on formatted instruction data) on the training tasks, while using substantially fewer parameters (<10% of LoRA). For instance, on Llama3-8B, SVF achieved scores of 78.4 on GSM8K and 77.6 on MBPP-Pro, compared to LoRA's 68.0 and 69.5, respectively (Table 1). Similar trends were observed for Llama3-70B and Mistral-7B.
- Transformer2 Adaptation (Unseen Tasks): On unseen tasks, the Transformer2 adaptation strategies generally showed improvements over the base model. In contrast, simply selecting the best-performing LoRA checkpoint from the training tasks often failed to generalize or even degraded performance compared to the base model on these unseen tasks (Table 2). The adaptation strategies exhibited performance improvements correlating with the sophistication of the adaptation mechanism and the amount of test-time information used: Few-shot Adaptation > Classification Expert ≈ Prompt Engineering > Base Model. For example, on the MATH dataset with Llama3-8B, the base model scored 10.7, Prompt Engineering 11.4, Classification Expert 11.5, and Few-shot Adaptation 14.1. The best LoRA checkpoint (trained on GSM8K) scored 10.6 (Table 2).
- Vision-Language Tasks: SVF improved performance on TextVQA over the base LLaVA model. Notably, Transformer2 using language-only experts (math, code, reasoning) combined via Few-shot Adaptation improved performance on the unseen OKVQA task, suggesting cross-modal skill transfer (Figure 5).
- Ablation Studies: SVF trained with RL (SVF+RL) significantly outperformed SVF trained with next-token prediction (SVF+NTP). Furthermore, LoRA trained with RL (LoRA+RL) proved unstable and performed worse than SVF+RL, highlighting the claimed stability benefits of SVF's regularization in the RL setting (Table 3).
- Analysis: The Classification Expert strategy achieved higher accuracy in identifying the correct expert domain than Prompt Engineering (Figure 6). The weights ($\alpha_k$) learned by CEM in Few-shot Adaptation often showed interpretable alignments (e.g., the MATH task relying heavily on the GSM8K expert) but also revealed complex interactions, such as the ARC expert contributing positively to the MATH task (Figure 7).
- Cross-Model Compatibility: A striking claim is the successful transfer of SVF experts trained on Llama3-8B to Mistral-7B, and vice-versa, leading to performance gains. This is attributed to the structural similarity imposed by operating on ordered singular values. Combining experts from both models yielded further improvements (Table 5).
- Inference Overhead: The first adaptation pass introduces overhead, but the paper argues it is manageable, particularly relative to potentially long generation times in the second pass. For Few-shot Adaptation using CEM, the overhead is incurred once per task (Table 4).
Advantages Claimed Over LoRA
Based on their methodology and results, the authors claim several advantages for the Transformer2 framework utilizing SVF, compared to using LoRA:
- Parameter Efficiency: SVF requires drastically fewer trainable parameters per adapted module ($r$ vs. $(m+n)r'$), reducing storage and potentially computational costs for training and managing many experts.
- Performance & Generalization: SVF achieved superior results on the training tasks compared to LoRA. More importantly, the Transformer2 adaptation mechanism demonstrated better generalization to unseen tasks compared to selecting a single LoRA checkpoint.
- Compositionality: SVF vectors (z) are argued to be amenable to algebraic composition (e.g., linear interpolation via CEM) for combining skills, a property lacking in LoRA matrices due to representational ambiguities.
- Training Stability & Flexibility: SVF's formulation provides regularization that stabilizes RL training, allowing direct optimization of task rewards with less complex data requirements compared to supervised fine-tuning often used with LoRA.
- Reduced Overfitting: The results suggest SVF is less prone to overfitting on specific training tasks, contributing to its better generalization capability within the Transformer2 framework.
- Full-Rank Modification: Although parameter-efficient in terms of trainable parameters, SVF modifies the weight matrix in a full-rank manner by scaling all singular values, potentially offering greater expressive power than low-rank updates.
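The full-rank point can be checked numerically; the shapes and LoRA rank in the snippet below are arbitrary assumptions, chosen only to show that an SVF update generically perturbs all $\min(m, n)$ singular directions while a LoRA update $BA$ has rank at most $r'$.

```python
import torch

m, n, r_lora = 64, 32, 4
W = torch.randn(m, n)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

z = torch.rand(min(m, n)) + 0.5                               # arbitrary positive scalings (almost surely != 1)
delta_svf = U @ torch.diag(S * z) @ Vh - W                    # change induced by SVF
delta_lora = torch.randn(m, r_lora) @ torch.randn(r_lora, n)  # change induced by a LoRA update

print(torch.linalg.matrix_rank(delta_svf).item())   # typically min(m, n) = 32
print(torch.linalg.matrix_rank(delta_lora).item())  # at most r' = 4
```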
Conclusion
In conclusion, the "Transformer-Squared: Self-adaptive LLMs" paper proposes a novel framework for dynamic LLM adaptation at inference time. It introduces Singular Value Fine-tuning (SVF) as a parameter-efficient method for creating composable expert skills by scaling singular values of weight matrices, trained effectively via RL. The Transformer2 framework then employs a two-pass mechanism with various strategies (Prompt Engineering, Classification Expert, Few-shot Adaptation) to analyze incoming prompts and dynamically apply the relevant SVF experts. The experimental results suggest advantages over LoRA in terms of parameter efficiency, performance on training tasks, generalization to unseen tasks through dynamic adaptation, and training stability, positioning Transformer2 as a potential direction for creating more flexible and efficient LLMs. The claimed cross-model compatibility of SVF experts is also a notable finding.
Related Papers
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models (2025)
- TTRL: Test-Time Reinforcement Learning (2025)
- Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models (2025)
- Self-Challenging Language Model Agents (2025)
- Self-Adapting Language Models (2025)
HackerNews
- Transformer-squared: Self-adaptive LLMs (2 points, 0 comments)
- Transformer2: Self-adaptive LLMs (118 points, 26 comments)
- Transformer^2: Self-adaptive LLMs (114 points, 13 comments)