- The paper introduces VeRA, a method that leverages frozen, shared low-rank random matrices and trainable scaling vectors to drastically reduce the number of trainable parameters in model finetuning.
- It demonstrates competitive results on benchmarks such as GLUE, E2E, and image classification, matching or exceeding LoRA with significantly fewer parameters.
- Its practical implications include reduced memory usage and efficient deployment in both cloud and edge devices, while also inspiring future research into lower-dimensional adaptation spaces.
Vector-based Random Matrix Adaptation (VeRA): A Comprehensive Overview
Introduction
In the field of large language models (LLMs), efficient model adaptation has become a paramount challenge. This paper introduces Vector-based Random Matrix Adaptation (VeRA), a novel finetuning method designed to minimize the number of trainable parameters while maintaining performance comparable to existing state-of-the-art approaches such as Low-Rank Adaptation (LoRA). The primary innovation in VeRA is the use of a single pair of frozen, randomly initialized low-rank matrices shared across all layers, reparameterized with trainable per-layer scaling vectors. This design yields significant memory savings, facilitates efficient model deployment, and is particularly suited to scalable applications in cloud-based AI services and on edge devices.
Methodology
Core Mechanism of VeRA
VeRA uses a single pair of low-rank matrices, A and B, which are randomly initialized, frozen, and shared across all layers of the model. Adaptation is achieved through trainable scaling vectors, d and b, that adjust the influence of these matrices on a per-layer basis. In mathematical terms, the adjustment to an initial weight matrix $W_0$ is:
$$
h = W_0 x + \Delta W x = W_0 x + \Lambda_b B \Lambda_d A x
$$
where $\Lambda_b$ and $\Lambda_d$ are diagonal matrices formed from the scaling vectors b and d, respectively.
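To make the mechanism concrete, the following is a minimal PyTorch sketch of such a layer. The module name VeRALinear, the d_init default, and the shape conventions are illustrative assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    """Illustrative VeRA-style wrapper around a frozen pretrained linear layer."""

    def __init__(self, W0: nn.Linear, A: torch.Tensor, B: torch.Tensor, d_init: float = 0.1):
        super().__init__()
        self.W0 = W0  # pretrained layer, kept frozen
        for p in self.W0.parameters():
            p.requires_grad = False
        # A (r x in_features) and B (out_features x r) are random, frozen, and
        # shared across every adapted layer, so they are registered as buffers.
        self.register_buffer("A", A)
        self.register_buffer("B", B)
        r, out_features = A.shape[0], B.shape[0]
        # Trainable per-layer scaling vectors (the diagonals of Lambda_d and Lambda_b).
        self.d = nn.Parameter(torch.full((r,), d_init))
        self.b = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + Lambda_b B Lambda_d A x; the diagonal matrices are applied
        # as elementwise scalings rather than explicit matrix products.
        delta = (x @ self.A.T) * self.d      # Lambda_d A x
        delta = (delta @ self.B.T) * self.b  # Lambda_b B (Lambda_d A x)
        return self.W0(x) + delta
```

Because $\Lambda_b$ and $\Lambda_d$ are diagonal, they can be applied as elementwise scalings of the intermediate activations; the only trainable state per layer is the pair of vectors b and d.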
Initialization Strategies
The shared matrices A and B are initialized using Kaiming initialization for numerical stability, while the scaling vectors are initialized such that b starts at zero and d at a small constant value. Because b is zero, the low-rank update $\Lambda_b B \Lambda_d A$ vanishes at initialization, so the model initially behaves exactly like the pretrained network and adapts gradually and stably during finetuning.
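A quick way to check this property, reusing the hypothetical VeRALinear module sketched above (the rank and d_init values below are arbitrary illustrative choices): with b initialized to zero, the adapted layer reproduces the pretrained layer's output exactly.

```python
import math
import torch
import torch.nn as nn

# Shared, frozen matrices with Kaiming-uniform initialization, as described above.
in_f, out_f, r = 768, 768, 256
A = torch.empty(r, in_f);  nn.init.kaiming_uniform_(A, a=math.sqrt(5))
B = torch.empty(out_f, r); nn.init.kaiming_uniform_(B, a=math.sqrt(5))

# VeRALinear is the illustrative module from the sketch in the previous section.
layer = VeRALinear(nn.Linear(in_f, out_f), A, B, d_init=0.1)

x = torch.randn(4, in_f)
# b == 0 zeroes out the entire low-rank term, so the adapted layer matches
# the pretrained layer exactly at the start of finetuning.
assert torch.allclose(layer(x), layer.W0(x))
```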
Experimental Evaluation
GLUE Benchmark
Evaluation on the General Language Understanding Evaluation (GLUE) benchmark indicates that VeRA achieves performance comparable to LoRA on both RoBERTa-base and RoBERTa-large, with a significantly reduced parameter count. For example, VeRA finetunes RoBERTa-base with just 43K trainable parameters and reaches an average score of 85.2, close to LoRA's 86.6 with 300K parameters. This demonstrates VeRA's superior parameter efficiency while maintaining high predictive accuracy.
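The gap in parameter counts can be reproduced with a back-of-the-envelope calculation: per adapted weight matrix, LoRA trains r(d_in + d_out) values, whereas VeRA trains only d_out + r (the vectors b and d). The sketch below assumes RoBERTa-base's hidden size of 768, adaptation of the query and value projections in all 12 layers, a LoRA rank of 8, and a VeRA rank of 1024; under these assumptions the totals match the figures quoted above.

```python
def lora_trainable(d_in: int, d_out: int, r: int) -> int:
    # LoRA trains a fresh pair of low-rank matrices for every adapted weight.
    return r * (d_in + d_out)

def vera_trainable(d_out: int, r: int) -> int:
    # VeRA trains only the scaling vectors: b (length d_out) and d (length r).
    return d_out + r

hidden, layers, adapted_per_layer = 768, 12, 2   # query and value projections
n_matrices = layers * adapted_per_layer

print("LoRA:", n_matrices * lora_trainable(hidden, hidden, r=8))   # 294,912 ~ 0.3M
print("VeRA:", n_matrices * vera_trainable(hidden, r=1024))        # 43,008  ~ 43K
```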
E2E Benchmark and Instruction Tuning
On the E2E benchmark, VeRA also outperforms LoRA on GPT-2 models, achieving better BLEU, NIST, METEOR, and ROUGE-L scores with a three- to four-fold reduction in trainable parameters. Similarly, when instruction-tuning LLaMA models, VeRA achieves MT-Bench scores comparable to LoRA's while using 100 times fewer trainable parameters. These results underscore VeRA's capability across adaptation scenarios, from language generation to instruction tuning.
Image Classification Tasks
Further experiments on image classification with Vision Transformers (ViT) confirm VeRA's versatility. When finetuning ViT on CIFAR100, Food101, Flowers102, and RESISC45, VeRA retains high classification accuracy while using an order of magnitude fewer trainable parameters than LoRA.
Implications
Practical Applications
VeRA's substantial reduction in trainable parameters has immediate practical implications. Above all, it shrinks the memory needed to store finetuned weights, which makes it feasible to keep many adapted model instances resident on a single GPU and thereby improves serving efficiency in cloud environments. This is especially beneficial for personalized AI services, where each user or context requires its own adjustments but only negligible additional storage.
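As a hypothetical illustration of what this enables at serving time (assuming adapted layers expose their scaling vectors as parameters named b and d, as in the earlier VeRALinear sketch), each finetuned variant reduces to a tiny file of vectors, and switching tasks never touches the frozen base weights or the shared matrices.

```python
import torch

def save_task_vectors(model: torch.nn.Module, path: str) -> None:
    """Persist only the per-task scaling vectors; W0, A, and B never leave the GPU."""
    vectors = {name: p.detach().cpu()
               for name, p in model.named_parameters()
               if name.split(".")[-1] in ("b", "d")}
    torch.save(vectors, path)  # typically a few kilobytes per task

def load_task_vectors(model: torch.nn.Module, path: str) -> None:
    """Swap in another task's adaptation without touching the frozen weights."""
    model.load_state_dict(torch.load(path), strict=False)
```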
Theoretical Insights and Future Directions
The paper also raises intriguing questions about the latent structure and dimensionality of model adaptation spaces. VeRA’s efficiency suggests that pretrained models can be finetuned effectively within lower-dimensional subspaces than previously assumed. This points to potential future explorations in dynamic parameter budget allocations and novel initialization strategies to further optimize training efficiency and model performance.
Conclusion
VeRA introduces a significant advancement in the parameter-efficient adaptation of LLMs. It provides comparable performance to existing methods like LoRA with drastically fewer trainable parameters, thereby enhancing deployment feasibility and serving efficiency. The experimental results across different benchmarks and tasks validate its versatility and effectiveness, marking it as a promising direction for future research in model finetuning. This method holds considerable potential for widespread application, especially in environments where computational resources are at a premium.