
VeRA Adapter: Efficient Neural Adaptation

Updated 10 December 2025
  • VeRA Adapter is a framework that adapts large pre-trained neural networks by sharing a global pair of frozen random matrices while learning small, per-layer scaling vectors.
  • It achieves significant parameter and storage reductions compared to LoRA, cutting trainable parameters from hundreds of thousands to as few as 24K while maintaining competitive performance.
  • The approach extends to a probabilistic variant, PVeRA, which adds uncertainty estimation capabilities for calibrated predictions in both NLP and vision tasks.

VeRA Adapter (Vector-based Random Matrix Adaptation) is a parameter-efficient framework for adapting large pre-trained neural networks, such as Transformers and Vision Transformers (ViTs), with minimal additional trainable parameters. VeRA achieves significant reductions in both parameter count and storage requirements relative to prior low-rank adaptation approaches such as LoRA, while maintaining downstream task performance. The technique is underpinned by sharing frozen random low-rank projection matrices throughout the model and learning only lightweight, per-layer scaling vectors, enabling highly compressed adapter modules for large language and vision models (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).

1. Mathematical Formulation of VeRA

Let $W_0 \in \mathbb{R}^{m \times n}$ be a frozen pre-trained weight matrix, as typically found in a Transformer block (e.g., for the Q or V projection in MHSA or an MLP linear). Rather than fully fine-tuning $W_0$, VeRA models the adaptation as an additive low-rank update:

$$h = W_0 x + \Delta W x$$

Where LoRA parameterizes the update as $\Delta W = B A$ with low-rank factors $A \in \mathbb{R}^{r \times n}$, $B \in \mathbb{R}^{m \times r}$ (with $A$, $B$ learned per layer), VeRA instead shares a single global random pair $(A, B)$ across all adapted layers and learns only per-layer diagonal scaling vectors:

  • $d^{(\ell)} \in \mathbb{R}^r$: scaling for $A$ (input-side)
  • $b^{(\ell)} \in \mathbb{R}^m$: scaling for $B$ (output-side)

For layer $\ell$, the adapted update is:

$$\Delta W^{(\ell)} = \operatorname{diag}(b^{(\ell)})\, B\, \operatorname{diag}(d^{(\ell)})\, A$$

with the new projected activation:

$$h^{(\ell)} = W_0^{(\ell)} x + \operatorname{diag}(b^{(\ell)})\, B\, \operatorname{diag}(d^{(\ell)})\, (A x)$$

Effectively, only the small vectors $d^{(\ell)}$ and $b^{(\ell)}$ (of sizes $r$ and $m$, respectively) are trained and stored per layer, while $A$, $B$ are held constant and shared throughout the network (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).
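The factorization can be checked numerically. The following is a minimal NumPy sketch (NumPy is used here only to keep the example dependency-light; the toy sizes, the seed, and all variable names are illustrative assumptions, not values from the papers). It compares the explicitly built update matrix with the element-wise form that never materializes it:

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed for the illustration
m, n, r = 6, 5, 3                    # toy sizes (assumed)

A = rng.standard_normal((r, n))      # shared frozen matrix A (r x n)
B = rng.standard_normal((m, r))      # shared frozen matrix B (m x r)
d = rng.standard_normal(r)           # per-layer scaling vector d (length r)
b = rng.standard_normal(m)           # per-layer scaling vector b (length m)
x = rng.standard_normal(n)           # a single input vector

# Explicit update: diag(b) B diag(d) A, an m x n matrix
delta_W = np.diag(b) @ B @ np.diag(d) @ A

# Factorized application: two thin products and two element-wise scalings
update = b * (B @ (d * (A @ x)))

same = np.allclose(delta_W @ x, update)
rank = np.linalg.matrix_rank(delta_W)
print(same, rank)
```

The two paths agree, and the update has rank at most $r$, while the layer itself contributes only the $m + r$ entries of `b` and `d`.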

2. Comparison to LoRA and Other Adapters

The key distinction between LoRA and VeRA lies in parameterization and resource requirements:

| | LoRA | VeRA |
|---|---|---|
| Trainable params (per layer) | $m \cdot r + r \cdot n$ | $m + r$ |
| Low-rank matrices | Separate $A^{(\ell)}, B^{(\ell)}$ per layer | Global $A, B$ (frozen, shared) |
| Storage | $A^{(\ell)}, B^{(\ell)}$ | Seed, $b^{(\ell)}$, $d^{(\ell)}$ |
| Typical reduction | - | 1–2 orders of magnitude fewer params for equal $r$ |
| Empirical performance | Baseline | Matches or outperforms for same FLOPs/params |

For example, adapting one $1024 \times 1024$ weight matrix in each of $L = 24$ layers of a $d_\mathrm{model} = 1024$ model with $r = 16$:

  • LoRA: $2 \cdot L \cdot 1024 \cdot 16 = 786{,}432$ trainable params
  • VeRA: $L \cdot (1024 + 16) = 24 \cdot 1040 = 24{,}960$ params
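The arithmetic above can be reproduced with a few lines of plain Python. This is only a sketch of the counting rule from the comparison table; the function names are hypothetical, and the value of 24 layers is inferred from the totals quoted in the example:

```python
def lora_params(num_layers: int, m: int, n: int, r: int) -> int:
    """Trainable parameters when every layer learns its own A (r x n) and B (m x r)."""
    return num_layers * (m * r + r * n)


def vera_params(num_layers: int, m: int, r: int) -> int:
    """Trainable parameters when every layer learns only b (length m) and d (length r)."""
    return num_layers * (m + r)


num_layers, d_model, r = 24, 1024, 16   # 24 layers is implied by the totals in the text
lora = lora_params(num_layers, d_model, d_model, r)
vera = vera_params(num_layers, d_model, r)
ratio = lora / vera
print(lora, vera, round(ratio, 1))
```

For this configuration the ratio is roughly 31.5, which sits inside the "1–2 orders of magnitude" range quoted in the table.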

Empirical studies on GLUE, E2E, and image recognition tasks show that VeRA remains competitive with LoRA despite this compression (Kopiczko et al., 2023); representative numbers are given in Section 4.

3. Implementation and Initialization

The global matrices $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ are sampled only once (e.g., with Kaiming initialization) and are not updated thereafter. Per-layer scaling vectors are initialized with every entry of $d^{(\ell)}$ set to a constant (e.g., $0.1$ or $10^{-7}$) and $b^{(\ell)} = 0$, so $\Delta W^{(\ell)} = 0$ at initialization and the adapted model starts out identical to the pre-trained one.

A canonical PyTorch-style implementation:

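import torch
import torch.nn as nn
from torch.nn.init import kaiming_uniform_

m, n, r = 1024, 1024, 16  # example sizes (assumed for illustration)

# Global frozen random matrices: sampled once, shared by every adapted layer
A = nn.Parameter(torch.empty(r, n), requires_grad=False)
B = nn.Parameter(torch.empty(m, r), requires_grad=False)
kaiming_uniform_(A)
kaiming_uniform_(B)

class VeRAAdapter(nn.Module):
    def __init__(self, m, r, d_init=0.1):
        super().__init__()
        # Only these two vectors are trained and stored per layer
        self.d = nn.Parameter(torch.full((r,), d_init))  # constant init
        self.b = nn.Parameter(torch.zeros(m))            # zero init, so the update starts at 0

    def forward(self, x, W0):
        # x: (batch, n), W0: frozen (m, n)
        h0 = x.matmul(W0.T)         # frozen path, (batch, m)
        y = A.matmul(x.T)           # project down, (r, batch)
        y = self.d[:, None] * y     # scale each of the r rows by d
        y = B.matmul(y)             # project up, (m, batch)
        y = self.b[:, None] * y     # scale each of the m rows by b
        return h0 + y.T             # (batch, m)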

On model deployment, only the seed for $(A, B)$ and all $d^{(\ell)}, b^{(\ell)}$ must be stored. Backpropagation updates only the scaling vectors (Kopiczko et al., 2023).
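The seed-based storage scheme can be sketched as follows. This is an illustrative NumPy example, not the authors' checkpoint format: the `checkpoint` layout and the helper `make_shared_matrices` are hypothetical, and a standard normal draw stands in for Kaiming initialization for brevity. It also assumes the random generator produces identical values wherever the checkpoint is loaded, which should be verified across library versions and devices in practice.

```python
import numpy as np


def make_shared_matrices(seed: int, m: int, n: int, r: int):
    """Regenerate the frozen pair (A, B) deterministically from a seed."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((r, n))   # stand-in for Kaiming init (assumption)
    B = rng.standard_normal((m, r))
    return A, B


m, n, r, num_layers = 8, 8, 4, 3      # toy sizes (assumed)

# Everything that has to be saved: one seed plus the per-layer vectors
checkpoint = {
    "seed": 1234,
    "d": [np.full(r, 0.1) for _ in range(num_layers)],
    "b": [np.zeros(m) for _ in range(num_layers)],
}

# Two independent loads rebuild exactly the same matrices
A1, B1 = make_shared_matrices(checkpoint["seed"], m, n, r)
A2, B2 = make_shared_matrices(checkpoint["seed"], m, n, r)

stored_floats = sum(v.size for v in checkpoint["d"] + checkpoint["b"])
print(np.array_equal(A1, A2), np.array_equal(B1, B2), stored_floats)
```

In this toy setting the checkpoint holds 36 numbers plus the seed, whereas the shared matrices alone would take 64.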

4. Empirical Results and Benchmarks

VeRA was evaluated in various settings, including NLP (GLUE, E2E, instruction tuning with Llama 7B/13B) and vision (CIFAR100, Food101, Flowers102, RESISC45 with ViT-B/L).

  • On GLUE: RoBERTa-base (adapt Q/V, $r=1024$): $0.043$M params, $85.2$ avg. score, vs. LoRA $86.6$ ($0.3$M params)
  • ViT-B: CIFAR100, rank=256: $24.6$K params (VeRA) vs $294$K (LoRA); accuracy within $\pm 1$ pt.
  • Instruction tuning: Llama2-7B, $r=1024$, $1.6$M params, MT-Bench $4.77$ (vs. LoRA $5.03$, $159.9$M params)

Across all cases, VeRA achieves comparably high accuracy with a fraction of trainable and storable adapter weights (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).
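Two of the reported VeRA parameter counts can be recovered from the per-layer rule $m + r$. The sketch below assumes the standard configurations of RoBERTa-base and ViT-B (12 Transformer blocks, hidden width 768) and that exactly the Q and V projections are adapted in each block; the helper name is hypothetical.

```python
def vera_trainable(num_blocks: int, matrices_per_block: int, m: int, r: int) -> int:
    """Each adapted matrix contributes one b (length m) and one d (length r)."""
    return num_blocks * matrices_per_block * (m + r)


# Assumed configs: 12 blocks, hidden width 768, Q and V adapted in every block
roberta_base = vera_trainable(12, 2, 768, 1024)   # GLUE setting, r = 1024
vit_b = vera_trainable(12, 2, 768, 256)           # CIFAR100 setting, rank = 256
print(roberta_base, vit_b)
```

Under these assumptions the counts come to 43,008 and 24,576, in line with the $0.043$M and $24.6$K figures listed above.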

5. Extensions: Probabilistic VeRA (PVeRA)

PVeRA is a probabilistic variant that enables uncertainty estimation and confidence-aware predictions while preserving VeRA’s parameter efficiency (Fillioux et al., 8 Dec 2025). The key modifications:

  • Treat the low-rank code $z_q$ as a latent variable with a learned mean $\mu_q(x)$ and standard deviation $\sigma_q(x)$.
  • At each forward pass, sample $z_q = \mu_q(x) + \sigma_q(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (reparameterization trick).
  • Add a $\mathrm{KL}$-divergence penalty to the loss:

$$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{classification} + \beta \sum_{\text{layers}} D_{KL}\big(q(z \mid x) \,\|\, \mathcal{N}(0, I)\big)$$

The hyperparameter $\beta$ controls the regularization strength.
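The sampling step and the KL term can be sketched in NumPy as below. This is not the PVeRA implementation: it assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance (the text only specifies a mean and a standard deviation), takes those quantities as given rather than computing them from $x$, and uses placeholder values for $\beta$ and the classification loss. All function names are hypothetical.

```python
import numpy as np


def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)


rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2, 0.0, 1.0])               # toy posterior mean (assumed)
log_var = np.log(np.array([1.0, 0.5, 2.0, 0.1]))   # toy posterior variances (assumed)

z = sample_latent(mu, log_var, rng)

kl_per_layer = [kl_to_standard_normal(mu, log_var)]   # one entry per adapted layer
beta, classification_loss = 0.01, 0.7                  # placeholder values
total_loss = classification_loss + beta * sum(kl_per_layer)
print(z.shape, total_loss)
```

The KL term vanishes only when the posterior equals the standard normal prior, so a larger $\beta$ pulls $\mu_q$ toward zero and $\sigma_q$ toward one.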

At inference, two modes are supported:

  • Deterministic: set $z_q = \mu_q(x)$ and merge into $W_0$, yielding zero overhead.
  • Probabilistic: sample multiple $z_q$ for calibrated uncertainty.
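The difference between the two modes can be illustrated with a generic Monte Carlo sketch. It is deliberately simplified and makes several assumptions that are not taken from the PVeRA paper: the latent code feeds a hypothetical linear classification head directly, the posterior mean and standard deviation are fixed toy values, and predictive entropy is used as the uncertainty score.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def mc_predict(mu, sigma, head, num_samples, rng):
    """Average class probabilities over samples z = mu + sigma * eps."""
    eps = rng.standard_normal((num_samples,) + mu.shape)
    z = mu + sigma * eps                     # (num_samples, r)
    probs = softmax(z @ head.T)              # (num_samples, num_classes)
    return probs.mean(axis=0)


rng = np.random.default_rng(0)
r, num_classes = 4, 3                        # toy sizes (assumed)
mu = rng.standard_normal(r)                  # toy posterior mean
sigma = np.full(r, 0.3)                      # toy posterior standard deviation
head = rng.standard_normal((num_classes, r)) # hypothetical linear head

p_det = softmax(mu @ head.T)                 # deterministic mode: z = mu
p_mc = mc_predict(mu, sigma, head, 200, rng) # probabilistic mode: 200 samples
entropy = -np.sum(p_mc * np.log(p_mc + 1e-12))
print(p_det, p_mc, entropy)
```

The deterministic mode needs one forward pass, while the probabilistic mode trades extra passes for a distribution whose spread can be inspected.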

On VTAB-1k, PVeRA yielded $71.4\%$ average accuracy (30K params) vs $69.9\%$ for VeRA and $70.5\%$ for LoRA (393K params), with statistically significant improvements on several tasks (Fillioux et al., 8 Dec 2025).

6. Practical Recommendations and Guidelines

  • Rank Selection: Start with small $r$ (1–4) and increase as needed for the task; $r=256$ is optimal on VTAB-1k (Fillioux et al., 8 Dec 2025).
  • Learning Rate: Use higher learning rates for adapter vectors ($b$, $d$) than for head; e.g., $1\mathrm{e}{-2}$ for adapters, $4\mathrm{e}{-3}$ for head.
  • Storage: Only a seed (to reconstruct $A$, $B$) and the minuscule per-layer vectors need to be stored or transmitted for deployment.
  • Adapter Placement: In vision models, adapting both Q and V branches outperforms Q alone, V alone, or all projections (Fillioux et al., 8 Dec 2025).
  • Probabilistic Extension: Use PVeRA for tasks requiring confidence intervals or out-of-distribution detection.
  • Deployment: For pure efficiency, use the deterministic (weight-merged) inference mode to incur no compute overhead compared to the original model.
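The weight-merged deployment mode mentioned in the last item can be sketched as follows. This is an illustrative NumPy example for the plain (deterministic) VeRA update with toy sizes and random stand-in values for the trained vectors; none of the names or numbers come from the papers.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, batch = 6, 5, 3, 4               # toy sizes (assumed)

W0 = rng.standard_normal((m, n))          # frozen pre-trained weight
A = rng.standard_normal((r, n))           # shared frozen matrix
B = rng.standard_normal((m, r))           # shared frozen matrix
d = rng.standard_normal(r)                # stand-in for a trained d vector
b = rng.standard_normal(m)                # stand-in for a trained b vector
x = rng.standard_normal((batch, n))

# Adapter path: frozen projection plus the scaled low-rank branch
h_adapter = x @ W0.T + (((x @ A.T) * d) @ B.T) * b

# Merge once: b scales the rows of B, d scales the rows of A
W_merged = W0 + (b[:, None] * B) @ (d[:, None] * A)

# Deployment path: a single matrix product, same cost as the original layer
h_merged = x @ W_merged.T

merged_matches = np.allclose(h_adapter, h_merged)
print(merged_matches)
```

After merging, the layer has the same shape and cost as the original one, at the price of storing a full $m \times n$ matrix per adapted layer instead of the two small vectors.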

7. References and Place in Adapter Landscape

VeRA was introduced by Kopiczko et al. (2023) and is positioned as a successor to LoRA in the PEFT (Parameter-Efficient Fine-Tuning) landscape, improving storage and efficiency by leveraging the empirical observation that adaptation of large pre-trained networks has low intrinsic dimension. PVeRA extends it with a probabilistic, variational Bayesian formulation that enables calibrated prediction (Fillioux et al., 8 Dec 2025). Adoption has spanned language, vision, and instruction tuning tasks.

These advances reinforce the utility of random projections and diagonal scaling as a scalable, general adaptation strategy in the era of increasingly large pre-trained foundation models.
