VeRA Adapter: Efficient Neural Adaptation

Updated 10 December 2025

VeRA Adapter is a framework that adapts large pre-trained neural networks by sharing a global pair of frozen random matrices while learning small, per-layer scaling vectors.
It achieves significant parameter and storage reductions compared to LoRA, cutting trainable parameters from hundreds of thousands to about 25K while maintaining competitive performance.
The approach extends to a probabilistic variant, PVeRA, which adds uncertainty estimation capabilities for calibrated predictions in both NLP and vision tasks.

VeRA Adapter (Vector-based Random Matrix Adaptation) is a parameter-efficient framework for adapting large pre-trained neural networks, such as Transformers and Vision Transformers (ViTs), with minimal additional trainable parameters. VeRA achieves significant reductions in both parameter count and storage requirements relative to prior low-rank adaptation approaches such as LoRA, while maintaining downstream task performance. The technique is underpinned by sharing frozen random low-rank projection matrices throughout the model and learning only lightweight, per-layer scaling vectors, enabling highly compressed adapter modules for large language and vision models (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).

1. Mathematical Formulation of VeRA

Let $W_0 \in \mathbb{R}^{m \times n}$ be a frozen pre-trained weight matrix, as typically found in a Transformer block (e.g., the Q or V projection in MHSA, or an MLP linear layer). Rather than fully fine-tuning $W_0$, VeRA models the adaptation as an additive low-rank update:

$h = W_0 x + \Delta W x$

Whereas LoRA parameterizes the update as $\Delta W = B A$ with low-rank factors $A \in \mathbb{R}^{r \times n}$, $B \in \mathbb{R}^{m \times r}$ (with $A$, $B$ learned per layer), VeRA shares a single global random pair $(A, B)$ across all adapted layers and learns only per-layer diagonal scaling vectors:

$d^{(\ell)} \in \mathbb{R}^r$: scales the $r$ rows of $A$ (the rank dimension)
$b^{(\ell)} \in \mathbb{R}^m$: scales the $m$ rows of $B$ (the output dimension)

For layer $\ell$, the adapted update is:

$\Delta W^{(\ell)} = \operatorname{diag}(b^{(\ell)})\, B\, \operatorname{diag}(d^{(\ell)})\, A$

with the new projected activation:

$h^{(\ell)} = W_0^{(\ell)} x + \operatorname{diag}(b^{(\ell)})\, B\, \operatorname{diag}(d^{(\ell)})\, (A x)$

Effectively, only the small vectors $d^{(\ell)}$ and $b^{(\ell)}$ (of sizes $r$ and $m$, respectively) are trained and stored per layer, while $A$ and $B$ are held constant and shared throughout the network (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).
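The two forms of the update, the dense matrix $\Delta W^{(\ell)}$ and the staged computation on $A x$, can be checked against each other numerically. The following is a dependency-free sketch rather than a reference implementation: the helper names (`matvec`, `random_matrix`, `vera_delta`, `vera_forward`) and the tiny sizes are assumptions made for illustration, and Gaussian entries stand in for whatever initialization is actually used.

```python
import random


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def random_matrix(rng, rows, cols):
    """Gaussian matrix as a list of rows (illustrative stand-in for A, B, W0)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def vera_delta(b, B, d, A):
    """Dense m x n update diag(b) @ B @ diag(d) @ A."""
    r, n = len(A), len(A[0])
    return [
        [b[i] * sum(B[i][k] * d[k] * A[k][j] for k in range(r)) for j in range(n)]
        for i in range(len(B))
    ]


def vera_forward(x, W0, b, B, d, A):
    """h = W0 x + diag(b) B diag(d) (A x), without forming the m x n update."""
    low = [dk * v for dk, v in zip(d, matvec(A, x))]    # length r
    up = [bi * v for bi, v in zip(b, matvec(B, low))]   # length m
    return [h0 + u for h0, u in zip(matvec(W0, x), up)]


rng = random.Random(0)
m, n, r = 4, 3, 2  # tiny illustrative sizes
W0, B, A = random_matrix(rng, m, n), random_matrix(rng, m, r), random_matrix(rng, r, n)
b = [rng.gauss(0.0, 1.0) for _ in range(m)]
d = [rng.gauss(0.0, 1.0) for _ in range(r)]
x = [rng.gauss(0.0, 1.0) for _ in range(n)]

staged = vera_forward(x, W0, b, B, d, A)

# Fold the update into the base weight and apply the merged matrix once.
dW = vera_delta(b, B, d, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, dW)]
merged = matvec(W_merged, x)

max_gap = max(abs(s - t) for s, t in zip(staged, merged))
print(max_gap < 1e-9)  # True
```

The staged form costs two thin matrix-vector products plus two elementwise scalings, while the merged form shows that the trained vectors can be folded into a single $m \times n$ weight once training is done.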

2. Comparison to LoRA and Other Adapters

The key distinction between LoRA and VeRA lies in parameterization and resource requirements:

Aspect	LoRA	VeRA
Trainable params (per layer)	$m \cdot r + r \cdot n$	$m + r$
Low-rank matrices	Separate $A^{(\ell)}, B^{(\ell)}$ per layer	Global $A, B$ (frozen, shared)
Storage	$A^{(\ell)}, B^{(\ell)}$ per layer	Seed, plus $b^{(\ell)}$, $d^{(\ell)}$ per layer
Typical reduction	(baseline)	1–2 orders of magnitude fewer params than LoRA for equal $r$
Empirical performance	Baseline	Matches or outperforms LoRA for same FLOPs/params

For example, adapting $L = 24$ layers of a model with $d_\mathrm{model} = 1024$ (so $m = n = 1024$) and $r = 16$:

LoRA: $2 \cdot L \cdot 1024 \cdot 16 = 2 \cdot 24 \cdot 16{,}384 = 786{,}432$ trainable params
VeRA: $L \cdot (1024 + 16) = 24 \cdot 1040 = 24{,}960$ trainable params
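The same arithmetic can be reproduced in a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes square projections ($m = n = d_\mathrm{model}$) and 24 adapted layers, which is what the figures above imply.

```python
def lora_params(num_layers: int, m: int, n: int, r: int) -> int:
    """Trainable parameters for LoRA: one (m x r) and one (r x n) factor per layer."""
    return num_layers * (m * r + r * n)


def vera_params(num_layers: int, m: int, r: int) -> int:
    """Trainable parameters for VeRA: one length-m and one length-r vector per layer."""
    return num_layers * (m + r)


d_model, rank, layers = 1024, 16, 24  # assumed: square projections, 24 adapted layers

lora = lora_params(layers, d_model, d_model, rank)
vera = vera_params(layers, d_model, rank)

print(lora)                   # 786432
print(vera)                   # 24960
print(round(lora / vera, 1))  # 31.5
```

For square layers the ratio is $2rm / (m + r)$, which approaches $2r$ when $m \gg r$, so the relative saving grows with the rank.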

Empirical studies on GLUE, E2E, and image recognition tasks show that VeRA performs close to LoRA, and in some cases slightly surpasses it, despite this compression (Kopiczko et al., 2023).

3. Implementation and Initialization

The global matrices $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ are sampled once (e.g., with Kaiming initialization) and never updated. The per-layer scaling vectors are initialized with $d^{(\ell)}$ set to a constant (e.g., $0.1$ or $10^{-7}$) and $b^{(\ell)} = 0$, so that $\Delta W^{(\ell)} = 0$ at initialization and the adapted model starts out identical to the pre-trained one.

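A minimal PyTorch-style implementation (the sizes `m`, `n`, `r` are illustrative):

```python
import torch
import torch.nn as nn

m, n, r = 1024, 1024, 16  # illustrative sizes
d_init = 0.1              # constant initial value for d

# Global frozen pair, sampled once and shared by every adapted layer.
A = nn.Parameter(torch.empty(r, n), requires_grad=False)
B = nn.Parameter(torch.empty(m, r), requires_grad=False)
nn.init.kaiming_uniform_(A)
nn.init.kaiming_uniform_(B)


class VeRAAdapter(nn.Module):
    def __init__(self, m, r):
        super().__init__()
        self.d = nn.Parameter(torch.full((r,), d_init))  # rank-side scaling
        self.b = nn.Parameter(torch.zeros(m))            # output-side scaling, zero at init

    def forward(self, x, W0):
        # x: (batch, n), W0: frozen (m, n)
        h0 = x.matmul(W0.T)        # (batch, m)
        y = A.matmul(x.T)          # (r, batch)
        y = self.d[:, None] * y    # scale each of the r rows
        y = B.matmul(y)            # (m, batch)
        y = self.b[:, None] * y    # scale each of the m rows
        return h0 + y.T            # (batch, m)
```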

At deployment, only the seed for $(A,B)$ and the vectors $d^{(\ell)}, b^{(\ell)}$ of every adapted layer need to be stored. Backpropagation updates only the scaling vectors (Kopiczko et al., 2023).
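The seed-only storage scheme can be illustrated without any framework. In the sketch below, Python's `random.Random` stands in for the framework's random number generator, Gaussian entries stand in for Kaiming initialization, and the `checkpoint` dictionary layout is purely hypothetical.

```python
import random


def frozen_pair(seed: int, m: int, n: int, r: int):
    """Regenerate the shared frozen matrices (A: r x n, B: m x r) from a seed."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(r)]
    B = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(m)]
    return A, B


m, n, r, num_layers, seed = 8, 8, 4, 3, 1234  # tiny illustrative sizes

# Hypothetical checkpoint layout: the seed plus the trained vectors of each layer.
checkpoint = {
    "seed": seed,
    "layers": [{"d": [0.1] * r, "b": [0.0] * m} for _ in range(num_layers)],
}

# Numbers actually stored, versus the numbers needed to save A and B explicitly.
stored = 1 + sum(len(layer["d"]) + len(layer["b"]) for layer in checkpoint["layers"])
dense = r * n + m * r

# Two independent reconstructions from the same seed are identical.
A1, B1 = frozen_pair(checkpoint["seed"], m, n, r)
A2, B2 = frozen_pair(checkpoint["seed"], m, n, r)
print(A1 == A2 and B1 == B2)  # True
print(stored, dense)          # 37 64
```

One practical caveat: this only works if the generator reproduces the same stream wherever the adapter is loaded, which should be verified across library versions and devices before relying on it.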

4. Empirical Results and Benchmarks

VeRA was evaluated in various settings, including NLP (GLUE, E2E, instruction tuning with Llama 7B/13B) and vision (CIFAR100, Food101, Flowers102, RESISC45 with ViT-B/L).

GLUE: RoBERTa-base (Q/V adapted, $r=1024$): VeRA $0.043$M params, $85.2$ average score, vs. LoRA $0.3$M params, $86.6$.
Image classification: ViT-B on CIFAR100, $r=256$: VeRA $24.6$K params vs. LoRA $294$K params; accuracy within $\pm 1$ pt.
Instruction tuning: Llama2-7B, $r=1024$: VeRA $1.6$M params, MT-Bench $4.77$, vs. LoRA $159.9$M params, $5.03$.

In these settings, VeRA stays close to LoRA (slightly below it on the GLUE and MT-Bench figures above) while using a fraction of the trainable and stored adapter weights (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).

5. Extensions: Probabilistic VeRA (PVeRA)

PVeRA is a probabilistic variant that enables uncertainty estimation and confidence-aware predictions while preserving VeRA’s parameter efficiency (Fillioux et al., 8 Dec 2025). The key modifications:

Treat the low-rank code $z_q$ as a latent variable with a learned mean $\mu_q(x)$ and standard deviation $\sigma_q(x)$.
At each forward pass, sample $z_q = \mu_q(x) + \sigma_q(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (reparameterization trick).
Add a KL-divergence penalty to the loss:

$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{classification} + \beta \sum_{\text{layers}} D_{\mathrm{KL}}\big(q(z_q \mid x) \,\|\, \mathcal{N}(0, I)\big)$

Hyperparameter $\beta$ controls regularization strength.
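If the approximate posterior is a diagonal Gaussian with mean $\mu$ and standard deviation $\sigma$ (an assumption here, suggested by the elementwise sampling rule above), the KL term has the standard closed form $\tfrac{1}{2} \sum_i (\mu_i^2 + \sigma_i^2 - 1 - \ln \sigma_i^2)$. The sketch below evaluates that expression and the weighted total; the function names and the toy inputs are illustrative, not taken from the PVeRA code.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma))


def total_loss(task_loss, layer_stats, beta):
    """Task loss plus beta times the KL terms summed over adapted layers."""
    return task_loss + beta * sum(kl_to_standard_normal(mu, sigma) for mu, sigma in layer_stats)


# A posterior equal to the prior contributes nothing.
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0

# Shifting one unit-variance coordinate by 1 costs 0.5 nats.
print(kl_to_standard_normal([1.0], [1.0]))  # 0.5

# Two toy layers, one shifted and one matching the prior, with beta = 0.5.
layers = [([1.0], [1.0]), ([0.0], [1.0])]
print(total_loss(1.0, layers, beta=0.5))  # 1.25
```

Larger $\beta$ pulls each layer's code distribution toward the standard normal prior, trading task fit for regularization.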

At inference, two modes are supported:

Deterministic: set $z_q = \mu_q(x)$ and merge into $W_0$, yielding zero overhead.
Probabilistic: sample multiple $z_q$ for calibrated uncertainty.
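The two inference modes can be sketched on a toy example. Everything below is illustrative: the values of `mu` and `sigma`, the linear `readout` that stands in for the rest of the network, and the sample count are assumptions, and the source does not specify how $\mu_q(x)$ and $\sigma_q(x)$ are produced.

```python
import random
import statistics


def sample_code(mu, sigma, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def predict(z, readout):
    """Toy stand-in for the rest of the network: a linear readout of the code."""
    return sum(w * v for w, v in zip(readout, z))


mu, sigma = [0.5, -1.0, 2.0], [0.1, 0.2, 0.05]  # illustrative posterior statistics
readout = [1.0, 0.5, -0.25]                     # hypothetical downstream weights
rng = random.Random(0)

# Deterministic mode: use the mean code directly.
deterministic = predict(mu, readout)
print(deterministic)  # -0.5

# Probabilistic mode: average several sampled predictions and report their spread.
samples = [predict(sample_code(mu, sigma, rng), readout) for _ in range(2000)]
mc_mean = statistics.fmean(samples)
mc_std = statistics.stdev(samples)
print(abs(mc_mean - deterministic) < 0.05)  # True
```

The spread `mc_std` is the quantity that probabilistic mode adds: it is zero by construction in deterministic mode and grows with `sigma`.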

On VTAB-1k, PVeRA yielded $71.4\%$ average accuracy (30K params) vs $69.9\%$ for VeRA and $70.5\%$ for LoRA (393K params), with statistically significant improvements on several tasks (Fillioux et al., 8 Dec 2025).

6. Practical Recommendations and Guidelines

Rank Selection: Start with small $r$ (1–4) and increase as needed for the task; $r=256$ was the best-performing setting on VTAB-1k (Fillioux et al., 8 Dec 2025).
Learning Rate: Use a higher learning rate for the adapter vectors ($b$, $d$) than for the head; e.g., $1\mathrm{e}{-2}$ for adapters, $4\mathrm{e}{-3}$ for the head.
Storage: Only a seed (to reconstruct $A$, $B$) and the small per-layer vectors need to be stored or transmitted for deployment.
Adapter Placement: In vision models, adapting both Q and V branches outperforms Q alone, V alone, or all projections (Fillioux et al., 8 Dec 2025).
Probabilistic Extension: Use PVeRA for tasks requiring confidence intervals or out-of-distribution detection.
Deployment: For pure efficiency, use the deterministic (weight-merged) inference mode to incur no compute overhead compared to the original model.
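The learning-rate recommendation maps naturally onto PyTorch's per-parameter-group options. The sketch below is one way to set it up, not the authors' training script: the `TinyAdapter` module, the layer sizes, and the choice of AdamW are assumptions, and only the two learning rates come from the guideline above.

```python
import torch
import torch.nn as nn


class TinyAdapter(nn.Module):
    """Hypothetical holder for one layer's trainable VeRA vectors."""

    def __init__(self, m, r):
        super().__init__()
        self.d = nn.Parameter(torch.full((r,), 0.1))
        self.b = nn.Parameter(torch.zeros(m))


adapters = nn.ModuleList([TinyAdapter(m=32, r=4) for _ in range(2)])
head = nn.Linear(32, 10)  # illustrative task head

# One parameter group per learning rate: adapter vectors at 1e-2, head at 4e-3.
optimizer = torch.optim.AdamW(
    [
        {"params": adapters.parameters(), "lr": 1e-2},
        {"params": head.parameters(), "lr": 4e-3},
    ]
)

print([group["lr"] for group in optimizer.param_groups])  # [0.01, 0.004]
```

Frozen weights, including the shared $A$ and $B$, are simply left out of both groups so the optimizer never touches them.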

7. References and Place in Adapter Landscape

VeRA was introduced by Kopiczko et al. (2023) as a successor to LoRA in the PEFT (Parameter-Efficient Fine-Tuning) landscape. It improves storage and parameter efficiency by exploiting the empirical observation that adapting large pre-trained networks has low intrinsic dimension. PVeRA extends this with a probabilistic, variational Bayesian formulation that enables calibrated prediction (Fillioux et al., 8 Dec 2025). The two methods have been applied to language, vision, and instruction-tuning tasks.

These results support random projections with diagonal scaling as a scalable, general adaptation strategy for increasingly large pre-trained foundation models.

References

Kopiczko et al. (2023). VeRA: Vector-based Random Matrix Adaptation.

Fillioux et al. (2025). PVeRA: Probabilistic Vector-Based Random Matrix Adaptation.
