VeRA Adapter: Efficient Neural Adaptation
- VeRA Adapter is a framework that adapts large pre-trained neural networks by sharing a global pair of frozen random matrices while learning small, per-layer scaling vectors.
- It achieves significant parameter and storage reductions compared to LoRA, reducing trainable parameters from hundreds of thousands to as few as 24K while maintaining competitive performance.
- The approach extends to a probabilistic variant, PVeRA, which adds uncertainty estimation capabilities for calibrated predictions in both NLP and vision tasks.
VeRA Adapter (Vector-based Random Matrix Adaptation) is a parameter-efficient framework for adapting large pre-trained neural networks, such as Transformers and Vision Transformers (ViTs), with minimal additional trainable parameters. VeRA achieves significant reductions in both parameter count and storage requirements relative to prior low-rank adaptation approaches such as LoRA, while maintaining downstream task performance. The technique is underpinned by sharing frozen random low-rank projection matrices throughout the model and learning only lightweight, per-layer scaling vectors, enabling highly compressed adapter modules for large language and vision models (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).
1. Mathematical Formulation of VeRA
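Let $W_0 \in \mathbb{R}^{m \times n}$ be a frozen pre-trained weight matrix, as typically found in a Transformer block (e.g., the Q or V projection in MHSA, or an MLP linear layer). Rather than fully fine-tuning $W_0$, VeRA models the adaptation as an additive low-rank update:
$$W = W_0 + \Delta W$$
Where LoRA parameterizes the update as $\Delta W = BA$ with low-rank factors $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$ (with $r \ll \min(m, n)$, learned per layer), VeRA instead shares a single global random pair $(A, B)$ across all adapted layers and learns only per-layer diagonal scaling vectors:
- $d^{(\ell)} \in \mathbb{R}^{r}$: scaling for $A$ (input-side), with $\Lambda_d^{(\ell)} = \mathrm{diag}(d^{(\ell)})$
- $b^{(\ell)} \in \mathbb{R}^{m}$: scaling for $B$ (output-side), with $\Lambda_b^{(\ell)} = \mathrm{diag}(b^{(\ell)})$
For layer $\ell$, the adapted update is:
$$\Delta W^{(\ell)} = \Lambda_b^{(\ell)} B \Lambda_d^{(\ell)} A$$
with the new projected activation:
$$h = W_0 x + \Lambda_b^{(\ell)} B \Lambda_d^{(\ell)} A x$$
Effectively, only the small vectors $d^{(\ell)}$ and $b^{(\ell)}$ (of sizes $r$ and $m$, respectively) are trained and stored per layer, while $A$, $B$ are held constant and shared throughout the network (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).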
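The formulation above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it writes the update as diag(b) · B · diag(d) · A, with `b` of size `m` and `d` of size `r`, and all sizes, random values, and the helper name `vera_delta` are illustrative assumptions. It suggests that the two diagonal multiplications reduce to cheap elementwise scaling, so the full `m × n` update need not be materialized during training.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 3                      # illustrative sizes only

A = rng.standard_normal((r, n))        # shared frozen random matrix
B = rng.standard_normal((m, r))        # shared frozen random matrix
d = rng.standard_normal(r)             # per-layer trainable vector (size r)
b = rng.standard_normal(m)             # per-layer trainable vector (size m)


def vera_delta(A, B, d, b):
    """Explicit update diag(b) @ B @ diag(d) @ A, shape (m, n)."""
    return np.diag(b) @ B @ np.diag(d) @ A


x = rng.standard_normal(n)

# Path 1: build the dense (m, n) update, then apply it to x.
update_explicit = vera_delta(A, B, d, b) @ x

# Path 2: project to rank r, scale by d, project to m, scale by b.
update_cheap = b * (B @ (d * (A @ x)))

# With b set to zero the update vanishes, whatever d is.
zero_update = vera_delta(A, B, d, np.zeros(m))
```

Both paths give the same vector up to floating-point error; the second only ever touches vectors of length `r` and `m` besides the two shared matrices.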
2. Comparison to LoRA and Other Adapters
The key distinction between LoRA and VeRA lies in parameterization and resource requirements:
| | LoRA | VeRA |
|---|---|---|
| Trainable params (per layer) | $r(m + n)$ | $r + m$ |
| Low-rank matrices | Separate per layer | Global (frozen, shared) |
| Storage | $A$, $B$ for every layer | Seed, $d$, $b$ |
| Typical reduction | Baseline | 1–2 orders of magnitude fewer params for equal $r$ |
| Empirical performance | Baseline | Matches or outperforms for same FLOPs/params |
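For example, adapting $L$ layers on a model with square projections of width $d_{\text{model}}$ (so $m = n = d_{\text{model}}$):
- LoRA: $2 L\, d_{\text{model}}\, r$ trainable params
- VeRA: $L\,(d_{\text{model}} + r)$ trainable params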
Empirical studies on GLUE, E2E, and image recognition tasks show VeRA matches or slightly surpasses LoRA performance despite this compression (Kopiczko et al., 2023).
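To make the parameter arithmetic concrete, here is a small worked example in plain Python. The function names are hypothetical, and the setup (24 adapted projections of width 768, LoRA rank 8, VeRA rank 256) is an assumption chosen because it reproduces the ViT-B counts quoted in the benchmark section.

```python
def lora_params(num_layers: int, m: int, n: int, r: int) -> int:
    """Trainable parameters for LoRA: an (m x r) and an (r x n) factor per layer."""
    return num_layers * r * (m + n)


def vera_params(num_layers: int, m: int, r: int) -> int:
    """Trainable parameters for VeRA: vectors d (size r) and b (size m) per layer."""
    return num_layers * (r + m)


# Assumed setup: 24 adapted projections (e.g. Q and V in 12 blocks), width 768.
num_layers, width = 24, 768

lora = lora_params(num_layers, width, width, r=8)    # assumed LoRA rank
vera = vera_params(num_layers, width, r=256)         # assumed VeRA rank

print(f"LoRA: {lora:,}  VeRA: {vera:,}  ratio: {lora / vera:.1f}x")
```

Under these assumptions the counts come to 294,912 for LoRA and 24,576 for VeRA, a 12x reduction even though the VeRA rank is 32 times larger. The gap widens further when both methods use the same rank.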
3. Implementation and Initialization
The global matrices $A$ and $B$ are sampled only once (e.g., with Kaiming initialization) and are not updated thereafter. Per-layer scaling vectors are initialized with $d = d_{\text{init}}$ (e.g., $0.1$) and $b = 0$, so the initial low-rank update is zero and the output of $W_0$ is unchanged.
A canonical PyTorch-style implementation:
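```python
import torch
import torch.nn as nn

# Illustrative sizes: W0 is (m, n), shared rank r, initial value for d.
m, n, r, d_init = 768, 768, 256, 0.1

# Global frozen random matrices: sampled once from a fixed seed, shared by all layers.
torch.manual_seed(0)
A = nn.Parameter(torch.empty(r, n), requires_grad=False)
B = nn.Parameter(torch.empty(m, r), requires_grad=False)
nn.init.kaiming_uniform_(A)
nn.init.kaiming_uniform_(B)


class VeRAAdapter(nn.Module):
    def __init__(self, m, r):
        super().__init__()
        # Only these two vectors are trained for this layer.
        self.d = nn.Parameter(torch.full((r,), d_init))
        self.b = nn.Parameter(torch.zeros(m))  # zero init: no update at start

    def forward(self, x, W0):
        # x: (batch, n), W0: frozen (m, n)
        h0 = x.matmul(W0.T)        # (batch, m) base output
        y = A.matmul(x.T)          # (r, batch)
        y = self.d[:, None] * y    # scale the rank-r code by d
        y = B.matmul(y)            # (m, batch)
        y = self.b[:, None] * y    # scale the output by b
        return h0 + y.T            # (batch, m)
```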
On model deployment, only the seed for $A$, $B$ and the per-layer vectors $d$, $b$ must be stored. Backpropagation updates only the scaling vectors (Kopiczko et al., 2023).
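The seed-based storage scheme can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: it uses NumPy instead of PyTorch, a plain Gaussian as a stand-in for the actual initializer, and a hypothetical checkpoint layout and helper name (`make_shared_matrices`).

```python
import numpy as np


def make_shared_matrices(seed: int, m: int, n: int, r: int):
    """Regenerate the frozen pair (A, B) deterministically from a seed.

    A plain Gaussian is a stand-in here; the real initializer is a design choice.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((r, n)).astype(np.float32)
    B = rng.standard_normal((m, r)).astype(np.float32)
    return A, B


m, n, r, num_layers, seed = 768, 768, 256, 24, 1234   # illustrative values

# Hypothetical checkpoint layout: one integer plus two small vectors per layer.
checkpoint = {
    "seed": seed,
    "d": [np.full(r, 0.1, dtype=np.float32) for _ in range(num_layers)],
    "b": [np.zeros(m, dtype=np.float32) for _ in range(num_layers)],
}

# At load time the large matrices are rebuilt rather than read from disk.
A, B = make_shared_matrices(checkpoint["seed"], m, n, r)
A_again, B_again = make_shared_matrices(checkpoint["seed"], m, n, r)

stored_floats = sum(v.size for v in checkpoint["d"] + checkpoint["b"])
shared_floats = A.size + B.size   # regenerated from the seed, never stored
```

One caveat worth keeping in mind: regenerating identical matrices requires the same random number generator and initializer on both the saving and loading side, so a deployment would likely need to pin those alongside the seed.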
4. Empirical Results and Benchmarks
VeRA was evaluated in various settings, including NLP (GLUE, E2E, instruction tuning with Llama 7B/13B) and vision (CIFAR100, Food101, Flowers102, RESISC45 with ViT-B/L).
- On GLUE: RoBERTa-base (adapt Q/V, $r = 1024$): $0.043$M params, $85.2$ avg. score, vs. LoRA $86.6$ ($0.3$M params)
- ViT-B on CIFAR100, rank 256: $24.6$K params (VeRA) vs. $294$K (LoRA), at comparable accuracy
- Instruction tuning: Llama2-7B, $1.6$M params, MT-Bench $4.77$ (vs. LoRA $5.03$, $159.9$M params)
Across all cases, VeRA achieves comparably high accuracy with a fraction of trainable and storable adapter weights (Kopiczko et al., 2023, Fillioux et al., 8 Dec 2025).
5. Extensions: Probabilistic VeRA (PVeRA)
PVeRA is a probabilistic variant that enables uncertainty estimation and confidence-aware predictions while preserving VeRA’s parameter efficiency (Fillioux et al., 8 Dec 2025). The key modifications:
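- Treat the low-rank code $z$ as a latent variable with a learned mean $\mu$ and standard deviation $\sigma$.
- At each forward pass, sample $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ (reparameterization trick).
- Add a KL-divergence penalty to the loss:
$$\mathcal{L} = \mathcal{L}_{\text{task}} + \beta\, D_{\mathrm{KL}}\big(q(z) \,\|\, p(z)\big)$$
Here $q(z) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ is the learned distribution over the code and $p(z)$ is the prior. The hyperparameter $\beta$ controls the regularization strength.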
At inference, two modes are supported:
- Deterministic: set $z = \mu$ and merge the adapter into $W_0$, yielding zero overhead.
- Probabilistic: sample multiple $z$ for calibrated uncertainty.
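The sampling step and the two inference modes can be illustrated with a short NumPy sketch. Several things here are assumptions not confirmed by the text: a diagonal Gaussian over the code, a standard normal prior for the KL term, and the function names, rank, and numeric values. How PVeRA actually produces the mean and variance is not covered; they are taken as given arrays.

```python
import numpy as np


def sample_code(mu, log_var, rng, deterministic=False):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    if deterministic:
        return mu                      # deterministic mode: use the mean
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ); the N(0, I) prior is an assumption."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)


rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.0, 2.0])                  # illustrative rank-4 code
log_var = np.log(np.array([0.25, 1.0, 4.0, 0.01]))

beta = 1e-3                             # assumed regularization weight
task_loss = 0.7                         # placeholder for the task loss value
total_loss = task_loss + beta * kl_to_standard_normal(mu, log_var)

# Probabilistic mode: the spread across samples serves as an uncertainty proxy.
samples = np.stack([sample_code(mu, log_var, rng) for _ in range(2000)])
spread = samples.std(axis=0)
```

Because the noise enters only through `eps`, gradients can flow to `mu` and `log_var` during training, and dropping the noise at inference recovers a deterministic layer.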
On VTAB-1k, PVeRA (30K params) was compared on average accuracy against VeRA and LoRA (393K params), with statistically significant improvements reported on several tasks (Fillioux et al., 8 Dec 2025).
6. Practical Recommendations and Guidelines
- Rank Selection: Start with a small rank $r$ (1–4) and increase it as needed for the task; Fillioux et al. report the best-performing rank for VTAB-1k (Fillioux et al., 8 Dec 2025).
- Learning Rate: Use a higher learning rate for the adapter vectors ($d$, $b$) than for the task head.
- Storage: Only a seed (to reconstruct $A$, $B$) and the small per-layer vectors need to be stored or transmitted for deployment.
- Adapter Placement: In vision models, adapting both Q and V branches outperforms Q alone, V alone, or all projections (Fillioux et al., 8 Dec 2025).
- Probabilistic Extension: Use PVeRA for tasks requiring confidence intervals or out-of-distribution detection.
- Deployment: For pure efficiency, use the deterministic (weight-merged) inference mode to incur no compute overhead compared to the original model.
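The weight-merged deployment mode mentioned in the last recommendation can be sketched as below. This is a NumPy illustration with hypothetical names and sizes, assuming the update has the form diag(b) · B · diag(d) · A; it folds the adapter into the base weight once, so that inference uses a single matrix of the original shape.

```python
import numpy as np


def merge_vera(W0, A, B, d, b):
    """Fold the adapter into the base weight: W0 + diag(b) @ B @ diag(d) @ A."""
    # Broadcasting scales row i of B by b[i] and column j by d[j],
    # which avoids building the diagonal matrices explicitly.
    return W0 + (b[:, None] * B * d[None, :]) @ A


rng = np.random.default_rng(0)
m, n, r, batch = 8, 6, 4, 3            # illustrative sizes only
W0 = rng.standard_normal((m, n))
A = rng.standard_normal((r, n))
B = rng.standard_normal((m, r))
d = rng.standard_normal(r)
b = rng.standard_normal(m)
X = rng.standard_normal((batch, n))

# Unmerged path: base output plus adapter output, as during training.
h_adapter = X @ W0.T + (((X @ A.T) * d) @ B.T) * b

# Merged path: one matmul with a weight of the same shape as the original layer.
W_merged = merge_vera(W0, A, B, d, b)
h_merged = X @ W_merged.T
```

The trade-off is that a merged weight is as large as the original layer, so merging suits serving a single task, while keeping the seed and vectors separate suits storing or switching between many adapters.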
7. References and Place in Adapter Landscape
VeRA was introduced by Kopiczko et al. in 2023 (Kopiczko et al., 2023), positioned as a successor in the PEFT (Parameter-Efficient Fine-Tuning) landscape, improving storage and efficiency over LoRA by leveraging the empirical observation of low intrinsic adaptation dimension in large pre-trained networks. PVeRA extends this with probabilistic, variational Bayesian formulations to further enable calibrated prediction (Fillioux et al., 8 Dec 2025). Adoption has spanned language, vision, and instruction tuning tasks.
These advances reinforce the utility of random projections and diagonal scaling as a scalable, general adaptation strategy in the era of increasingly large pre-trained foundation models.