FVAE-LoRA: Fine-Tuning via Latent Factorization
- The paper introduces FVAE-LoRA, which integrates a dual-latent VAE into the LoRA framework to explicitly separate task-salient from residual features.
- It employs a tailored ELBO objective with repulsive priors to enforce latent-space factorization, improving robustness to spurious correlations and distribution shifts.
- Empirical results across vision, language, and audio tasks demonstrate that FVAE-LoRA outperforms standard LoRA while keeping parameter increases minimal.
FVAE-LoRA is a parameter-efficient fine-tuning (PEFT) framework that augments the standard Low-Rank Adaptation (LoRA) method with an explicit latent-space factorization. By embedding a small variational autoencoder (VAE) at adaptation layers, FVAE-LoRA learns to disentangle task-salient from residual features in the low-rank update space, leading to increased robustness and downstream generalization, especially under spurious correlations and distribution shifts. Unlike standard LoRA variants, FVAE-LoRA utilizes a tailored Evidence Lower Bound (ELBO) objective to enforce this factorization without increasing inference-time parameter budgets (Kumar et al., 22 Oct 2025).
1. Architectural Design and Latent Factorization
FVAE-LoRA replaces the static LoRA low-rank update with an encoder-decoder (VAE) pipeline. Given a frozen linear transformation $W_0$ and input activation $x$, standard LoRA augments $W_0 x$ with $BAx$ for rank-$r$ factors $A$ and $B$. FVAE-LoRA instead introduces two parallel encoders, each mapping $x$ to its own latent code:
- The task-salient latent parameterizes the low-rank update and drives the downstream task.
- The residual latent captures residual, non-task-salient information.
Each encoder is a two-layer MLP producing the mean and variance of a diagonal-Gaussian posterior. The decoder is a shallow MLP that reconstructs the input activation from the latent codes. The two latents receive different Gaussian priors: the prior on the task-salient latent enforces compact, task-relevant codes, while the prior on the residual latent expands the residual space.
Only the task-salient latent is used to form output activations; the residual latent shapes the VAE loss and the factorization but has no downstream effect at inference.
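The data flow described in this section can be sketched for a single linear layer as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `z1`, `z2`, `up`, and `dec`, the tanh hidden layer, the single-matrix decoder, the up-projection from the task-salient code to the output, and the use of the posterior mean at inference are all hypothetical choices made to keep the example small.

```python
import math
import random

def linear(weights, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix (toy initialization, not from the source)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class GaussianEncoder:
    """Two-layer MLP emitting the mean and log-variance of a diagonal Gaussian."""
    def __init__(self, d_in, d_hidden, d_latent, rng):
        self.w1 = rand_matrix(d_hidden, d_in, rng)
        self.w_mu = rand_matrix(d_latent, d_hidden, rng)
        self.w_logvar = rand_matrix(d_latent, d_hidden, rng)

    def __call__(self, x):
        h = [math.tanh(v) for v in linear(self.w1, x)]
        return linear(self.w_mu, h), linear(self.w_logvar, h)

def sample(mu, logvar, rng):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def fvae_lora_forward(x, w0, up, enc_task, enc_resid, dec, rng, training=True):
    """Return (layer output, reconstruction); reconstruction is None at inference."""
    mu1, logvar1 = enc_task(x)
    # Assumption: inference uses the posterior mean of the task-salient code.
    z1 = sample(mu1, logvar1, rng) if training else mu1
    # Only the task-salient code contributes to the layer output.
    out = [a + b for a, b in zip(linear(w0, x), linear(up, z1))]
    if not training:
        return out, None  # residual encoder and decoder are skipped
    mu2, logvar2 = enc_resid(x)
    z2 = sample(mu2, logvar2, rng)
    x_hat = linear(dec, z1 + z2)  # decoder sees both codes (list concatenation)
    return out, x_hat

rng = random.Random(0)
d_in, d_hidden, d_latent, d_out = 4, 8, 2, 3
enc_task = GaussianEncoder(d_in, d_hidden, d_latent, rng)
enc_resid = GaussianEncoder(d_in, d_hidden, d_latent, rng)
w0 = rand_matrix(d_out, d_in, rng)       # frozen weight (toy values)
up = rand_matrix(d_out, d_latent, rng)   # hypothetical up-projection
dec = rand_matrix(d_in, 2 * d_latent, rng)
x = [0.5, -1.0, 0.25, 2.0]

train_out, x_hat = fvae_lora_forward(x, w0, up, enc_task, enc_resid, dec, rng, training=True)
infer_out, no_recon = fvae_lora_forward(x, w0, up, enc_task, enc_resid, dec, rng, training=False)
```

In this sketch the residual encoder and the decoder are touched only in training mode, which mirrors the statement that the residual latent has no downstream effect at inference.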
2. Training Objective and Information Factorization
The training objective couples the downstream task loss with a custom VAE-derived ELBO that explicitly encourages separation of the task-driven and residual subspaces. For a single activation at an adapted layer, the per-sample FVAE loss combines a reconstruction term, an information-bottleneck term, and a repulsive cross-prior term between the two latents.
Three coefficients tune, respectively, reconstruction versus adaptation, the strength of the information bottleneck, and the repulsive force between the two latents. Over a dataset, the total loss adds the FVAE loss to the downstream task loss with a weighting factor that balances FVAE regularization against the task loss; the weight ranges from one value for image tasks to another for text/audio tasks.
The key innovation is the repulsive cross-prior term, which, through the choice of non-overlapping priors, creates a geometric "repulsion" between the two latents and thereby improves disentanglement. Theoretically, this term decomposes into a difference of KL divergences, which is related to a lower bound on a Wasserstein distance.
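To make the idea of a KL-divergence difference concrete, the sketch below evaluates the closed-form KL divergence between diagonal Gaussians and measures how much closer a posterior sits to one prior than to another. The exact form of the paper's repulsive term is not given in this text, so `kl_gap` and the prior parameters used here are hypothetical illustrations only.

```python
import math

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ) in nats."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )

def kl_gap(mu_q, var_q, own_prior, other_prior):
    """KL to the other latent's prior minus KL to the posterior's own prior."""
    return (kl_diag_gauss(mu_q, var_q, *other_prior)
            - kl_diag_gauss(mu_q, var_q, *own_prior))

# Hypothetical, well-separated priors in two dimensions.
prior_a = ([0.0, 0.0], [1.0, 1.0])
prior_b = ([3.0, 3.0], [1.0, 1.0])

# A posterior that coincides with prior_a is far from prior_b.
posterior_mu, posterior_var = [0.0, 0.0], [1.0, 1.0]
same = kl_diag_gauss(posterior_mu, posterior_var, *prior_a)
gap = kl_gap(posterior_mu, posterior_var, prior_a, prior_b)
```

The gap grows with the squared distance between the prior means, which is one way to read the geometric "repulsion" described above: the further apart the priors, the more a posterior is penalized for drifting toward the wrong one.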
3. Optimization Protocol and Algorithmic Implementation
Training uses AdamW with linear warm-up and decay. One hyperparameter setting is fixed for all reported experiments, three further hyperparameters are chosen to balance capacity and bottleneck, and learning rates are set per configuration (ViT, Llama with FVAE-LoRA, and the LoRA baselines). For each batch, activations at adapted layers are encoded, sampled, and decoded, and the FVAE loss is accumulated. The LoRA update is performed only via the task-salient latent. Notably, $\beta$-VAE-style annealing is unnecessary because of the effect of the repulsive term.
The decoder path adds a ≈30% training-time overhead, but inference remains efficient because it requires only the task-salient encoder. Regularization is task-dependent: weight decay is 0.01 for vision and 0.0 for LLMs.
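The linear warm-up and decay schedule mentioned above can be sketched as a plain function of the step index. The step counts and peak learning rate below are hypothetical placeholders; the values used in the paper are not recoverable from this text.

```python
def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps

# Hypothetical schedule: 100 steps, 10 of them warm-up, peak rate 1e-3.
schedule = [lr_at(s, total_steps=100, warmup_steps=10, peak_lr=1e-3) for s in range(101)]
```

The peak is reached exactly at the end of warm-up (`schedule[10]`), and the rate returns to zero at the final step.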
4. Comparative Analysis with LoRA Variants
Standard LoRA treats the low-rank update as a direct, fully trainable projection with no explicit separation of task-relevant and nuisance factors. Other PEFT methods (e.g., AdaLoRA, DoRA, rsLoRA) adapt rank, magnitude, or singular-value structure but do not address semantic disentanglement within the update. FVAE-LoRA differs in learning two complementary latent spaces and promoting explicit geometric separation between them using VAEs and nontrivial priors, with only the task-salient latent driving downstream adaptation.
This semantic focus yields improved robustness to spurious correlations, because the residual latent absorbs signals that are not essential for the core task. Unlike standard LoRA, FVAE-LoRA modules cannot be merged back into the frozen weights, so the adapter remains a dynamic, input-dependent module at inference.
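The merging point can be illustrated with toy matrices. A static LoRA update is linear in the input, so it folds into the frozen weight; an update that passes through a nonlinear encoder is not, so no single merged matrix reproduces it. The tanh encoder below is a hypothetical stand-in for the task-salient encoder, and all numeric values are made up for illustration.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def add(u, v):
    return [p + q for p, q in zip(u, v)]

w0 = [[1.0, 2.0], [3.0, 4.0]]  # frozen weight (toy values)
b = [[1.0], [0.5]]             # up-projection, d_out x r with r = 1
a = [[2.0, -1.0]]              # down-projection, r x d_in

# Static LoRA: (W0 + BA) x equals W0 x + B (A x), so BA can be merged once.
merged = [add(row_w, row_ba) for row_w, row_ba in zip(w0, matmul(b, a))]
x = [1.0, 1.0]
static_unmerged = add(matvec(w0, x), matvec(b, matvec(a, x)))
static_merged = matvec(merged, x)

def dynamic_update(x):
    """Input-dependent update: the code is a nonlinear function of x."""
    code = [math.tanh(c) for c in matvec(a, x)]
    return matvec(b, code)

# A mergeable update must be additive in x; the nonlinear one is not.
x1, x2 = [1.0, 0.0], [0.0, 1.0]
update_of_sum = dynamic_update(add(x1, x2))
sum_of_updates = add(dynamic_update(x1), dynamic_update(x2))
```

Because `update_of_sum` and `sum_of_updates` differ, the dynamic update cannot be written as a fixed matrix applied to the input, which is why the adapter has to stay in the forward path at inference.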
5. Empirical Performance Across Modalities
FVAE-LoRA was evaluated across vision, language, and audio domains, with one shared setting used in all cases.
Image Classification (ViT-B/16 backbone, q/k LoRA):
| Method | Params % | DTD | EuroSAT | GTSRB | RESISC45 | SUN397 | SVHN | Avg |
|---|---|---|---|---|---|---|---|---|
| Full-FT | – | 78.12 | 98.30 | 98.85 | 94.35 | 69.34 | 97.34 | 89.38 |
| LoRA | 0.72 | 74.65 | 97.28 | 96.95 | 90.11 | 71.11 | 94.22 | 87.39 |
| FVAE-LoRA | 0.73 | 78.19 | 97.78 | 97.98 | 93.57 | 73.14 | 96.55 | 89.53 |
Language (Llama-3-8B, commonsense reasoning):
| Method | Params % | PIQA | SIQA | ARC-c | ARC-e | OBQA | HellaSwag | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 0.085 | 80.74 | 75.59 | 67.58 | 82.11 | 75.20 | 85.73 | 77.82 | 77.82 |
| HiRA | 0.085 | 88.63 | 80.40 | 81.66 | 93.56 | 87.20 | 94.48 | 85.87 | 87.40 |
| FVAE-LoRA | 0.085 | 88.96 | 81.58 | 81.06 | 92.72 | 86.20 | 95.30 | 88.95 | 87.82 |
GLUE (RoBERTa-base):
| Method | Params % | SST2 | CoLA | QNLI | MRPC | RTE | STSB | WNLI | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Full-FT | – | 94.77 | 62.43 | 91.97 | 89.40 | 79.53 | 90.30 | 56.30 | 80.67 |
| LoRA | 0.47 | 93.97 | 59.60 | 91.87 | 88.73 | 77.87 | 88.90 | 57.73 | 79.81 |
| FVAE-LoRA | 0.48 | 94.10 | 60.37 | 91.63 | 89.53 | 79.90 | 88.60 | 64.33 | 81.21 |
Audio (TIMIT, Wav2Vec2-Large):
| Method | Params % | PER ↓ |
|---|---|---|
| Full-FT | – | 7.48 |
| LoRA | 0.50 | 9.38 |
| FVAE-LoRA | 0.50 | 8.09 |
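Averaged over tasks, FVAE-LoRA outperforms standard LoRA in every domain, and on image classification and GLUE its average also exceeds full fine-tuning, despite training <1% of total parameters. On a few individual tasks it trails LoRA slightly (e.g., QNLI and STSB on GLUE), and on TIMIT it remains behind full fine-tuning.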
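The size of the gains over LoRA can be read directly off the tables. The short calculation below uses only the reported averages and PER values; the dictionary keys are descriptive labels chosen for this example.

```python
# Average scores copied from the tables above (accuracy in %).
avg_scores = {
    "image (ViT-B/16)": {"LoRA": 87.39, "FVAE-LoRA": 89.53},
    "commonsense (Llama-3-8B)": {"LoRA": 77.82, "FVAE-LoRA": 87.82},
    "GLUE (RoBERTa-base)": {"LoRA": 79.81, "FVAE-LoRA": 81.21},
}

# Absolute gain of FVAE-LoRA over LoRA, in percentage points.
gains = {
    task: round(scores["FVAE-LoRA"] - scores["LoRA"], 2)
    for task, scores in avg_scores.items()
}

# TIMIT phoneme error rate (lower is better): relative reduction vs. LoRA.
per_lora, per_fvae = 9.38, 8.09
relative_per_reduction = round(100 * (per_lora - per_fvae) / per_lora, 1)

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} points")
print(f"TIMIT PER: {relative_per_reduction}% relative reduction")
```

The gains are 2.14, 10.00, and 1.40 points respectively, and the PER drop from 9.38 to 8.09 is a relative reduction of about 13.8%.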
6. Robustness to Spurious Correlations and Worst-Group Generalization
Robustness was evaluated on benchmarks with train–test label/background decoupling: Animals (4 classes × 2 backgrounds), Waterbirds (land vs. water), and CelebA (hair color vs. gender). Metrics include worst-group (WG) accuracy, average (AVG) accuracy, and the disparity between average and worst-group accuracy.
| Method | Params % | ANIM.WG | ANIM.AVG | WATR.WG | WATR.AVG | CEL.WG | CEL.AVG | Disparity |
|---|---|---|---|---|---|---|---|---|
| LoRA | 0.72 | 54.8 | 88.2 | 75.5 | 90.4 | 40.0 | 96.1 | 34.8 |
| FVAE-LoRA | 0.73 | 62.0 | 89.6 | 75.8 | 91.0 | 43.3 | 95.8 | 31.7 |
By channeling causal features to the task-salient latent and spurious or environmental attributes to the residual latent, FVAE-LoRA improves worst-group accuracy (most notably by 7.2 points on Animals, from 54.8 to 62.0) and reduces the disparity from 34.8 to 31.7.
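The disparity column can be checked against the other columns of the table. The calculation below assumes that disparity is the AVG minus WG gap averaged over the three benchmarks; this definition is an inference from the reported numbers, not something stated explicitly in this text.

```python
# (worst-group, average) accuracy per benchmark, copied from the table above.
results = {
    "LoRA": {"Animals": (54.8, 88.2), "Waterbirds": (75.5, 90.4), "CelebA": (40.0, 96.1)},
    "FVAE-LoRA": {"Animals": (62.0, 89.6), "Waterbirds": (75.8, 91.0), "CelebA": (43.3, 95.8)},
}

def mean_disparity(per_benchmark):
    """Mean of (average accuracy - worst-group accuracy) across benchmarks."""
    gaps = [avg - wg for wg, avg in per_benchmark.values()]
    return sum(gaps) / len(gaps)

disparity = {method: mean_disparity(rows) for method, rows in results.items()}
animals_wg_gain = results["FVAE-LoRA"]["Animals"][0] - results["LoRA"]["Animals"][0]
```

Under this assumed definition the LoRA value comes out at 34.8, matching the table, and the FVAE-LoRA value at roughly 31.77, which agrees with the reported 31.7 up to rounding.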
7. Practical Considerations, Limitations, and Future Directions
Recommended settings include a fixed choice for the low-rank subspace, specific Gaussian priors for the two latents, two further fixed hyperparameters, and one tunable hyperparameter that trades off overfitting against underfitting (values at one extreme overfit; values at the other underfit). The weight of the FVAE loss relative to the primary task differs between image tasks and text/audio tasks. Learning rates and regularization follow best practices for each modality; AdamW is recommended.
The method is currently restricted to attention q/k projections; extending factorized adaptation to feed-forward networks or value matrices may be beneficial. The FVAE-LoRA adapter cannot be statically merged with the frozen weights, so inference modules remain dynamic and input-dependent. Training time increases by ≈30%, but inference cost remains similar to LoRA because the decoder is not used. Open questions include generalization under broader distribution shifts and under other adaptation paradigms.
FVAE-LoRA offers a framework in which a minimal two-latent VAE regularizes low-rank adapters for explicit semantic disambiguation. The resulting models exhibit improved robustness, reduced sensitivity to spurious correlations, and strong empirical performance across heterogeneous tasks and modalities (Kumar et al., 22 Oct 2025).