FVAE-LoRA: Fine-Tuning via Latent Factorization

Updated 5 June 2026
  • The paper introduces FVAE-LoRA, which integrates a dual-latent VAE into the LoRA framework to explicitly separate task-salient from residual features.
  • It employs a tailored ELBO objective with repulsive priors to enforce latent-space factorization, improving robustness to spurious correlations and distribution shifts.
  • Empirical results across vision, language, and audio tasks demonstrate that FVAE-LoRA outperforms standard LoRA while keeping parameter increases minimal.

FVAE-LoRA is a parameter-efficient fine-tuning (PEFT) framework that augments the standard Low-Rank Adaptation (LoRA) method with an explicit latent-space factorization. By embedding a small variational autoencoder (VAE) at adaptation layers, FVAE-LoRA learns to disentangle task-salient from residual features in the low-rank update space, leading to increased robustness and downstream generalization, especially under spurious correlations and distribution shifts. Unlike standard LoRA variants, FVAE-LoRA utilizes a tailored Evidence Lower Bound (ELBO) objective to enforce this factorization without increasing inference-time parameter budgets (Kumar et al., 22 Oct 2025).

1. Architectural Design and Latent Factorization

FVAE-LoRA replaces the static LoRA low-rank update with an encoder-decoder (VAE) pipeline. Given a frozen linear transformation $W \in \mathbb{R}^{k \times d}$ and an input activation $x \in \mathbb{R}^{d}$, standard LoRA replaces $W$ with $W + BA$, using rank-$r$ factors $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{k \times r}$. FVAE-LoRA introduces two parallel encoders $q_{\phi_1}, q_{\phi_2}$, which map $x$ to latent codes $z_1, z_2 \in \mathbb{R}^r$, respectively:

  • $z_1$ parameterizes the low-rank update that drives the downstream task.
  • $z_2$ captures residual or non-task-salient information.

Each encoder is a two-layer MLP producing a diagonal-Gaussian posterior, $q_{\phi_i}(z_i \mid x) = \mathcal{N}\big(\mu_i(x), \mathrm{diag}(\sigma_i^2(x))\big)$ for $i \in \{1, 2\}$. The decoder is a shallow MLP that reconstructs $x$ from $(z_1, z_2)$. Each latent has its own Gaussian prior: the prior for $z_1$ enforces compact, task-relevant codes, while the prior for $z_2$ expands the residual space. For one linear layer, the data flow is as follows: $x$ passes through both encoders to give $z_1$ and $z_2$; $z_1$ forms the low-rank update added to $Wx$; and the decoder reconstructs $x$ from both latents for the VAE loss.

Only $z_1$ is used to form output activations; $z_2$ shapes the VAE loss and the factorization but has no downstream effect at inference.
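The data flow described in this section can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden size, the `tanh` nonlinearity, the form `W x + B z_1` for the adapted output, and the use of the posterior mean at inference are all hypothetical choices made for the example.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the initialization scale is arbitrary."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

class GaussianEncoder:
    """Two-layer MLP mapping x in R^d to (mu, log_var) in R^r."""
    def __init__(self, d, hidden, r, rng):
        self.W1 = rand_matrix(hidden, d, rng)
        self.W_mu = rand_matrix(r, hidden, rng)
        self.W_logvar = rand_matrix(r, hidden, rng)

    def __call__(self, x):
        h = [math.tanh(a) for a in matvec(self.W1, x)]
        return matvec(self.W_mu, h), matvec(self.W_logvar, h)

def sample(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def adapted_forward(x, W, B, enc1, enc2, rng, training=True):
    """Return (output, z1, z2); z2 is only computed during training."""
    mu1, lv1 = enc1(x)
    # ASSUMPTION: sample z1 in training, use the posterior mean at inference.
    z1 = sample(mu1, lv1, rng) if training else mu1
    # ASSUMPTION: z1 enters the output as W x + B z1.
    out = [a + b for a, b in zip(matvec(W, x), matvec(B, z1))]
    z2 = None
    if training:
        mu2, lv2 = enc2(x)
        z2 = sample(mu2, lv2, rng)  # used by the decoder/VAE loss only
    return out, z1, z2
```

Because `z2` never touches `out`, the second encoder (and any decoder) can be skipped entirely at inference, which matches the statement that only $z_1$ affects output activations.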

2. Training Objective and Information Factorization

The training objective couples the downstream task loss with a custom VAE-derived ELBO that explicitly encourages separation of the task-driven and residual subspaces. For a single activation $x$ at an adapted layer, the FVAE objective combines a reconstruction term for $x$ given $(z_1, z_2)$, KL regularization of the latent posteriors toward their priors, and a repulsive cross-prior term.

Three coefficients tune, respectively, reconstruction versus adaptation, the information bottleneck on the latent code, and the repulsive force between the two latents.

Over a dataset, the total loss is the task loss plus the FVAE loss scaled by a weighting factor that balances FVAE regularization against the task loss; this factor takes one value for image tasks and another for text/audio tasks.

The key innovation is the repulsive cross-prior term, which, through the choice of nonoverlapping priors, creates a geometric “repulsion” between $z_1$ and $z_2$ and thereby improves disentanglement. Theoretically, this term decomposes into a difference of KL divergences, which is related to a lower bound on a Wasserstein distance.
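To make the KL terms concrete, the sketch below implements the standard closed-form KL divergence from a diagonal-Gaussian posterior to an isotropic Gaussian prior, then evaluates a KL difference between two priors. The prior means (0 and 3) and the function names are hypothetical, and the difference shown is only an illustration of how well-separated priors pull codes apart; it is not claimed to be the paper's exact repulsive term.

```python
import math

def kl_diag_gauss_to_iso(mu, log_var, prior_mean, prior_std):
    """KL( N(mu, diag(exp(log_var))) || N(prior_mean * 1, prior_std^2 * I) )."""
    total = 0.0
    for m, lv in zip(mu, log_var):
        var = math.exp(lv)
        # Per-dimension closed form:
        # log(s / sigma) + (var + (m - m0)^2) / (2 s^2) - 1/2
        total += (math.log(prior_std) - 0.5 * lv
                  + (var + (m - prior_mean) ** 2) / (2.0 * prior_std ** 2)
                  - 0.5)
    return total

def kl_difference(mu, log_var, own_prior, other_prior):
    """Illustrative quantity: KL to the other prior minus KL to the own prior.

    Each prior is a (mean, std) pair. Larger values mean the posterior sits
    closer to its own prior than to the other one.
    """
    return (kl_diag_gauss_to_iso(mu, log_var, *other_prior)
            - kl_diag_gauss_to_iso(mu, log_var, *own_prior))

# HYPOTHETICAL priors, chosen only to be well separated.
PRIOR_1 = (0.0, 1.0)
PRIOR_2 = (3.0, 1.0)
```

With these hypothetical priors, a unit-variance posterior centred at the origin has zero KL to `PRIOR_1` and a strictly positive KL to `PRIOR_2`, so the difference grows as the two codes occupy more distinct regions of latent space.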

3. Optimization Protocol and Algorithmic Implementation

Training uses AdamW with linear warm-up and decay. A single rank $r$ is used for all reported experiments, and the three ELBO coefficients are fixed to balance capacity and bottleneck. Learning rates are set per setting, with separate values for ViT, for FVAE-LoRA on Llama, and for the LoRA baselines. For each batch, activations at adapted layers are encoded, sampled, and decoded, and the FVAE loss is accumulated. The LoRA update is performed only via $z_1$. Notably, $\beta$-VAE annealing is unnecessary due to the effect of the repulsive term.

The decoder path adds roughly 30% training-time overhead, but inference remains efficient because it requires only the $z_1$ encoder. Regularization is task-dependent: weight decay is 0.01 for vision and 0.0 for LLMs.
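The linear warm-up and decay schedule mentioned above can be written as a small function. This is a generic sketch of such a schedule; the step counts and peak learning rate used when calling it are placeholders, not values from the paper.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp to peak_lr, then linear decay to 0.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        # Ramp from 0 at step 0 up to peak_lr at step == warmup_steps.
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

In a framework such as PyTorch, the same shape is typically obtained by passing a multiplicative factor of this form to a scheduler that wraps the AdamW optimizer.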

4. Comparative Analysis with LoRA Variants

Standard LoRA treats the low-rank update $BA$ as a direct, fully trainable projection with no explicit separation of task-relevant and nuisance factors. Other PEFT methods (e.g., AdaLoRA, DoRA, rsLoRA) adapt rank, magnitude, or singular-value structure but do not address semantic disentanglement within the update. FVAE-LoRA is distinctive in learning two complementary latent spaces and promoting explicit geometric separation between them using VAEs and nontrivial priors, with only $z_1$ driving downstream adaptation.

This semantic focus yields improved robustness to spurious correlations, as $z_2$ absorbs signals not essential for the core task. Unlike standard LoRA, whose update can be folded into the frozen weights, FVAE-LoRA modules cannot be merged back into $W$, so the method remains a dynamic, input-dependent module at inference.
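The merging point can be checked numerically. The sketch below uses tiny hand-picked matrices to show that a static LoRA update folds into a single matrix, while an input-dependent update is not additive in its input and so cannot be reproduced by any fixed matrix. The `tanh`-based update is a hypothetical stand-in for a nonlinear encoder, not the FVAE-LoRA encoder itself.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weights, k = d = 2
A = [[0.5, -0.5]]              # r x d with r = 1
B = [[1.0], [2.0]]             # k x r

def lora_forward(x):
    """Unmerged LoRA: W x + B (A x)."""
    return [a + b for a, b in zip(matvec(W, x), matvec(B, matvec(A, x)))]

BA = matmul(B, A)
W_MERGED = [[w + u for w, u in zip(wr, ur)] for wr, ur in zip(W, BA)]

def merged_forward(x):
    """Merged LoRA: (W + BA) x, one matrix and no extra module."""
    return matvec(W_MERGED, x)

def dynamic_update(x):
    """HYPOTHETICAL input-dependent update: B tanh(A x)."""
    return matvec(B, [math.tanh(a) for a in matvec(A, x)])
```

Any matrix $M$ satisfies $M(x_1 + x_2) = Mx_1 + Mx_2$, so an update that violates additivity, as `dynamic_update` does, has no merged-matrix equivalent.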

5. Empirical Performance Across Modalities

FVAE-LoRA was evaluated across vision, language, and audio domains, with the same rank $r$ used in all cases.

Image Classification (ViT-B/16 backbone, q/k LoRA):

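| Method | Params % | DTD | EuroSAT | GTSRB | RESISC45 | SUN397 | SVHN | Avg |
|---|---|---|---|---|---|---|---|---|
| Full-FT | | 78.12 | 98.30 | 98.85 | 94.35 | 69.34 | 97.34 | 89.38 |
| LoRA | 0.72 | 74.65 | 97.28 | 96.95 | 90.11 | 71.11 | 94.22 | 87.39 |
| FVAE-LoRA | 0.73 | 78.19 | 97.78 | 97.98 | 93.57 | 73.14 | 96.55 | 89.53 |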

Language (Llama-3-8B, commonsense reasoning):

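| Method | Params % | PIQA | SIQA | ARC-c | ARC-e | OBQA | HellaSwag | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|---|---|
| LoRA | 0.085 | 80.74 | 75.59 | 67.58 | 82.11 | 75.20 | 85.73 | 77.82 | 77.82 |
| HiRA | 0.085 | 88.63 | 80.40 | 81.66 | 93.56 | 87.20 | 94.48 | 85.87 | 87.40 |
| FVAE-LoRA | 0.085 | 88.96 | 81.58 | 81.06 | 92.72 | 86.20 | 95.30 | 88.95 | 87.82 |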

GLUE (RoBERTa-base):

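| Method | Params % | SST2 | CoLA | QNLI | MRPC | RTE | STSB | WNLI | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Full-FT | | 94.77 | 62.43 | 91.97 | 89.40 | 79.53 | 90.30 | 56.30 | 80.67 |
| LoRA | 0.47 | 93.97 | 59.60 | 91.87 | 88.73 | 77.87 | 88.90 | 57.73 | 79.81 |
| FVAE-LoRA | 0.48 | 94.10 | 60.37 | 91.63 | 89.53 | 79.90 | 88.60 | 64.33 | 81.21 |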

Audio (TIMIT, Wav2Vec2-Large):

| Method | Params % | PER (phoneme error rate, lower is better) |
|---|---|---|
| Full-FT | | 7.48 |
| LoRA | 0.50 | 9.38 |
| FVAE-LoRA | 0.50 | 8.09 |

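On average, FVAE-LoRA outperforms standard LoRA in every domain, although LoRA is marginally ahead on individual tasks such as QNLI and STSB. FVAE-LoRA also exceeds full fine-tuning on the image-classification and GLUE averages while training <1% of total parameters; on TIMIT, full fine-tuning retains the lowest PER.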
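As a quick sanity check on the tables above, the snippet below recomputes the image-classification Avg column as an unweighted mean of the per-task scores and derives the relative PER reduction of FVAE-LoRA over LoRA on TIMIT. The numbers are copied from the tables; treating Avg as an unweighted mean is an assumption that the arithmetic supports to within rounding.

```python
# (per-task scores, reported Avg) copied from the image-classification table.
IMAGE_SCORES = {
    "Full-FT":   ([78.12, 98.30, 98.85, 94.35, 69.34, 97.34], 89.38),
    "LoRA":      ([74.65, 97.28, 96.95, 90.11, 71.11, 94.22], 87.39),
    "FVAE-LoRA": ([78.19, 97.78, 97.98, 93.57, 73.14, 96.55], 89.53),
}

def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)

def max_avg_gap(table):
    """Largest absolute gap between the recomputed and the reported Avg."""
    return max(abs(mean(scores) - reported)
               for scores, reported in table.values())

def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline

# PER values copied from the audio table (LoRA vs. FVAE-LoRA).
PER_REDUCTION = relative_reduction(9.38, 8.09)
```

The recomputed averages agree with the reported column to within 0.01, and the PER figures correspond to a relative reduction of a little under 14% versus LoRA.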

6. Robustness to Spurious Correlations and Worst-Group Generalization

Robustness was evaluated on benchmarks in which labels are decoupled from backgrounds or other spurious attributes between training and test: Animals (4 classes × 2 backgrounds), Waterbirds (land vs. water), and CelebA (hair color vs. gender). Metrics are worst-group (WG) accuracy, average (AVG) accuracy, and disparity, whose reported values are consistent with the AVG minus WG gap averaged over the three datasets.

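| Method | Params % | Animals WG | Animals AVG | Waterbirds WG | Waterbirds AVG | CelebA WG | CelebA AVG | Disparity |
|---|---|---|---|---|---|---|---|---|
| LoRA | 0.72 | 54.8 | 88.2 | 75.5 | 90.4 | 40.0 | 96.1 | 34.8 |
| FVAE-LoRA | 0.73 | 62.0 | 89.6 | 75.8 | 91.0 | 43.3 | 95.8 | 31.7 |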

By channeling causal features to $z_1$ and spurious or environmental attributes to $z_2$, FVAE-LoRA improves worst-group accuracy on all three benchmarks (most notably by 7.2 points on Animals) and reduces performance disparity.
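The Disparity column can be reproduced from the WG and AVG columns. The sketch below assumes disparity is the mean of (AVG minus WG) over the three datasets; that definition is inferred from the table values rather than quoted from the paper, and it matches the reported numbers to within rounding.

```python
# (WG, AVG) accuracy pairs copied from the robustness table.
RESULTS = {
    "LoRA": {
        "Animals": (54.8, 88.2),
        "Waterbirds": (75.5, 90.4),
        "CelebA": (40.0, 96.1),
    },
    "FVAE-LoRA": {
        "Animals": (62.0, 89.6),
        "Waterbirds": (75.8, 91.0),
        "CelebA": (43.3, 95.8),
    },
}

def disparity(per_dataset):
    """Mean gap between average and worst-group accuracy across datasets."""
    gaps = [avg - wg for wg, avg in per_dataset.values()]
    return sum(gaps) / len(gaps)

def worst_group_gain(dataset):
    """Worst-group accuracy gain of FVAE-LoRA over LoRA, in points."""
    return RESULTS["FVAE-LoRA"][dataset][0] - RESULTS["LoRA"][dataset][0]
```

Calling `disparity(RESULTS["LoRA"])` and `disparity(RESULTS["FVAE-LoRA"])` recovers the two table entries, and `worst_group_gain("Animals")` gives the worst-group improvement on Animals.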

7. Practical Considerations, Limitations, and Future Directions

Recommended settings include a fixed rank $r$ for the low-rank subspace, specific Gaussian priors for $z_1$ and $z_2$, fixed values for the remaining ELBO coefficients, and one tuned coefficient that trades off overfitting against underfitting: settings too far in one direction overfit, and settings too far in the other underfit. The FVAE loss is weighted relative to the primary task, with one weight for image tasks and another for text/audio tasks. Learning rates and regularization follow the usual practice for each modality; AdamW is recommended.

Limitations include the current restriction to attention q/k projections; extending factorized adaptation to feed-forward networks or value matrices may be beneficial. The FVAE-LoRA adapter cannot be statically merged with $W$, so inference modules remain dynamic and input-dependent. Training time increases by roughly 30%, but inference cost remains similar to LoRA because the decoder is not used. Open questions include generalization under broader distribution shifts and other adaptation paradigms.

FVAE-LoRA offers a framework in which a minimal two-latent VAE regularizes low-rank adapters for explicit semantic disambiguation. The resulting models exhibit improved robustness, reduced sensitivity to spurious correlations, and strong empirical performance across heterogeneous tasks and modalities (Kumar et al., 22 Oct 2025).

