
Minor Component Adaptation (MiCA)

Updated 3 July 2026
  • MiCA is a parameter-efficient fine-tuning technique that restricts model adaptation to the minor singular subspace of weight matrices, enhancing knowledge transfer.
  • It employs spectral decomposition to isolate least significant singular vectors, reducing adapter parameters compared to methods like LoRA.
  • Empirical evaluations show up to a 5.9-fold improvement in knowledge acquisition, demonstrating robust performance in domain-specific adaptation.

Minor Component Adaptation (MiCA) is a parameter-efficient fine-tuning technique for LLMs that restricts model adaptation to the minor singular subspace of pre-trained weight matrices. Unlike approaches such as Low-Rank Adaptation (LoRA) that target dominant (major) singular components, MiCA leverages the least significant singular vectors—subspaces typically underutilized by standard pre-training—to enable more efficient, stable knowledge injection with a reduced adapter parameter footprint. Empirical evidence demonstrates up to a 5.9-fold improvement in knowledge acquisition relative to LoRA under optimal hyperparameters, while requiring only 6–60% of the adapter parameters used by LoRA (Rüdiger et al., 2 Apr 2026).

1. Theoretical Basis and Notation

MiCA is rooted in the spectral decomposition of transformer layer weights. A weight matrix $W \in \mathbb{R}^{d \times d}$ (with possible generalization to $W \in \mathbb{R}^{d_\text{out} \times d_\text{in}}$) is decomposed via Singular Value Decomposition (SVD) as $W = U \Sigma V^\top$, where $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{d \times d}$ are orthogonal, and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ contains the singular values sorted in descending order. The subspace spanned by the bottom-$r$ left singular vectors ($u_{d-r+1}, \ldots, u_d$) defines the minor singular subspace. This low-energy region is conventionally under-utilized by pre-trained models.
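
The decomposition above can be illustrated with a small NumPy sketch. It uses a random matrix as a stand-in for a pre-trained weight, which is an assumption made purely for illustration (real weights have a different spectrum), and the names `minor_left_subspace`, `W`, and `r` are hypothetical rather than taken from the paper.

```python
import numpy as np

def minor_left_subspace(W: np.ndarray, r: int):
    """Return the bottom-r left singular vectors of W and their singular values."""
    # np.linalg.svd returns singular values in descending order,
    # so the last r columns of U are the minor directions.
    U, S, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, -r:], S[-r:]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # stand-in for a pre-trained weight matrix
U_minor, s_minor = minor_left_subspace(W, r=8)

# The columns of U_minor are orthonormal.
assert np.allclose(U_minor.T @ U_minor, np.eye(8))

# Share of the squared Frobenius norm of W carried by the minor subspace.
all_singular_values = np.linalg.svd(W, compute_uv=False)
energy = float(np.sum(s_minor**2) / np.sum(all_singular_values**2))
print(f"minor-subspace energy fraction: {energy:.4f}")
```

The energy fraction makes the "low-energy" description concrete: the bottom-$r$ directions account for far less of $\|W\|_F^2$ than their share $r/d$ of the dimensions.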

2. MiCA Algorithmic Formulation

MiCA constrains all updates during fine-tuning to the minor singular subspace:

  • Subspace Selection: Fix a rank $r \ll d$ and extract $U_{\text{minor}} = U[:, d-r+1:d] \in \mathbb{R}^{d \times r}$, the matrix of minor left singular vectors.
  • Adapter Parameterization: Introduce a trainable coefficient matrix $A \in \mathbb{R}^{r \times d}$, initialized to zero, and freeze $U_{\text{minor}}$ (denoted as $B$).
  • Update Rule: The adaptation to $W$ is constrained as

$$\Delta W = \frac{\alpha}{r} \, B A = \frac{\alpha}{r} \, U_{\text{minor}} A$$

with global scaling $\alpha / r$. The fine-tuned weight is $W' = W + \Delta W$, ensuring $\Delta W$ has rank at most $r$ and is contained entirely within the minor subspace.
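
The two properties claimed for the update (rank at most $r$, and confinement to the minor subspace) can be checked numerically. The sketch below assumes the parameterization $\Delta W = (\alpha / r)\, U_{\text{minor}} A$ with a trainable coefficient matrix $A$; the function name `mica_delta`, the choice `alpha = r`, and the random test matrices are illustrative assumptions, not details from the paper.

```python
import numpy as np

def mica_delta(U_minor: np.ndarray, A: np.ndarray, alpha: float) -> np.ndarray:
    """Weight update whose columns lie in the span of the columns of U_minor."""
    r = U_minor.shape[1]
    return (alpha / r) * (U_minor @ A)

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.standard_normal((d, d))          # stand-in for a pre-trained weight
U, _, _ = np.linalg.svd(W)
U_major, U_minor = U[:, :-r], U[:, -r:]  # top d-r and bottom r left singular vectors

# Zero initialization leaves the pre-trained weight unchanged.
A = np.zeros((r, d))
assert np.allclose(W + mica_delta(U_minor, A, alpha=r), W)

# A random A stands in for a trained coefficient matrix.
A = rng.standard_normal((r, d))
delta = mica_delta(U_minor, A, alpha=r)  # alpha = r gives scale 1 (arbitrary choice)

rank = np.linalg.matrix_rank(delta)       # bounded by r
leak = np.linalg.norm(U_major.T @ delta)  # part of delta outside the minor subspace
```

Because the columns of $U$ are mutually orthogonal, `leak` is zero up to floating-point error regardless of the values in `A`.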

3. Optimization and Hyperparameter Regimes

MiCA fine-tuning involves grid search over:

  • Rank $r$: Typical values include 16, 32, 128
  • Learning Rate: e.g., 1e-4, 5e-4
  • Epochs: e.g., 4 or 8
  • Scaling $\alpha$: Usually held at a single fixed setting
  • Optional: LoRA-style dropout, weight decay, warmup ratio
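
A grid over these hyperparameters can be enumerated as a Cartesian product. The sketch below is a minimal illustration: the rank and epoch values are those listed above, the learning rates are the two that appear in the reported results, and the dictionary keys and the helper `iter_configs` are hypothetical names.

```python
from itertools import product

# Rank and epoch values as listed; learning rates as they appear in the
# reported results. Key names are illustrative.
GRID = {
    "rank": [16, 32, 128],
    "learning_rate": [1e-4, 5e-4],
    "epochs": [4, 8],
}

def iter_configs(grid):
    """Yield one dict per point of the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(GRID))
```

Each configuration would then be trained and scored on a held-out benchmark, with the best-scoring setting reported as "optimal".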

During optimization, both the original weight $W$ and the basis $U_{\text{minor}}$ are frozen; only the coefficient matrix $A$ is updated, using AdamW with weight decay and a cosine learning-rate schedule. Cross-entropy serves as the training loss for language modeling or multiple-choice QA. Gradients are clipped to a maximum norm of 1.0, and training runs in bfloat16 (bf16) precision. Beyond weight decay and the optional dropout, no regularization is applied other than the intrinsic rank constraint imposed by the parameterization.
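
The text names a cosine schedule and a gradient-norm limit of 1.0 but does not give their exact form. The sketch below shows one common formulation of each, under the assumptions of a linear warmup and decay to zero; the function names and the warmup shape are illustrative, and in practice the scheduler and clipping utilities of the training framework would normally be used instead.

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float, warmup_steps: int = 0) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

def clip_by_global_norm(grads, max_norm: float = 1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * (max_norm / norm) for g in grads]
```

Since only the coefficient matrix is trainable, the gradient list passed to the clipping step would contain that matrix's entries alone.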

4. Empirical Evaluation and Comparative Analysis

Downstream Tasks

MiCA’s effectiveness was evaluated on continued pre-training and factual knowledge transfer using two principal benchmarks:

  • BLOGS dataset: Continued pre-training on 30 paraphrased blog posts, evaluated on BLOGS-MC (300 GPT-4-generated multiple-choice questions), TruthfulQA, and HellaSwag.
  • HISTORY dataset: Training on a 100,000-token German history monograph, with evaluation on HISTORY-MC (102 questions) and HellaSwag.

Methods Compared

  • Full Fine-Tuning (Full FT): Updating all model parameters
  • LoRA: Standard low-rank adaptation, optimizing both low-rank factors $A$ and $B$
  • MiCA: Only the coefficient matrix $A$ is trained, with the minor-subspace basis $B = U_{\text{minor}}$ frozen

Parameter and Compute Analysis

| Model | Total Params | LoRA Adapter Params | MiCA Adapter Params | MiCA/LoRA % |
|------------|--------------|---------------------|---------------------|-------------|
| Llama-2-7B | 6,747M | 67M ($r=128$) | 4M ($r=16$) | 6% |
| Qwen2.5-7B | 7,626M | 10M ($r=32$) | 6M ($r=32$) | 60% |
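
The adapter counts can be reproduced with simple arithmetic under three assumptions that the text does not state explicitly: adapters sit on the query and value projections of every layer, LoRA trains both factors ($r(d_\text{in} + d_\text{out})$ parameters per matrix), and MiCA trains only an $r \times d_\text{in}$ coefficient matrix. The layer counts and projection shapes below are the commonly published configuration values for the two models and are also assumptions here; all names are illustrative.

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Both low-rank factors are trainable: (d_out x r) and (r x d_in)."""
    return r * (d_in + d_out)

def mica_params(d_out: int, d_in: int, r: int) -> int:
    """Only the (r x d_in) coefficient matrix is trainable; the basis is frozen."""
    return r * d_in

# Assumed (d_out, d_in) shapes of the query and value projections.
LLAMA2_7B = {"layers": 32, "targets": [(4096, 4096), (4096, 4096)]}
QWEN25_7B = {"layers": 28, "targets": [(3584, 3584), (512, 3584)]}

def total(count_fn, model, r: int) -> int:
    """Sum the per-matrix count over all targeted matrices and layers."""
    per_layer = sum(count_fn(d_out, d_in, r) for d_out, d_in in model["targets"])
    return model["layers"] * per_layer

counts = {
    "llama_lora_r128": total(lora_params, LLAMA2_7B, 128),
    "llama_mica_r16": total(mica_params, LLAMA2_7B, 16),
    "qwen_lora_r32": total(lora_params, QWEN25_7B, 32),
    "qwen_mica_r32": total(mica_params, QWEN25_7B, 32),
}
```

Under these assumptions the four totals round to 67M, 4M, 10M, and 6M, in line with the table. The large gap for Llama-2-7B comes mostly from the different optimal ranks (128 versus 16), not from the frozen factor alone.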

Performance Results

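| Method | Model | BLOGS-MC | TruthfulQA | HellaSwag | $r$ | LR | Epochs | Params |
|----------------|-----------------|----------|------------|-----------|-----|------|--------|--------|
| Baseline | Llama-2-7B-chat | 56.18 | 34.79 | 60.40 | | | | 6,747M |
| LoRA (optimal) | Llama-2-7B | 58.28 | 35.47 | 60.41 | 128 | 1e-4 | 8 | 67M |
| MiCA (optimal) | Llama-2-7B | 61.33 | 35.29 | 60.11 | 16 | 5e-4 | 4 | 4M |
| Baseline | Qwen2.5-7B | 72.91 | 43.27 | 60.60 | | | | 7,626M |
| LoRA (optimal) | Qwen2.5-7B | 73.87 | 42.95 | 60.95 | 32 | 5e-4 | 4 | 10M |
| MiCA (optimal) | Qwen2.5-7B | 75.63 | 43.38 | 61.62 | 32 | 5e-4 | 8 | 6M |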

MiCA achieves a 3.05-point absolute gain on BLOGS-MC over LoRA for Llama-2-7B (61.33 vs. 58.28) and a 1.76-point gain for Qwen2.5-7B (75.63 vs. 73.87), while using only 6–60% of LoRA's adapter parameters. Overall, the authors report up to a 5.9-fold improvement in knowledge acquisition relative to LoRA under optimized hyperparameters (Rüdiger et al., 2 Apr 2026).
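
The point gains quoted above follow directly from the BLOGS-MC column of the performance table. The short sketch below recomputes them; the dictionary layout and the helper `gains` are illustrative. The 5.9-fold figure is not derived here, since the text does not say how it is computed.

```python
# BLOGS-MC accuracies copied from the performance table.
BLOGS_MC = {
    "Llama-2-7B": {"baseline": 56.18, "lora": 58.28, "mica": 61.33},
    "Qwen2.5-7B": {"baseline": 72.91, "lora": 73.87, "mica": 75.63},
}

def gains(scores):
    """Absolute gains in accuracy points, rounded to two decimals."""
    return {
        "mica_over_lora": round(scores["mica"] - scores["lora"], 2),
        "mica_over_baseline": round(scores["mica"] - scores["baseline"], 2),
        "lora_over_baseline": round(scores["lora"] - scores["baseline"], 2),
    }

summary = {model: gains(scores) for model, scores in BLOGS_MC.items()}
```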

5. Ablation, Convergence, and Spectral Insights

Empirical analysis confirms that updates in the minor singular subspace are particularly effective for domain-specific knowledge injection:

  • Spectral Grounding: Confining updates to low-energy directions (minor singular vectors) helps prevent overwriting dominant model components that encode generic pre-trained knowledge.
  • Empirical Stability: MiCA’s learning curves indicate more rapid and stable convergence compared to LoRA or random subspace baselines.
  • Ablation Study (Qwen2.5-7B, $r=32$):

| Adaptation | BLOGS-MC Accuracy |
|--------------------------|-------------------|
| No FT (Instruct) | 72.91 |
| Major-r Adaptation | 74.21 |
| Random Subspace | 73.75 |
| Minor-r (MiCA) | 75.63 |

Minor singular directions outperform both major and random subspace adaptations, supporting the hypothesis that the least expressive directions are best suited for domain-specific adaptation.
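
The three subspace choices compared in the ablation differ only in which frozen basis is used. The sketch below shows one plausible way to construct them and measures how much of the weight's energy each subspace contains. It does not reproduce the accuracy numbers; the random stand-in weight, the QR-based random basis, and the helper names are assumptions made for illustration.

```python
import numpy as np

def subspace_bases(W: np.ndarray, r: int, seed: int = 0):
    """Orthonormal (d_out x r) bases: top-r, bottom-r, and random directions."""
    U, _, _ = np.linalg.svd(W)
    rng = np.random.default_rng(seed)
    # QR of a Gaussian matrix gives a random orthonormal basis.
    Q, _ = np.linalg.qr(rng.standard_normal((W.shape[0], r)))
    return {"major": U[:, :r], "minor": U[:, -r:], "random": Q}

def energy_fraction(W: np.ndarray, B: np.ndarray) -> float:
    """Share of ||W||_F^2 that lies in the column span of B."""
    return float(np.linalg.norm(B.T @ W) ** 2 / np.linalg.norm(W) ** 2)

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 64))   # stand-in for a pre-trained weight
fractions = {name: energy_fraction(W, B)
             for name, B in subspace_bases(W, r=8).items()}
```

The minor basis occupies the directions where $W$ has the least energy, the major basis the most, and a random basis falls in between, which is the sense in which minor-subspace updates avoid the dominant components of the pre-trained weight.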

6. Implementation and Practical Considerations

MiCA requires only a single SVD per layer, after which the minor component basis $U_{\text{minor}}$ remains frozen.
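
A single adapted layer can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `MiCALinear`, the $\alpha / r$ scaling, and the use of a reduced SVD (which ignores any exact null-space directions when $d_\text{out} > d_\text{in}$) are all choices made here, and no training loop is included.

```python
import numpy as np

class MiCALinear:
    """Linear map y = W x with a trainable update confined to minor directions."""

    def __init__(self, W: np.ndarray, r: int, alpha: float):
        U, _, _ = np.linalg.svd(W, full_matrices=False)  # one-time SVD
        self.W = W                           # frozen weight, shape (d_out, d_in)
        self.B = U[:, -r:]                   # frozen minor basis, shape (d_out, r)
        self.A = np.zeros((r, W.shape[1]))   # trainable coefficients, zero init
        self.scale = alpha / r

    def forward(self, x: np.ndarray) -> np.ndarray:
        """x has shape (batch, d_in); the update is applied without forming delta W."""
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self) -> np.ndarray:
        """Fold the adapter into one matrix so inference needs no extra matmuls."""
        return self.W + self.scale * (self.B @ self.A)
```

At initialization the layer reproduces the frozen weight exactly because `A` is zero. After training, `merged_weight()` can replace `W`, so the adapter adds no inference-time cost.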

Integration is straightforward with transformer frameworks (Hugging Face, PEFT), requiring only replacement of LoRA modules with MiCA modules on the adapted weight matrices. SVD and minor vector extraction incur only a one-time cost per layer. MiCA’s parameter and computational efficiency makes it amenable to federated learning and on-device adaptation settings.

7. Summary and Significance

Minor Component Adaptation is a parameter-efficient fine-tuning methodology that exploits the latent capacity of minor singular directions in model weight matrices. By focusing adaptation within these subspaces, MiCA integrates new factual knowledge more efficiently than both full fine-tuning and LoRA, with empirically demonstrated superiority in both learning efficiency and model stability. The constraint to minor singular directions prevents interference with core model capabilities, providing an effective mechanism for domain adaptation with a minimal parameter and compute footprint (Rüdiger et al., 2 Apr 2026).
