
Khatri–Rao Product Adapters (KRAdapter)

Updated 25 February 2026
  • KRAdapter is a parameter-efficient fine-tuning method that uses the Khatri–Rao product to construct weight updates with enhanced effective rank.
  • It builds structured update matrices via two factor matrices, addressing limitations of conventional low-rank adaptation for improved spectral fidelity.
  • Empirical results demonstrate consistent performance gains over LoRA, with superior OOD generalization and maintained compute efficiency in vision and language tasks.

Khatri–Rao Product Adapters (KRAdapter) are a parameter-efficient fine-tuning (PEFT) method designed to improve the effective rank of learned weight updates when adapting large pretrained models. KRAdapter uses the Khatri–Rao product to construct update matrices that, by construction, tend to exhibit higher effective rank than the low-rank updates of methods such as LoRA. This approach addresses limitations of conventional low-rank adaptation in preserving spectral properties that matter for multimodal models and large language models (LLMs). It demonstrates consistent improvements in out-of-distribution (OOD) generalization and performance on diverse benchmarks while retaining the memory and compute efficiency characteristic of PEFT methods (Albert et al., 1 Aug 2025).

1. Mathematical Foundations: Khatri–Rao Product

The Khatri–Rao product is a column-wise Kronecker product of two matrices. If $U\in\mathbb{R}^{a\times c}$ and $V\in\mathbb{R}^{b\times c}$, their Khatri–Rao product $U\odot V$ is defined as

$$U\odot V = \begin{bmatrix} u_{11}v_1 & u_{12}v_2 & \cdots & u_{1c}v_c \\ u_{21}v_1 & u_{22}v_2 & \cdots & u_{2c}v_c \\ \vdots & \vdots & \ddots & \vdots \\ u_{a1}v_1 & u_{a2}v_2 & \cdots & u_{ac}v_c \end{bmatrix} \in \mathbb{R}^{(ab)\times c}$$

where $u_{ij}$ denotes the $(i,j)$-entry of $U$ and $v_j$ denotes the $j$th column of $V$.

Unlike the standard matrix product, which for $U\in\mathbb{R}^{a\times d}$ and $V\in\mathbb{R}^{d\times c}$ yields a result in $\mathbb{R}^{a\times c}$, and unlike the Kronecker product, which for $U\in\mathbb{R}^{a\times d}$ and $V\in\mathbb{R}^{b\times c}$ produces a block matrix in $\mathbb{R}^{(ab)\times(dc)}$, the Khatri–Rao product takes the Kronecker product of matching columns only and places the results side by side. This preserves a column-centric structure that is critical for high effective-rank updates.
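
As a concrete illustration of the definition, the following NumPy sketch builds the Khatri–Rao product by broadcasting and checks each column against the Kronecker product of the matching columns. The helper name `khatri_rao` and the small example matrices are chosen here for illustration and are not taken from the paper.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (a x c) and V (b x c), shape (a*b, c)."""
    a, c = U.shape
    b, c_v = V.shape
    if c != c_v:
        raise ValueError("U and V must have the same number of columns")
    # Entry [i, l, j] of the broadcast product is U[i, j] * V[l, j];
    # the reshape merges (i, l) into the single row index i * b + l.
    return (U[:, None, :] * V[None, :, :]).reshape(a * b, c)

U = np.array([[1.0, 2.0], [3.0, 4.0]])              # a = 2, c = 2
V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]])  # b = 3, c = 2
P = khatri_rao(U, V)

print(P.shape)  # (6, 2)
# Each column of P equals the Kronecker product of the matching columns.
for j in range(U.shape[1]):
    assert np.allclose(P[:, j], np.kron(U[:, j], V[:, j]))
```

The number of columns stays at $c$ while the number of rows grows to $ab$, which is the property the adapter relies on.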

2. Weight Update Construction in KRAdapter

KRAdapter modifies PEFT layers by expressing the fine-tuned weight as

$$W \leftarrow W_0 + \Delta W$$

with $W_0$ frozen and $\Delta W$ learned via a structured update. LoRA uses a low-rank factorization $\Delta W = BA$ with factors $A\in\mathbb{R}^{r\times d_{\mathrm{in}}}$, $B\in\mathbb{R}^{d_{\mathrm{out}}\times r}$. In contrast, KRAdapter introduces two factor matrices

$$U\in\mathbb{R}^{k_1\times d_{\mathrm{in}}}, \quad V\in\mathbb{R}^{k_2\times d_{\mathrm{in}}}, \quad \text{with } k_1k_2 \geq d_{\mathrm{out}}$$

and constructs the update as

$$\Delta W = \alpha \cdot (U\odot V)_{[:\,d_{\mathrm{out}},:]} \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$$

where $\alpha$ is a scaling factor and the resulting matrix is truncated to the first $d_{\mathrm{out}}$ rows. Theoretical results demonstrate that any matrix $W$ of rank $r$ can be recast as $\mathrm{vec}(W) = (\bar{V} \odot \bar{U})\sigma$ for an appropriate factorization, substantiating the expressive capacity of this parameterization.
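
The vectorization identity can be checked numerically. The sketch below assumes that the "appropriate factorization" is the compact singular value decomposition $W = \bar{U}\,\mathrm{diag}(\sigma)\,\bar{V}^\top$ and that $\mathrm{vec}(\cdot)$ stacks columns; the summary does not spell out either choice, so both are assumptions made for illustration.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: (p x r) and (q x r) give (p*q x r)."""
    return (A[:, None, :] * B[None, :, :]).reshape(A.shape[0] * B.shape[0], A.shape[1])

rng = np.random.default_rng(0)
m, n, r = 6, 5, 3
W = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # a rank-r matrix

# Compact SVD, keeping only the r leading singular triplets.
U_full, s_full, Vt_full = np.linalg.svd(W, full_matrices=False)
U_bar, sigma, V_bar = U_full[:, :r], s_full[:r], Vt_full[:r].T

vec_W = W.flatten(order="F")              # column-major vectorization
recon = khatri_rao(V_bar, U_bar) @ sigma  # Khatri-Rao product of V_bar and U_bar, times sigma

print(np.allclose(vec_W, recon))
```

The identity follows from $\mathrm{vec}(u v^\top) = v \otimes u$ applied to each rank-one term of the decomposition.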

3. Training and Implementation Protocol

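Training with KRAdapter involves freezing pretrained weights and updating only the $U$ and $V$ factors. The procedure for each parameterized layer is:

```python
import math
import torch

def khatri_rao(U, V):  # (k1, d_in), (k2, d_in) -> (k1*k2, d_in), column-wise Kronecker
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

d_in, d_out, k1, k2, alpha = 64, 49, 7, 7, 0.1   # illustrative sizes with k1*k2 >= d_out
W0 = torch.randn(d_out, d_in)                    # stand-in for the frozen pretrained weight
U = torch.zeros(k1, d_in, requires_grad=True)    # zero init, so delta_W starts at zero
bound = math.sqrt(1 / k1)
V = torch.empty(k2, d_in).uniform_(-bound, bound).requires_grad_()
opt = torch.optim.AdamW([U, V], lr=1e-4)

for x_in, target in [(torch.randn(d_in, 8), torch.randn(d_out, 8))]:  # placeholder data
    delta_W = alpha * khatri_rao(U, V)[:d_out]   # truncate to the first d_out rows
    h_out = (W0 + delta_W) @ x_in
    loss = torch.nn.functional.mse_loss(h_out, target)  # placeholder loss
    opt.zero_grad(); loss.backward(); opt.step()        # updates only U and V
```

Initialization zeroes $U$ and samples $V$ from a Kaiming uniform distribution. Only $U$ and $V$ are trainable, preserving the efficiency of the underlying PEFT paradigm. The tensor sizes, random data, and mean-squared-error loss in the listing are placeholders for illustration.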

4. Effective Rank and Comparative Analysis

The effective rank of a matrix $M$ with singular values $\{s_i\}$ is given by

$$r_{\mathrm{eff}}(M) = \exp\left(-\sum_i q_i \ln q_i\right),\quad q_i = \frac{s_i}{\sum_j s_j}$$

A nearly flat singular value spectrum correlates with high effective rank.

KRAdapter achieves higher effective rank than LoRA and its variants. Theoretical analysis indicates that for $U, V \in \mathbb{R}^{k\times d_{\mathrm{in}}}$ with entries drawn i.i.d. (e.g., Gaussian or uniform) and $k^2 \ge d_{\mathrm{in}}$, the Khatri–Rao product achieves full column rank almost surely, a property not shared by LoRA’s rank-$r$ construction.
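
To make the effective-rank definition and the full-column-rank claim concrete, the following NumPy sketch compares a random Khatri–Rao update with a random LoRA-style product that has the same number of trainable values. The sizes, the Gaussian sampling, and the helper names are assumptions made for illustration; the matrices are untrained and do not reproduce the paper's measurements.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (k1 x d) and (k2 x d) give (k1*k2 x d)."""
    return (U[:, None, :] * V[None, :, :]).reshape(U.shape[0] * V.shape[0], U.shape[1])

def effective_rank(M):
    """Exponential of the Shannon entropy of the normalized singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    q = s / s.sum()
    q = q[q > 0]  # treat 0 * ln(0) as 0
    return float(np.exp(-(q * np.log(q)).sum()))

rng = np.random.default_rng(0)
d_in = d_out = 64
k, r = 8, 8  # k**2 >= d_in; both parameterizations use 2 * 8 * 64 = 1024 values

kr_update = khatri_rao(rng.standard_normal((k, d_in)),
                       rng.standard_normal((k, d_in)))[:d_out]
lora_update = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))

print("rank:", np.linalg.matrix_rank(kr_update), "vs", np.linalg.matrix_rank(lora_update))
print("effective rank:", effective_rank(kr_update), "vs", effective_rank(lora_update))
```

The LoRA-style product can never exceed rank $r$, so its effective rank is bounded by $r$ as well, whereas the Khatri–Rao update is not subject to that bound.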

Empirically, across all tested vision (ViT-B/32, L/14, H/14) and language (Llama3-8B, Qwen2.5-7B) heads, KRAdapter produces consistently higher effective ranks in adapter updates, supporting improved retention of complex spectral characteristics (Albert et al., 1 Aug 2025).

5. Empirical Results and Benchmark Performance

KRAdapter was benchmarked across several domains:

  • Synthetic Matrix Approximation: On six target matrix types (random, sparse, PCA-whitened, low-rank, CLIP-tuned, frequency-controlled), KRAdapter outperformed all PEFT baselines except when approximating explicitly low-rank targets, as measured by squared nuclear reconstruction error.
  • Vision-language Fine-tuning: On CLIP-ViT models across 11 classification datasets (few-shot and 50–100% data), KRAdapter achieved mean accuracies ∼5 points higher than LoRA and ∼1 point higher than other full-rank PEFT methods.
  • OOD Robustness: For ImageNet (in-distribution) and ImageNet‐A/S/R/V2 and CIFAR-100 (OOD), KRAdapter demonstrated the highest generalization ratio ($r_{\mathrm{gen}}$) and the smallest nuclear and Frobenius norms of the update.
  • LLM Commonsense: On 4-bit quantized Llama3.1-8B and Qwen2.5-7B, fine-tuned for multiple-choice reasoning, KRAdapter yielded top OOD performance (e.g., BoolQ, PiQA, WinoGrande) while matching in-distribution performance.

For exhaustive dataset-level breakdowns, see Tables 1–3 and Appendix D–E in the original paper (Albert et al., 1 Aug 2025).

6. Memory Footprint and Computational Efficiency

KRAdapter introduces an update matrix $\Delta W$ whose number of trainable parameters is

$$N_{\mathrm{KR}} = d_{\mathrm{in}}(k_1 + k_2), \quad \text{where } k_1k_2 \approx d_{\mathrm{out}}$$

This count is minimized for $k_1=k_2=\sqrt{d_{\mathrm{out}}}$, yielding $N_{\mathrm{KR}} \approx 2 d_{\mathrm{in}}\sqrt{d_{\mathrm{out}}}$. This is significantly less than a full-rank update ($d_{\mathrm{in}} d_{\mathrm{out}}$) and closely matches LoRA’s parameter count $N_{\mathrm{LoRA}}=2r d_{\mathrm{in}}$ (for a square layer with $d_{\mathrm{in}} = d_{\mathrm{out}}$) at typical LoRA ranks ($r \approx 16$–$32$).
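
A short calculation makes the comparison tangible. The sketch below evaluates $d_{\mathrm{in}}(k_1 + k_2)$ for square layers of width 768 and 1024 and prints LoRA and full-rank counts next to it. Rounding $\sqrt{d_{\mathrm{out}}}$ up when it is not an integer is an assumption made here so that $k_1 k_2 \ge d_{\mathrm{out}}$ holds, and the function names are hypothetical.

```python
import math

def kradapter_params(d_in, d_out):
    """Trainable values for k1 = k2 = ceil(sqrt(d_out)), so that k1 * k2 >= d_out."""
    k = math.isqrt(d_out)
    if k * k < d_out:
        k += 1  # round up when d_out is not a perfect square (assumption)
    return d_in * (k + k)

def lora_params(d_in, d_out, r):
    """Trainable values for LoRA factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

for d in (768, 1024):
    print(f"d={d}: KRAdapter={kradapter_params(d, d)}, "
          f"LoRA r=16: {lora_params(d, d, 16)}, "
          f"LoRA r=32: {lora_params(d, d, 32)}, "
          f"full={d * d}")
```

For a 1024-wide square layer the Khatri–Rao factors use 65,536 values, the same as LoRA at $r = 32$ and one sixteenth of the 1,048,576 values in a full update.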

Floating point operation counts mirror those of LoRA. Empirical measurements (Appendix F, Table F.1) report nearly identical VRAM usage and epoch durations compared to LoRA, differing by only 1–2 minutes per epoch on transformers up to 8B parameters.

7. Implementation Guidelines and Limitations

Default hyperparameters:

  • Scaling factor $\alpha = 0.1$ (vision) or $\alpha = 2$ (quantized LLMs).
  • Learning rate: $10^{-2}$ for synthetic experiments, $10^{-4}$ (AdamW) for real tasks.
  • Initialization: $U \leftarrow 0$, $V \sim$ KaimingUniform($-\sqrt{1/k_1}, +\sqrt{1/k_1}$).
  • Recommended shape: $k_1 = k_2 = \sqrt{d_{\mathrm{out}}}$ when $d_{\mathrm{out}}$ is a perfect square; otherwise the factors are rounded so that $k_1 k_2 \geq d_{\mathrm{out}}$ still holds, as the row truncation requires.

Integration:

  • Implementable in PyTorch by subclassing nn.Linear or compatible attention modules, adding $U$, $V$ parameter blocks, and modifying the forward pass to inject $\Delta W$.
  • Analogs in TensorFlow and JAX involve inserting a Khatri–Rao layer.
  • Official code is available at https://github.com/PaulAlbert31/KRAdapter.
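
The PyTorch integration described above can be sketched as follows. This is a minimal illustration, not the official implementation: the class name `KRLinear`, the choice to freeze the bias, the rounding of $k$, and the default $\alpha$ are all assumptions made here, and the linked repository should be consulted for the authors' code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class KRLinear(nn.Linear):
    """Hypothetical nn.Linear subclass that adds a Khatri-Rao weight update."""

    def __init__(self, in_features, out_features, alpha=0.1, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.alpha = alpha
        k = math.isqrt(out_features)
        if k * k < out_features:
            k += 1  # round up so that k * k >= out_features (assumption)
        # Freeze the pretrained weight and bias; only U and V are trained.
        self.weight.requires_grad_(False)
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.U = nn.Parameter(torch.zeros(k, in_features))  # zero init
        bound = math.sqrt(1.0 / k)
        self.V = nn.Parameter(torch.empty(k, in_features).uniform_(-bound, bound))

    def delta_weight(self):
        # Column-wise Kronecker product, truncated to the first out_features rows.
        kr = (self.U[:, None, :] * self.V[None, :, :]).reshape(-1, self.in_features)
        return self.alpha * kr[: self.out_features]

    def forward(self, x):
        return F.linear(x, self.weight + self.delta_weight(), self.bias)

layer = KRLinear(64, 48)
x = torch.randn(4, 64)
y = layer(x)
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(y.shape, trainable)
```

Because $U$ starts at zero, the layer initially reproduces the output of the frozen linear layer exactly. In practice the pretrained weights would be copied into `weight` and `bias` before training, for example with `load_state_dict(..., strict=False)`.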

Stability tips:

  • Zero-initialize $U$ to ensure $\Delta W$ begins at zero.
  • Use modest $\alpha$ to prevent disturbance of pretrained activations.
  • If numerical instability occurs, reduce learning rate or increase weight decay; optionally apply gradient norm clipping.

Limitations:

  • The minimum number of trainable parameters, $\sim 2 d_{\mathrm{in}}\sqrt{d_{\mathrm{out}}}$, exceeds that of the LoRA rank-1 case.
  • KRAdapter is less optimal when the optimal update is genuinely low-rank (e.g., artificial low-rank matrices).
  • In pure in-distribution regimes with ample data, RandLoRA can occasionally match or slightly exceed KRAdapter on certain targets, but KRAdapter remains superior in OOD robustness and convergence speed.

KRAdapter preserves the operational simplicity and resource efficiency of LoRA. By leveraging the Khatri–Rao product, it constructs weight updates with much higher effective rank, leading to enhanced spectral fidelity and robustness in challenging multimodal and LLM adaptation tasks (Albert et al., 1 Aug 2025).
