Khatri–Rao Product Adapters (KRAdapter)

Updated 25 February 2026

KRAdapter is a parameter-efficient fine-tuning method that uses the Khatri–Rao product to construct weight updates with enhanced effective rank.
It builds structured update matrices via two factor matrices, addressing limitations of conventional low-rank adaptation for improved spectral fidelity.
Empirical results demonstrate consistent performance gains over LoRA, with superior OOD generalization and maintained compute efficiency in vision and language tasks.

Khatri–Rao Product Adapters (KRAdapter) are a parameter-efficient fine-tuning (PEFT) method designed to improve the effective rank of learned weight updates when adapting large pretrained models. KRAdapter uses the Khatri–Rao product to construct update matrices that, by construction, tend to exhibit higher effective rank than the low-rank updates of methods such as LoRA. This approach addresses limitations of conventional low-rank adaptation in maintaining spectral properties that are critical for multimodal models and large language models (LLMs). It shows consistent improvements in out-of-distribution (OOD) generalization and in performance on diverse benchmarks, while retaining the memory and compute efficiency characteristic of PEFT methods (Albert et al., 1 Aug 2025).

1. Mathematical Foundations: Khatri–Rao Product

The Khatri–Rao product is a column-wise Kronecker product of two matrices. If $U\in\mathbb{R}^{a\times c}$ and $V\in\mathbb{R}^{b\times c}$, their Khatri–Rao product $U\odot V$ is defined as: $U\odot V = \begin{bmatrix} u_{11}v_1 & u_{12}v_2 & \cdots & u_{1c}v_c \\ u_{21}v_1 & u_{22}v_2 & \cdots & u_{2c}v_c \\ \vdots & \vdots & \ddots & \vdots \\ u_{a1}v_1 & u_{a2}v_2 & \cdots & u_{ac}v_c \end{bmatrix} \in \mathbb{R}^{(ab)\times c}$ where $u_{ij}$ denotes the $(i,j)$-entry of $U$ and $v_j$ denotes the $j$th column of $V$.
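The definition can be illustrated with a short NumPy sketch. It assumes the row ordering shown above, where block $i$ of column $j$ is $u_{ij}v_j$; the function name `khatri_rao` and the example matrices are illustrative and are not taken from the paper's code.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (a x c) and V (b x c), giving (a*b) x c."""
    a, c = U.shape
    b, c_v = V.shape
    if c != c_v:
        raise ValueError("U and V must have the same number of columns")
    # Entry [i, l, j] is U[i, j] * V[l, j]; flattening (i, l) gives row i*b + l.
    return (U[:, None, :] * V[None, :, :]).reshape(a * b, c)

U = np.array([[1.0, 2.0],
              [3.0, 4.0]])
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])
P = khatri_rao(U, V)
print(P.shape)  # (6, 2)
```

Each column of `P` is the Kronecker product of the matching columns of `U` and `V`, so the column count is preserved while the row count multiplies.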

The standard matrix product of $U\in\mathbb{R}^{a\times d}$ and $V\in\mathbb{R}^{d\times c}$ yields a result in $\mathbb{R}^{a\times c}$, and the Kronecker product of $U\in\mathbb{R}^{a\times d}$ and $V\in\mathbb{R}^{b\times c}$ produces a block matrix in $\mathbb{R}^{(ab)\times(dc)}$. The Khatri–Rao product instead requires both factors to have the same number of columns and stacks only the Kronecker products of matching columns, preserving a column-centric structure that is critical for high effective-rank updates.
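One way to see the relationship is that the Khatri–Rao product keeps only the "matched" columns of the full Kronecker product. The sketch below checks this numerically; the sizes and random inputs are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 3, 4, 5
U = rng.normal(size=(a, c))
V = rng.normal(size=(b, c))

# Kronecker product: one column for every pair (j1, j2), at index j1 * c + j2.
kron = np.kron(U, V)

# Khatri-Rao product: keep only the pairs with j1 == j2, i.e. every (c + 1)-th column.
kr = kron[:, :: c + 1]

# The same matrix built column by column from the definition.
kr_direct = np.stack([np.kron(U[:, j], V[:, j]) for j in range(c)], axis=1)

print(kron.shape, kr.shape)  # (12, 25) (12, 5)
```

Forming the full Kronecker product is wasteful in practice; it appears here only to make the comparison explicit.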

2. Weight Update Construction in KRAdapter

KRAdapter modifies PEFT layers by expressing the fine-tuned weight as

$W \leftarrow W_0 + \Delta W$

with $W_0$ frozen and $\Delta W$ learned via a structured update. LoRA uses a low-rank factorization $\Delta W = B A$ with factors $A\in\mathbb{R}^{r\times d_{\mathrm{in}}}$ , $B\in\mathbb{R}^{d_{\mathrm{out}}\times r}$ . In contrast, KRAdapter introduces two factor matrices: $U\in\mathbb{R}^{k_1\times d_{\mathrm{in}}}, \quad V\in\mathbb{R}^{k_2\times d_{\mathrm{in}}}, \quad \text{with } k_1k_2 \geq d_{\mathrm{out}}$ and constructs the update as

$\Delta W = \alpha \cdot (U\odot V)_{[:\,d_{\mathrm{out}},:]} \in \mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$

where $\alpha$ is a scaling factor and the resulting matrix is truncated to its first $d_{\mathrm{out}}$ rows. The paper's theoretical analysis shows that any matrix $W$ of rank $r$ can be written as $\mathrm{vec}(W) = (\bar{V} \odot \bar{U})\sigma$ for a suitable factorization (for instance, the singular value decomposition $W = \bar{U}\,\mathrm{diag}(\sigma)\,\bar{V}^\top$ with column-major vectorization), which supports the expressive capacity of this parameterization.
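The vectorization identity can be checked numerically. The sketch below assumes that $\bar{U}$, $\bar{V}$, and $\sigma$ are the singular vectors and singular values of $W$ and that $\mathrm{vec}$ stacks columns; the text does not spell out either convention, so both are assumptions, as is the helper name `khatri_rao`.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (a x c) and V (b x c)."""
    a, c = U.shape
    b, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(a * b, c)

rng = np.random.default_rng(0)
m, n, r = 5, 4, 2
W = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # a rank-r matrix

U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U_bar, sigma, V_bar = U_full[:, :r], s[:r], Vt[:r].T    # keep the r nonzero terms

vec_W = W.flatten(order="F")                            # column-major vec(W)
vec_from_kr = khatri_rao(V_bar, U_bar) @ sigma          # (V_bar ⊙ U_bar) sigma

print(np.allclose(vec_W, vec_from_kr))  # True
```

The identity follows from $\mathrm{vec}(u v^\top) = v \otimes u$ applied to each rank-one term of the decomposition.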

3. Training and Implementation Protocol

Training with KRAdapter involves freezing pretrained weights and updating only the $U$ and $V$ factors. The procedure for each parameterized layer is:

U = zeros([k1, din])
V = KaimingUniform(-sqrt(1/k1), sqrt(1/k1), [k2, din])

for each step:
    x_in = input_activations
    delta_W = alpha * truncate_rows(U ⊙ V, dout)
    h_out = (W0 + delta_W) @ x_in
    # backpropagate loss
    update U and V (AdamW)

Initialization zeroes $U$ and samples $V$ from a Kaiming uniform distribution. Only $U$ and $V$ are trainable, preserving the efficiency of the underlying PEFT paradigm.
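The procedure above can be made concrete with a toy NumPy sketch. Several things here are assumptions made for illustration and are not from the paper: the regression task, the tiny layer sizes, $\alpha = 1$, hand-written gradients, and plain gradient descent standing in for AdamW.

```python
import numpy as np

def khatri_rao(U, V):
    k1, d = U.shape
    k2, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(k1 * k2, d)

def delta_w(U, V, d_out, alpha):
    # Scaled Khatri-Rao product truncated to the first d_out rows.
    return alpha * khatri_rao(U, V)[:d_out]

def grads(U, V, G, alpha):
    """Gradients w.r.t. U and V, given G = dLoss/dDeltaW of shape (d_out, d_in)."""
    k1, d_in = U.shape
    k2 = V.shape[0]
    G_full = np.zeros((k1 * k2, d_in))
    G_full[: G.shape[0]] = G                  # truncated rows receive no gradient
    G3 = G_full.reshape(k1, k2, d_in)
    gU = alpha * (G3 * V[None, :, :]).sum(axis=1)
    gV = alpha * (G3 * U[:, None, :]).sum(axis=0)
    return gU, gV

rng = np.random.default_rng(0)
d_in, d_out, k1, k2, alpha, lr = 6, 8, 3, 3, 1.0, 0.05   # toy values (assumed)
W0 = rng.normal(size=(d_out, d_in))                      # frozen pretrained weight
W_target = W0 + 0.1 * rng.normal(size=(d_out, d_in))     # synthetic target
X = rng.normal(size=(d_in, 64))
Y = W_target @ X

U = np.zeros((k1, d_in))                                 # zero init: delta_W starts at 0
bound = np.sqrt(1.0 / k1)
V = rng.uniform(-bound, bound, size=(k2, d_in))

def loss_fn(U, V):
    H = (W0 + delta_w(U, V, d_out, alpha)) @ X
    return 0.5 * np.mean(np.sum((H - Y) ** 2, axis=0))

loss_start = loss_fn(U, V)
for step in range(200):
    H = (W0 + delta_w(U, V, d_out, alpha)) @ X
    G = (H - Y) @ X.T / X.shape[1]                       # dLoss/dDeltaW
    gU, gV = grads(U, V, G, alpha)
    U -= lr * gU                                         # plain gradient descent, not AdamW
    V -= lr * gV
loss_end = loss_fn(U, V)
print(loss_start, loss_end)
```

Because $U$ starts at zero, the first step moves only $U$: the gradient with respect to $V$ is proportional to $U$ and therefore vanishes at initialization.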

4. Effective Rank and Comparative Analysis

The effective rank of a matrix $M$ with singular values $\{s_i\}$ is given by

$r_{\mathrm{eff}}(M) = \exp\left(-\sum_i q_i \ln q_i\right),\quad q_i = \frac{s_i}{\sum_j s_j}$

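A nearly flat singular value spectrum yields a high effective rank: $n$ equal nonzero singular values give $r_{\mathrm{eff}} = n$, while a spectrum dominated by a single singular value gives $r_{\mathrm{eff}} \approx 1$.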
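A minimal NumPy sketch of this definition follows. The function name `effective_rank` and the two test matrices are illustrative, and zero singular values are dropped under the usual convention $0 \ln 0 = 0$.

```python
import numpy as np

def effective_rank(M):
    """exp of the Shannon entropy of the normalized singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    q = s / s.sum()
    q = q[q > 0]                      # convention: 0 * ln(0) contributes nothing
    return float(np.exp(-(q * np.log(q)).sum()))

flat = np.eye(8)                                       # eight equal singular values
peaked = np.outer(np.arange(1.0, 9.0), np.arange(1.0, 9.0))  # a rank-one matrix

print(effective_rank(flat))    # close to 8
print(effective_rank(peaked))  # close to 1
```

Unlike the algebraic rank, this quantity varies smoothly with the spectrum, so it can distinguish a full-rank update dominated by a few directions from one that spreads its energy evenly.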

KRAdapter achieves a higher effective rank than LoRA and its variants. Theoretical analysis indicates that for $U,V$ with entries drawn i.i.d. (e.g., Gaussian or uniform) and $k_1 = k_2 = k$ with $k^2 \ge d_{\mathrm{in}}$, the Khatri–Rao product $U\odot V$ has full column rank almost surely, whereas a LoRA update $BA$ has rank at most $r$ by construction.
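The rank claim can be probed on a small example. The sketch below uses Gaussian factors and sizes chosen so that both parameterizations have the same number of parameters (128); these sizes, the seed, and the helper name `khatri_rao` are assumptions for illustration only.

```python
import numpy as np

def khatri_rao(U, V):
    a, c = U.shape
    b, _ = V.shape
    return (U[:, None, :] * V[None, :, :]).reshape(a * b, c)

rng = np.random.default_rng(0)
d_in = d_out = 16
k, r = 4, 4                      # k * k == d_in, so no rows are truncated

# Khatri-Rao update: d_in * (k + k) = 128 parameters.
U = rng.normal(size=(k, d_in))
V = rng.normal(size=(k, d_in))
kr_update = khatri_rao(U, V)[:d_out]

# LoRA update: r * (d_in + d_out) = 128 parameters.
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
lora_update = B @ A

rank_kr = np.linalg.matrix_rank(kr_update)
rank_lora = np.linalg.matrix_rank(lora_update)
print(rank_kr, rank_lora)        # full rank (16) expected for the Khatri-Rao update; at most 4 for LoRA
```

At an equal parameter budget, the LoRA update is confined to rank $r$, while generic Khatri–Rao factors reach the full rank of the layer.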

Empirically, across all tested vision (ViT-B/32, L/14, H/14) and language (Llama3-8B, Qwen2.5-7B) heads, KRAdapter produces consistently higher effective ranks in adapter updates, supporting improved retention of complex spectral characteristics (Albert et al., 1 Aug 2025).

5. Empirical Results and Benchmark Performance

KRAdapter was benchmarked across several domains:

Synthetic Matrix Approximation: On six target matrix types (random, sparse, PCA-whitened, low-rank, CLIP-tuned, frequency-controlled), KRAdapter outperformed all PEFT baselines except when approximating explicitly low-rank targets, as measured by squared nuclear reconstruction error.
Vision-language Fine-tuning: On CLIP-ViT models across 11 classification datasets (few-shot and 50–100% data), KRAdapter achieved mean accuracies ∼5 points higher than LoRA and ∼1 point higher than other full-rank PEFT methods.
OOD Robustness: For ImageNet (in-distribution) and ImageNet‐A/S/R/V2 and CIFAR-100 (OOD), KRAdapter demonstrated the highest generalization ratio ( $r_{\mathrm{gen}}$ ), and minimal nuclear/Frobenius update norm.
LLM Commonsense: On 4-bit quantized Llama3.1-8B and Qwen2.5-7B, fine-tuned for multiple-choice reasoning, KRAdapter yielded top OOD performance (e.g., BoolQ, PiQA, WinoGrande), while matching in-distribution performance.

For exhaustive dataset-level breakdowns, see Tables 1–3 and Appendix D–E in the original paper (Albert et al., 1 Aug 2025).

6. Memory Footprint and Computational Efficiency

KRAdapter introduces an update matrix $\Delta W$ parameterized by

$N_{\mathrm{KR}} = d_{\mathrm{in}}(k_1 + k_2), \quad \text{where } k_1k_2 \approx d_{\mathrm{out}}$

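This is minimized for $k_1=k_2=\sqrt{d_{\mathrm{out}}}$, yielding $N_{\mathrm{KR}} \approx 2 d_{\mathrm{in}}\sqrt{d_{\mathrm{out}}}$. This is significantly less than a full-rank update ($d_{\mathrm{in}} d_{\mathrm{out}}$) and is comparable to LoRA's parameter count $N_{\mathrm{LoRA}} = r(d_{\mathrm{in}} + d_{\mathrm{out}})$, which equals $2r d_{\mathrm{in}}$ for square layers, at typical LoRA ranks ($r \approx 16$–32). The two counts coincide when $r = \sqrt{d_{\mathrm{out}}}$, for example $r = 32$ for $d_{\mathrm{out}} = 1024$.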
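As a worked example of these counts, the sketch below evaluates $N_{\mathrm{KR}} = d_{\mathrm{in}}(k_1 + k_2)$ alongside the standard LoRA count $r(d_{\mathrm{in}} + d_{\mathrm{out}})$ for a square layer. The layer width of 1024 and the function names are illustrative assumptions.

```python
import math

def kr_params(d_in, k1, k2):
    """Trainable parameters of the two Khatri-Rao factors U and V."""
    return d_in * (k1 + k2)

def lora_params(d_in, d_out, r):
    """Trainable parameters of LoRA's A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 1024
k = math.isqrt(d_out)                   # 32, and 32 * 32 == 1024

print(kr_params(d_in, k, k))            # 65536
print(lora_params(d_in, d_out, 32))     # 65536
print(lora_params(d_in, d_out, 16))     # 32768
print(d_in * d_out)                     # 1048576 (full-rank update)
```

For this layer, KRAdapter matches a rank-32 LoRA adapter in size and uses one sixteenth of the parameters of a full update.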

Floating point operation counts mirror those of LoRA. Empirical measurements (Appendix F, Table F.1) report nearly identical VRAM usage and epoch durations compared to LoRA, differing by only 1–2 minutes per epoch on transformers up to 8B parameters.

7. Implementation Guidelines and Limitations

Default hyperparameters:

Scaling factor: $\alpha = 0.1$ (vision) or $\alpha = 2$ (quantized LLMs).
Learning rate: $10^{-2}$ for synthetic experiments, $10^{-4}$ (AdamW) for real tasks.
Initialization: $U \leftarrow 0$ , $V \sim$ KaimingUniform( $-\sqrt{1/k_1},+\sqrt{1/k_1}$ ).
Recommended shape: $k_1 = k_2 = \lceil \sqrt{d_{\mathrm{out}}} \rceil$, rounding up when $d_{\mathrm{out}}$ is not a perfect square so that $k_1 k_2 \ge d_{\mathrm{out}}$ holds.
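A small helper can pick the factor size for a given output width. This sketch assumes that rounding is done upward so that the constraint $k_1 k_2 \ge d_{\mathrm{out}}$ from Section 2 is satisfied; the function name and the example widths are hypothetical.

```python
import math

def square_factor_shape(d_out):
    """Smallest k with k * k >= d_out, used for k1 = k2 = k."""
    k = math.isqrt(d_out)
    return k if k * k == d_out else k + 1

for d_out in (768, 1024, 4096):
    k = square_factor_shape(d_out)
    # Third value: rows of the Khatri-Rao product discarded by truncation.
    print(d_out, k, k * k - d_out)
```

For $d_{\mathrm{out}} = 768$ the floor of the square root is 27, which gives only 729 rows, so the helper returns 28 and 16 rows are truncated.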

Integration:

Implementable in PyTorch by subclassing nn.Linear or compatible attention modules, adding $U, V$ parameter blocks, and modifying the forward pass to inject $\Delta W$ .
Analogs in TensorFlow and JAX involve inserting a Khatri–Rao layer.
Official code is available at https://github.com/PaulAlbert31/KRAdapter.
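The PyTorch integration described in this list might look like the following sketch. It is not the official implementation: the class name `KRLinear`, the choice to freeze the bias, and the square factor shape are assumptions, and the initialization bounds follow the defaults listed above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class KRLinear(nn.Linear):
    """nn.Linear with a frozen base weight plus a Khatri-Rao update (hypothetical name)."""

    def __init__(self, in_features, out_features, alpha=0.1, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        k = math.isqrt(out_features)
        if k * k < out_features:
            k += 1                                   # ensure k * k >= out_features
        self.alpha = alpha
        self.weight.requires_grad_(False)            # freeze pretrained weight
        if self.bias is not None:
            self.bias.requires_grad_(False)          # assumption: bias is frozen too
        bound = math.sqrt(1.0 / k)
        self.U = nn.Parameter(torch.zeros(k, in_features))
        self.V = nn.Parameter(torch.empty(k, in_features).uniform_(-bound, bound))

    def delta_weight(self):
        # Row i*k + l of the product is U[i, :] * V[l, :]; keep the first out_features rows.
        kr = (self.U[:, None, :] * self.V[None, :, :]).reshape(-1, self.in_features)
        return self.alpha * kr[: self.out_features]

    def forward(self, x):
        return F.linear(x, self.weight + self.delta_weight(), self.bias)

layer = KRLinear(6, 8)
x = torch.randn(4, 6)
print(layer(x).shape)  # torch.Size([4, 8])
```

In a real model, the pretrained weights would be copied into `weight` and `bias` before training, and only `U` and `V` would be passed to the optimizer.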

Stability tips:

Zero-initialize $U$ to ensure $\Delta W$ begins at zero.
Use modest $\alpha$ to prevent disturbance of pretrained activations.
If numerical instability occurs, reduce learning rate or increase weight decay; optionally apply gradient norm clipping.
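The gradient norm clipping mentioned in the last tip can be sketched in NumPy as a rescaling of the gradients of $U$ and $V$ by their joint norm. The threshold of 1.0 and the function name are illustrative assumptions, not values from the paper.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

gU = np.full((2, 3), 3.0)     # stand-in gradient for U
gV = np.full((2, 3), 4.0)     # stand-in gradient for V
clipped, norm_before = clip_by_global_norm([gU, gV], max_norm=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print(norm_before, norm_after)
```

Gradients whose joint norm is already below the threshold pass through unchanged, so clipping only affects the occasional large step.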

Limitations:

The minimal trainable parameter count, $\sim 2 d_{\mathrm{in}}\sqrt{d_{\mathrm{out}}}$, exceeds that of rank-1 LoRA ($d_{\mathrm{in}} + d_{\mathrm{out}}$).
KRAdapter is less suitable when the target update is genuinely low-rank (e.g., artificial low-rank matrices).
In pure in-distribution regimes with ample data, RandLoRA can occasionally match or slightly exceed KRAdapter on certain targets, but KRAdapter remains superior in OOD robustness and convergence speed.

KRAdapter preserves the operational simplicity and resource efficiency of LoRA. By leveraging the Khatri–Rao product, it constructs weight updates with much higher effective rank, leading to enhanced spectral fidelity and robustness in challenging multimodal and LLM adaptation tasks (Albert et al., 1 Aug 2025).

References (1)

Albert et al. (2025). Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product.
