
Gradient Multi-Subspace Tuning (GEMS)

Updated 21 January 2026
  • GEMS is a parameter-efficient fine-tuning framework that uses gradient multi-subspace decomposition to orthogonalize shared and task-specific updates, significantly reducing gradient conflicts.
  • It employs a null-space projection to preserve pre-trained semantics by constraining updates away from dominant general knowledge directions, avoiding knowledge drift.
  • Experimental evaluations on Qilin and Amazon datasets demonstrate substantial performance gains over existing baselines, highlighting its scalability and robustness for joint search and recommendation tuning.

Gradient Multi-Subspace Tuning (GEMS) is a parameter-efficient fine-tuning (PEFT) framework designed to unify item search and recommendation (S&R) using LLMs, addressing two fundamental problems: (1) task gradient conflicts from divergent optimization signals, and (2) undesirable drifts in the model's general-domain knowledge due to overfitting during multi-task adaptation. GEMS combines a multi-subspace gradient decomposition that orthogonalizes shared and task-specific updates, with a null-space projection that preserves pre-trained semantics by confining updates away from dominant general knowledge directions. These innovations collectively enable effective, scalable joint tuning of S&R, with extensive experimentation demonstrating significant empirical gains over prevailing baselines (Zhao et al., 14 Jan 2026).

1. Multi-Subspace Decomposition

GEMS disentangles optimization signals by decomposing gradients into low-rank subspaces corresponding to shared, search-specific, and recommendation-specific behaviors. At each step $t$, for each LLM layer parameter $W_t \in \mathbb{R}^{m \times n}$ and losses $\mathcal{L}_{\rm src}$ (search) and $\mathcal{L}_{\rm rec}$ (recommendation), GEMS computes three gradients:

  • $G_t^{\rm src} = -\nabla_{W_t}\mathcal{L}_{\rm src}$
  • $G_t^{\rm rec} = -\nabla_{W_t}\mathcal{L}_{\rm rec}$
  • $G_t^{\rm shared} = -\nabla_{W_t}(\mathcal{L}_{\rm src}+\mathcal{L}_{\rm rec})$

Every $T_{\rm svd}$ steps, each gradient matrix is decomposed by SVD:

$$G_t^{(\text{type})} = U_t^{(\text{type})}\,\Sigma_t^{(\text{type})}\,(V_t^{(\text{type})})^\top$$

where $\text{type} \in \{\text{shared}, \text{src}, \text{rec}\}$. The subspace basis $U_{t,r}^{(\text{type})} \in \mathbb{R}^{m\times r}$ (top-$r$ left-singular vectors) defines a task's effective adaptation region.

Gradients are projected into these subspaces:

$$G_t^{(\text{type},\,r)} = (U_{t,r}^{(\text{type})})^\top G_t^{(\text{type})} \in \mathbb{R}^{r\times n}$$

Adam-style updates are performed in the subspace (yielding $\Delta_t^{(\text{type},\,r)}$) and then mapped back:

$$\Delta_t^{(\text{type})} = \alpha\,U_{t,r}^{(\text{type})}\,\Delta_t^{(\text{type},\,r)}$$

with global scaling factor $\alpha$.
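The projection, subspace Adam step, and back-mapping above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `subspace_tune`, the `state` dict, and the omission of Adam bias correction are all simplifications.

```python
import numpy as np

def subspace_tune(G, state, r=8, alpha=1.0, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8, refresh=False):
    """One subspace step for a single gradient type (shared / src / rec).

    G: (m, n) gradient matrix for this type.
    state: dict caching the basis U_r and the Adam moments M, V.
    """
    if refresh or "U_r" not in state:
        # Every T_svd steps: recompute the top-r left-singular vectors of G.
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        state["U_r"] = U[:, :r]                    # (m, r) subspace basis
        state["M"] = np.zeros((r, G.shape[1]))     # reset first moment
        state["V"] = np.zeros((r, G.shape[1]))     # reset second moment
    U_r = state["U_r"]
    G_r = U_r.T @ G                                # project: (r, n)
    # Adam moments are maintained entirely in the low-rank subspace.
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_r
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_r**2
    step_r = lr * state["M"] / (np.sqrt(state["V"]) + eps)
    # Map the subspace step back to the full parameter space.
    return alpha * (U_r @ step_r)                  # (m, n)
```

By construction, the returned update lies entirely in the span of the cached basis, which is what keeps the three per-type updates in largely non-overlapping regions.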

A gating MLP, operating on normalized task loss ratios, gradient norms, and batch sizes, computes task weights $(\alpha_{\rm src}, \alpha_{\rm rec})$ with $\alpha_{\rm src}+\alpha_{\rm rec}=1$, and the fused update is

$$\Delta_t^{\rm fuse} = \Delta_t^{\rm shared} + \alpha_{\rm src}\,\Delta_t^{\rm src} + \alpha_{\rm rec}\,\Delta_t^{\rm rec}$$

Because subspaces are constructed from distinct gradient statistics, their overlaps are minimal, and GEMS reduces destructive layerwise gradient conflict by more than 85% compared to LoRA.
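The gating step can be sketched as a one-hidden-layer MLP whose two outputs are passed through a softmax so the weights sum to one. The layer sizes, feature ordering, and function name here are illustrative assumptions, not specified by the source:

```python
import numpy as np

def gate_weights(features, W1, b1, W2, b2):
    """Tiny gating MLP: feature vector -> two logits -> softmax weights.

    features: e.g. normalized loss ratios, gradient norms, batch sizes
    (the exact feature set and dimensions are assumptions).
    """
    h = np.tanh(features @ W1 + b1)       # hidden layer
    logits = h @ W2 + b2                  # one logit per task
    e = np.exp(logits - logits.max())     # numerically stable softmax
    w = e / e.sum()
    return w[0], w[1]                     # alpha_src + alpha_rec == 1
```

The softmax guarantees the convexity constraint on the two task weights regardless of how the MLP parameters are trained.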

2. Null-Space Projection

To avert degradation of broad-domain reasoning and preserve the LLM's pre-trained intent understanding, GEMS projects updates onto the null space of the principal directions derived from a large general-domain corpus $\mathcal{C}$ (e.g., Wikipedia).

Given representations $F \in \mathbb{R}^{n \times C}$ for each layer, with eigendecomposition $F F^\top = U_{\rm pre} \Sigma_{\rm pre} U_{\rm pre}^\top$, the top-$k$ left-singular vectors $U_{\rm pre}^k \in \mathbb{R}^{n\times k}$ span the directions carrying most of the LLM's prior knowledge. GEMS constructs the null-space projector

$$P_\perp = I - U_{\rm pre}^k (U_{\rm pre}^k)^\top$$

and applies it after gradient fusion:

$$\Delta_t^{\rm final} = P_\perp\,\Delta_t^{\rm fuse}$$

This operation excises update components that could overwrite global semantic competence, sharply reducing user intent “shifts” and halving the rate of “correct-before → incorrect-after” failures compared to LoRA.
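A minimal sketch of building and sanity-checking the projector, assuming the SVD of the representation matrix is affordable; `null_space_projector` is an illustrative name, not from the paper:

```python
import numpy as np

def null_space_projector(F, k):
    """Build P_perp = I - U_k U_k^T from general-corpus representations F (n x C).

    Any update multiplied by P_perp has zero component along the top-k
    principal directions of the pre-trained representations.
    """
    U, _, _ = np.linalg.svd(F, full_matrices=False)
    U_k = U[:, :k]                       # (n, k) dominant knowledge directions
    n = F.shape[0]
    return np.eye(n) - U_k @ U_k.T       # (n, n) orthogonal projector

# Applying it after fusion: delta_final = P_perp @ delta_fuse
```

Because $P_\perp$ is an orthogonal projector, it is idempotent and annihilates exactly the protected subspace, so the check reduces to two matrix identities.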

3. Optimization Procedure

The GEMS optimization loop, performed per layer in parallel, comprises:

  1. Sampling a minibatch with search and recommendation dual-task structure.
  2. Computing task losses and their gradients.
  3. Subspace decomposition and Adam updates within each subspace via SVD-derived projection.
  4. Adaptive fusion of shared and task-specific step directions using a learned gating mechanism.
  5. Null-space projection to confine updates away from general-domain knowledge vectors.
  6. Weight parameter update: $W \leftarrow W + \eta\,\Delta_t^{\rm final}$.

Gradient conflict is quantified via:

$$\rho = 1 - \frac{g_{\rm src} \cdot g_{\rm rec}}{\|g_{\rm src}\|\,\|g_{\rm rec}\|}$$

and is mitigated by the subspace routing.
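The conflict measure is one minus the cosine similarity between the flattened task gradients, so $\rho = 0$ for aligned gradients, $1$ for orthogonal ones, and $2$ for directly opposed ones:

```python
import numpy as np

def gradient_conflict(g_src, g_rec):
    """rho = 1 - cos(g_src, g_rec), computed on flattened gradients."""
    g_src, g_rec = g_src.ravel(), g_rec.ravel()
    cos = g_src @ g_rec / (np.linalg.norm(g_src) * np.linalg.norm(g_rec))
    return 1.0 - cos
```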

Pseudocode:

Initialize W, t ← 0; M, V ← 0
SubspaceTune(type, G, t):
    if t mod T_svd == 0: recompute U_r (SVD)
    Project: G^(r) = U_r^T G
    Update Adam states in r
    Compute step Δ^(r), then Δ = α U_r Δ^(r)
    return Δ

NullProject(Δ, U_pre^k): return (I - U_pre^k (U_pre^k)^T) Δ

Repeat:
    Sample minibatch {(u,H_u,q,i^*)}
    Compute L_src, L_rec
    Backprop to get G_src, G_rec, G_shared
    Δ_src ← SubspaceTune(src, G_src, t)
    Δ_rec ← SubspaceTune(rec, G_rec, t)
    Δ_shared ← SubspaceTune(shared, G_shared, t)
    Fuse: Δ_fuse = Δ_shared + α_src Δ_src + α_rec Δ_rec
    Null-project: Δ_final = NullProject(Δ_fuse, U_pre^k)
    W ← W + η Δ_final; t ← t + 1
Until convergence

4. Experimental Validation

Datasets:

  • Qilin: Real S/R logs from a social-media platform (3,816 users, 275K items).
  • Amazon Electronics 5-Core: Synthetic queries with real purchase interactions (62K users, 158K items).

Evaluation Metrics: Hit@5, Hit@10, NDCG@5, NDCG@10, with each ranking over 100 candidates (1 positive, 99 negatives).
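With a single positive item per 100-candidate list, both metric families reduce to simple functions of the positive item's 1-based rank; a minimal sketch (function names are illustrative):

```python
import math

def hit_at_k(rank, k):
    """1 if the single positive item appears in the top k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item the ideal DCG is 1, so NDCG@k = 1/log2(rank+1)
    when the positive is ranked within the top k, and 0 otherwise."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```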

Baselines:

  • S/R: NCF, TIGER, LETTER (rec); ANCE, WebUltron, GenRet (search)
  • Unified S/R: UnifiedSSR, BSR, Sem-BSR, GenSAR
  • PEFT: LoRA, LoRA-MoE

Backbones:

  • Flan-T5-base (full fine-tuning)
  • Qwen2.5-3B (PEFT)

Method            Qilin                           Amazon
                  H@5     H@10    N@5     N@10    H@5     H@10    N@5     N@10
Best specialized  0.2548  0.3052  0.1971  0.2091  0.2019  0.2584  0.1494  0.1675
GEMS              0.4285  0.5121  0.3251  0.3465  0.4025  0.5159  0.2975  0.3341

(*) All GEMS improvements are statistically significant ($p<0.01$).

Ablation Study (Qilin):

Variant Rec@5 Search@5
Full GEMS 0.4285 0.1511
– null-space 0.3782 0.1023
– multi-subspace 0.3294 0.0856
– both (subspace only) 0.3081 0.0732

Both the multi-subspace and null-space modules provide substantial, additive benefits.

5. Principles Underlying Effectiveness

GEMS achieves parameter-efficient, stable joint fine-tuning by addressing two critical optimization challenges:

  • Gradient conflict: Standard PEFT techniques (e.g., LoRA) allow cross-task gradient interference, impeding both tasks under divergent objectives. The multi-subspace decomposition routes updates primarily through non-overlapping adaptive subspaces, reducing destructive conflict (>85% less than LoRA).
  • Knowledge drift (“intent shift”): Fine-tuning for S&R can overwrite foundational language and intent understanding, causing pre- to post-tuning regression in global reasoning. The null-space projection method cuts the “correct→incorrect” degeneration rate by 11–15 percentage points.

No formal convergence guarantees are provided, but empirical evidence indicates GEMS’s architectural decomposition and projection mechanisms stabilize and enhance LLM-based multi-task learning without new parameters or inference cost.

6. Broader Context and Significance

By solving the S&R unification problem with a combination of adaptive subspace routing and pre-trained semantics preservation, GEMS advances both multi-task parameter-efficient fine-tuning methodologies and the practical deployment of LLMs in user-facing, dual-intent online systems. Its framework is positioned for adaptation to other scenarios where multi-objective interference and catastrophic forgetting are central. The demonstrated statistical and practical significance of its improvements on public and proprietary datasets establishes GEMS as a robust state-of-the-art PEFT protocol for unified LLM adaptation to heterogeneous downstream tasks (Zhao et al., 14 Jan 2026).
