
Gradient Multi-Subspace Tuning (GEMS)

Updated 21 January 2026
  • GEMS is a parameter-efficient fine-tuning framework that uses gradient multi-subspace decomposition to orthogonalize shared and task-specific updates, significantly reducing gradient conflicts.
  • It employs a null-space projection to preserve pre-trained semantics by constraining updates away from dominant general knowledge directions, avoiding knowledge drift.
  • Experimental evaluations on Qilin and Amazon datasets demonstrate substantial performance gains over existing baselines, highlighting its scalability and robustness for joint search and recommendation tuning.

Gradient Multi-Subspace Tuning (GEMS) is a parameter-efficient fine-tuning (PEFT) framework designed to unify item search and recommendation (S&R) using LLMs, addressing two fundamental problems: (1) task gradient conflicts from divergent optimization signals, and (2) undesirable drifts in the model's general-domain knowledge due to overfitting during multi-task adaptation. GEMS combines a multi-subspace gradient decomposition that orthogonalizes shared and task-specific updates, with a null-space projection that preserves pre-trained semantics by confining updates away from dominant general knowledge directions. These innovations collectively enable effective, scalable joint tuning of S&R, with extensive experimentation demonstrating significant empirical gains over prevailing baselines (Zhao et al., 14 Jan 2026).

1. Multi-Subspace Decomposition

GEMS disentangles optimization signals by decomposing gradients into low-rank subspaces corresponding to shared, search-specific, and recommendation-specific behaviors. At each step $t$, for each LLM layer parameter $W_t \in \mathbb{R}^{m \times n}$ and losses $\mathcal{L}_{\rm src}$ (search) and $\mathcal{L}_{\rm rec}$ (recommendation), GEMS computes three gradients:

  • $G_t^{\rm src} = -\nabla_{W_t}\mathcal{L}_{\rm src}$
  • $G_t^{\rm rec} = -\nabla_{W_t}\mathcal{L}_{\rm rec}$
  • $G_t^{\rm shared} = -\nabla_{W_t}(\mathcal{L}_{\rm src}+\mathcal{L}_{\rm rec})$

Every $T_{\rm svd}$ steps, each gradient matrix is decomposed by SVD:

$$G_t^{(\text{type})} = U_t^{(\text{type})}\,\Sigma_t^{(\text{type})}\,(V_t^{(\text{type})})^\top$$

where $\text{type} \in \{\text{shared}, \text{src}, \text{rec}\}$. The subspace basis $U_{t,r}^{(\text{type})} \in \mathbb{R}^{m\times r}$ (top-$r$ left-singular vectors) defines a task's effective adaptation region.

Gradients are projected into these subspaces:

$$G_t^{(\text{type},\,r)} = (U_{t,r}^{(\text{type})})^\top G_t^{(\text{type})} \in \mathbb{R}^{r\times n}$$

Adam-style updates are performed in the subspace (yielding $\Delta_t^{(\text{type},\,r)}$) and then mapped back:

$$\Delta_t^{(\text{type})} = \alpha\,U_{t,r}^{(\text{type})}\,\Delta_t^{(\text{type},\,r)}$$

with global scaling factor $\alpha$.
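The projection, subspace Adam step, and back-mapping above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `subspace_tune`, the `state` dict, and the omission of Adam bias correction are all simplifications.

```python
import numpy as np

def subspace_tune(G, state, r=8, alpha=1.0, lr=1e-3,
                  beta1=0.9, beta2=0.999, eps=1e-8, refresh=False):
    """One subspace step for a single gradient type (shared / src / rec).

    G: (m, n) gradient matrix for this type.
    state: dict caching the basis U_r and the Adam moments M, V.
    """
    if refresh or "U_r" not in state:
        # Every T_svd steps: recompute the top-r left-singular vectors of G.
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        state["U_r"] = U[:, :r]                    # (m, r) subspace basis
        state["M"] = np.zeros((r, G.shape[1]))     # reset first moment
        state["V"] = np.zeros((r, G.shape[1]))     # reset second moment
    U_r = state["U_r"]
    G_r = U_r.T @ G                                # project: (r, n)
    # Adam moments are maintained entirely in the low-rank subspace.
    state["M"] = beta1 * state["M"] + (1 - beta1) * G_r
    state["V"] = beta2 * state["V"] + (1 - beta2) * G_r**2
    step_r = lr * state["M"] / (np.sqrt(state["V"]) + eps)
    # Map the subspace step back to the full parameter space.
    return alpha * (U_r @ step_r)                  # (m, n)
```

By construction, the returned update lies entirely in the span of the cached basis, which is what keeps the three per-type updates in largely non-overlapping regions.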

A gating MLP, operating on normalized task loss ratios, gradient norms, and batch sizes, computes task weights $(\alpha_{\rm src}, \alpha_{\rm rec})$ with $\alpha_{\rm src}+\alpha_{\rm rec}=1$, and the fused update is

$$\Delta_t^{\rm fuse} = \Delta_t^{\rm shared} + \alpha_{\rm src}\,\Delta_t^{\rm src} + \alpha_{\rm rec}\,\Delta_t^{\rm rec}$$

Because subspaces are constructed from distinct gradient statistics, their overlaps are minimal, and GEMS reduces destructive layerwise gradient conflict by more than 85% compared to LoRA.
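The gating step can be sketched as a one-hidden-layer MLP whose two outputs are passed through a softmax so the weights sum to one. The layer sizes, feature ordering, and function name here are illustrative assumptions, not specified by the source:

```python
import numpy as np

def gate_weights(features, W1, b1, W2, b2):
    """Tiny gating MLP: feature vector -> two logits -> softmax weights.

    features: e.g. normalized loss ratios, gradient norms, batch sizes
    (the exact feature set and dimensions are assumptions).
    """
    h = np.tanh(features @ W1 + b1)       # hidden layer
    logits = h @ W2 + b2                  # one logit per task
    e = np.exp(logits - logits.max())     # numerically stable softmax
    w = e / e.sum()
    return w[0], w[1]                     # alpha_src + alpha_rec == 1
```

The softmax guarantees the convexity constraint on the two task weights regardless of how the MLP parameters are trained.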

2. Null-Space Projection

To avert degradation of broad-domain reasoning and preserve the LLM's pre-trained intent understanding, GEMS projects updates onto the null space of the principal directions derived from a large general-domain corpus $\mathcal{C}$ (e.g., Wikipedia).

Given representations $F \in \mathbb{R}^{n \times C}$ for each layer, with eigendecomposition $F F^\top = U_{\rm pre} \Sigma_{\rm pre} U_{\rm pre}^\top$, the top-$k$ left-singular vectors $U_{\rm pre}^k \in \mathbb{R}^{n\times k}$ span the directions carrying most of the LLM's prior knowledge. GEMS constructs the null-space projector

$$P_\perp = I - U_{\rm pre}^k (U_{\rm pre}^k)^\top$$

and applies it after gradient fusion:

$$\Delta_t^{\rm final} = P_\perp\,\Delta_t^{\rm fuse}$$

This operation excises update components that could overwrite global semantic competence, sharply reducing user intent “shifts” and halving the rate of “correct-before → incorrect-after” failures compared to LoRA.
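A minimal sketch of building and sanity-checking the projector, assuming the SVD of the representation matrix is affordable; `null_space_projector` is an illustrative name, not from the paper:

```python
import numpy as np

def null_space_projector(F, k):
    """Build P_perp = I - U_k U_k^T from general-corpus representations F (n x C).

    Any update multiplied by P_perp has zero component along the top-k
    principal directions of the pre-trained representations.
    """
    U, _, _ = np.linalg.svd(F, full_matrices=False)
    U_k = U[:, :k]                       # (n, k) dominant knowledge directions
    n = F.shape[0]
    return np.eye(n) - U_k @ U_k.T       # (n, n) orthogonal projector

# Applying it after fusion: delta_final = P_perp @ delta_fuse
```

Because $P_\perp$ is an orthogonal projector, it is idempotent and annihilates exactly the protected subspace, so the check reduces to two matrix identities.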

3. Optimization Procedure

The GEMS optimization loop, performed per layer in parallel, comprises:

  1. Sampling a minibatch with search and recommendation dual-task structure.
  2. Computing task losses and their gradients.
  3. Subspace decomposition and Adam updates within each subspace via SVD-derived projection.
  4. Adaptive fusion of shared and task-specific step directions using a learned gating mechanism.
  5. Null-space projection to confine updates away from general-domain knowledge vectors.
  6. Weight parameter update: $W \leftarrow W + \eta\,\Delta_t^{\rm final}$.

Gradient conflict is quantified via:

$$\rho = 1 - \frac{g_{\rm src} \cdot g_{\rm rec}}{\|g_{\rm src}\|\,\|g_{\rm rec}\|}$$

and is mitigated by the subspace routing.
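The conflict measure is one minus the cosine similarity between the flattened task gradients, so $\rho = 0$ for aligned gradients, $1$ for orthogonal ones, and $2$ for directly opposed ones:

```python
import numpy as np

def gradient_conflict(g_src, g_rec):
    """rho = 1 - cos(g_src, g_rec), computed on flattened gradients."""
    g_src, g_rec = g_src.ravel(), g_rec.ravel()
    cos = g_src @ g_rec / (np.linalg.norm(g_src) * np.linalg.norm(g_rec))
    return 1.0 - cos
```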

Pseudocode:

Initialize W, t ← 0; M, V ← 0
SubspaceTune(type, G, t):
    if t mod T_svd == 0: recompute U_r (SVD)
    Project: G^(r) = U_r^T G
    Update Adam states in r
    Compute step Δ^(r), then Δ = α U_r Δ^(r)
    return Δ

NullProject(Δ, U_pre^k): return (I - U_pre^k (U_pre^k)^T) Δ

Repeat:
    Sample minibatch {(u,H_u,q,i^*)}
    Compute L_src, L_rec
    Backprop to get G_src, G_rec, G_shared
    Δ_src ← SubspaceTune(src, G_src, t)
    Δ_rec ← SubspaceTune(rec, G_rec, t)
    Δ_shared ← SubspaceTune(shared, G_shared, t)
    Fuse: Δ_fuse = Δ_shared + α_src Δ_src + α_rec Δ_rec
    Null-project: Δ_final = NullProject(Δ_fuse, U_pre^k)
    W ← W + η Δ_final; t ← t + 1
Until convergence

4. Experimental Validation

Datasets:

  • Qilin: Real S/R logs from a social-media platform (3,816 users, 275K items).
  • Amazon Electronics 5-Core: Synthetic queries with real purchase interactions (62K users, 158K items).

Evaluation Metrics: Hit@5, Hit@10, NDCG@5, NDCG@10, with each ranking over 100 candidates (1 positive, 99 negatives).
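With a single positive item per 100-candidate list, both metric families reduce to simple functions of the positive item's 1-based rank; a minimal sketch (function names are illustrative):

```python
import math

def hit_at_k(rank, k):
    """1 if the single positive item appears in the top k, else 0."""
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k):
    """With one relevant item the ideal DCG is 1, so NDCG@k = 1/log2(rank+1)
    when the positive is ranked within the top k, and 0 otherwise."""
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0
```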

Baselines:

  • S/R: NCF, TIGER, LETTER (rec); ANCE, WebUltron, GenRet (search)
  • Unified S/R: UnifiedSSR, BSR, Sem-BSR, GenSAR
  • PEFT: LoRA, LoRA-MoE

Backbones:

  • Flan-T5-base (full fine-tuning)
  • Qwen2.5-3B (PEFT)

Method            Qilin                           Amazon
                  H@5     H@10    N@5     N@10    H@5     H@10    N@5     N@10
Best specialized  0.2548  0.3052  0.1971  0.2091  0.2019  0.2584  0.1494  0.1675
GEMS              0.4285  0.5121  0.3251  0.3465  0.4025  0.5159  0.2975  0.3341

(*) All GEMS improvements are statistically significant ($p<0.01$).

Ablation Study (Qilin):

Variant Rec@5 Search@5
Full GEMS 0.4285 0.1511
– null-space 0.3782 0.1023
– multi-subspace 0.3294 0.0856
– both (subspace only) 0.3081 0.0732

Both the multi-subspace and null-space modules provide substantial, additive benefits.

5. Principles Underlying Effectiveness

GEMS achieves parameter-efficient, stable joint fine-tuning by addressing two critical optimization challenges:

  • Gradient conflict: Standard PEFT techniques (e.g., LoRA) allow cross-task gradient interference, impeding both tasks under divergent objectives. The multi-subspace decomposition routes updates primarily through non-overlapping adaptive subspaces, reducing destructive conflict (>85% less than LoRA).
  • Knowledge drift (“intent shift”): Fine-tuning for S&R can overwrite foundational language and intent understanding, causing pre- to post-tuning regression in global reasoning. The null-space projection method cuts the “correct→incorrect” degeneration rate by 11–15 percentage points.

No formal convergence guarantees are provided, but empirical evidence indicates GEMS’s architectural decomposition and projection mechanisms stabilize and enhance LLM-based multi-task learning without new parameters or inference cost.

6. Broader Context and Significance

By solving the S&R unification problem with a combination of adaptive subspace routing and pre-trained semantics preservation, GEMS advances both multi-task parameter-efficient fine-tuning methodologies and the practical deployment of LLMs in user-facing, dual-intent online systems. Its framework is positioned for adaptation to other scenarios where multi-objective interference and catastrophic forgetting are central. The demonstrated statistical and practical significance of its improvements on public and proprietary datasets establishes GEMS as a robust state-of-the-art PEFT protocol for unified LLM adaptation to heterogeneous downstream tasks (Zhao et al., 14 Jan 2026).
