
GRASP LoRA: Guided Adapter Sparsity via GRPO

Updated 17 January 2026
  • The paper proposes GRASP LoRA, a method that leverages GRPO for dynamically optimizing adapter sparsity, achieving up to 7× faster fine-tuning on cross-lingual tasks.
  • GRASP LoRA is a parameter-efficient approach that merges English and target language LoRA adapters and employs magnitude-based pruning with learnable global prune ratios.
  • Experimental results demonstrate improved metrics in summarization and QA tasks while significantly reducing computational costs, highlighting its practical efficiency.

GRASP LoRA (GRPO Guided Adapter Sparsity Policy) is a parameter-efficient fine-tuning methodology designed for cross-lingual transfer of LLMs under limited computational and data resources. Unlike conventional adapter pruning pipelines that rely on grid search over sparsity ratios—an approach both resource-intensive and coarse—GRASP LoRA transforms the global sparsity ratio into a learnable control variable optimized online by a Group-Relative Policy Optimization (GRPO) controller using minimal development data (Hassan et al., 10 Jan 2026).

1. GRPO Controller: Mathematical Formulation and Optimization

At the core of GRASP LoRA is a stochastic policy over the global prune ratio $\rho \in [p_{\min}, p_{\max}] \subseteq [0, 1]$, parameterized by a univariate Gaussian:

$$\rho \sim \pi_{\mu,\sigma} = \mathcal{N}(\mu, \sigma^2), \qquad \mu \in [p_{\min}, p_{\max}], \; \sigma > 0.$$

Every $K$ optimizer steps, the controller samples $C$ candidate prune ratios $\{\rho_i\}_{i=1}^C$ by drawing $z_i \sim \mathcal{N}(\mu, \sigma^2)$ and setting $\rho_i = \mathrm{clamp}(z_i, p_{\min}, p_{\max})$. Magnitude-thresholded binary masks $M(\rho_i)$ are constructed over the merged LoRA weights $\widetilde{W}$, each inducing a pruned subnetwork. Candidate losses $\ell_i$ and a baseline loss $\ell_{\text{base}}$ are evaluated on a fixed micro-development slice of $m$ target-language examples, and the reward is

$$R(\rho_i) = -\ell_i + \ell_{\text{base}}. \tag{1}$$

The policy is optimized via the GRPO surrogate

$$J(\mu, \sigma) = \mathbb{E}_{\rho \sim \pi_{\mu,\sigma}}[R(\rho)] - \frac{\beta}{2}(\mu - \rho_{\text{curr}})^2 + \tau H(\pi_{\mu,\sigma}), \tag{2}$$

where $\beta$ anchors the mean toward the current ratio $\rho_{\text{curr}}$ and $\tau$ rewards policy entropy. Score-function gradients $g_\mu, g_\sigma$ are computed from centered advantages, and parameter updates are kept within the admissible bounds. A new prune ratio is committed only if no micro-dev loss increase is observed, with per-commit step size bounded by $\Delta_{\max}$.
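A minimal NumPy sketch of the controller loop under toy assumptions: the micro-dev loss is replaced by a hypothetical quadratic stand-in (in GRASP LoRA it is the loss of the model pruned with $M(\rho)$ on the $m$ micro-dev examples), and the $\sigma$ initialization and clipping bounds are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bounds and controller state; prune range, p_init, C, and controller lr
# follow the paper's setup, sigma init and its clip bounds are assumptions.
p_min, p_max = 0.10, 0.80
mu, sigma = 0.40, 0.10
lr, C = 0.05, 3

def micro_dev_loss(rho):
    # Hypothetical stand-in for the micro-dev loss of the pruned model.
    return (rho - 0.55) ** 2

for probe in range(50):
    base = micro_dev_loss(mu)                       # baseline loss at current mean
    z = rng.normal(mu, sigma, size=C)               # sample C candidates
    rho = np.clip(z, p_min, p_max)                  # clamp to admissible range
    R = np.array([base - micro_dev_loss(r) for r in rho])  # R = -l_i + l_base
    A = R - R.mean()                                # centered (group-relative) advantages
    # Score-function gradients of log N(z; mu, sigma^2), weighted by advantages.
    g_mu = np.mean(A * (z - mu) / sigma**2)
    g_sigma = np.mean(A * ((z - mu) ** 2 - sigma**2) / sigma**3)
    mu = float(np.clip(mu + lr * g_mu, p_min, p_max))
    sigma = float(np.clip(sigma + lr * g_sigma, 0.01, 0.25))
```

With this stand-in loss the mean drifts toward the ratio that minimizes the micro-dev loss, which is the qualitative behavior the controller is designed to exhibit.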

2. End-to-End Algorithmic Workflow

GRASP LoRA interleaves adapter fine-tuning, policy-guided pruning, and evaluation in three main phases:

  1. Adapter Training and Merging
    • English LoRA adapters are trained on high-resource English data with the backbone model frozen.
    • Target-language LoRA adapters are trained on low-resource target language data, again with a frozen backbone.
    • The adapters are merged by summing their low-rank update matrices at each projection site: $\widetilde{W} = \Delta W_{\text{en}} + \Delta W_{\text{tgt}}$.
  2. Sparsity Policy Learning (Controller Rounds)
    • Initialize $\rho_{\text{curr}} = p_{\text{init}}$, with controller parameters set to $(\mu, \sigma)$.
    • In each controller round:
      • Fine-tune the merged adapters on target data under the current mask $M(\rho_{\text{curr}})$.
      • Every $K$ steps, probe $C$ candidate prune ratios, evaluate their micro-dev losses, and update the controller policy using Eqs. (1)-(2).
      • Commit to a new $\rho_{\text{curr}}$ if improvement is observed, update the masks, and clear optimizer state for newly zeroed parameters.
    • Upon completion, select $\rho^\star$ by minimizing post-hoc validation loss.
  3. Final Pruning and Fine-tuning
    • Reload the frozen backbone and pre-controller merged adapters.
    • Build the final mask $M(\rho^\star)$ and fine-tune the masked model on the full target data until early stopping on a held-out dev set.

3. Adapter Merging and Magnitude-based Pruning

LoRA adapter merging is performed by summing the low-rank update matrices for the source and target languages:

$$\Delta W_{\text{merge}} = \sum_{a \in \{\text{en},\,\text{tgt}\}} B_a A_a.$$

Pruning is implemented tensor-wise: for prune ratio $\rho$, the mask retains the top $(1-\rho)$ fraction of entries per tensor, ordered by magnitude. Let $d_t$ be the number of parameters in tensor $t$, $k_t = \lfloor \rho\, d_t \rfloor$, and $\tau_t(\rho)$ the $k_t$-th order statistic of $|\widetilde{W}^{(t)}|$. Then

$$M^{(t)}_j(\rho) = \begin{cases} 1 & \text{if } |\widetilde{W}^{(t)}_j| > \tau_t(\rho) \\ 0 & \text{otherwise.} \end{cases}$$

This mask is applied to all adapters on the (frozen) backbone model.
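The merge-then-prune step above can be sketched in NumPy; the shapes, rank, and prune ratio are illustrative, and ties at the threshold are dropped by the strict inequality, matching the mask definition:

```python
import numpy as np

def magnitude_mask(W, rho):
    """Tensor-wise magnitude mask: drop the rho fraction of smallest-magnitude
    entries, keep the rest (sketch of the M^(t)(rho) definition)."""
    k = int(np.floor(rho * W.size))          # k_t = floor(rho * d_t)
    if k == 0:
        return np.ones_like(W, dtype=bool)
    tau = np.partition(np.abs(W).ravel(), k - 1)[k - 1]  # k-th order statistic
    return np.abs(W) > tau                   # strict '>' excludes ties at tau

# Merged LoRA update: sum of the English and target low-rank products B_a A_a.
rng = np.random.default_rng(0)
B_en, A_en = rng.normal(size=(16, 8)), rng.normal(size=(8, 16))
B_tgt, A_tgt = rng.normal(size=(16, 8)), rng.normal(size=(8, 16))
W_merged = B_en @ A_en + B_tgt @ A_tgt

mask = magnitude_mask(W_merged, rho=0.40)
W_pruned = W_merged * mask
```

With continuous random weights there are no ties, so exactly $\lfloor \rho d_t \rfloor$ entries are zeroed per tensor.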

4. Experimental Protocol and Hyperparameters

The evaluation covers cross-lingual transfer for summarization and extractive QA:

  • Datasets:
    • XL-Sum (English→Arabic, English→Chinese): English train/dev 10k/1k; Arabic/Chinese train 50, dev 50, micro 16, test 100.
    • MLQA (extractive QA): English train 3k; Arabic/Chinese train 50, micro 16, test 100.
  • Model and PEFT Setup:
    • Backbone: Llama 3 8B (frozen).
    • LoRA applied to the Q and V projections; rank 8, $\alpha = 32$, dropout 0.05.
  • Optimization and Controller:
    • Adapter fine-tuning: 10 epochs, lr $1\mathrm{e}{-4}$, AdamW, batch size 1, max input 2200 tokens.
    • GRPO settings: prune range $[0.10, 0.80]$, $p_{\text{init}} = 0.40$, probe interval $K = 10$, $C = 3$ candidates, micro-dev size $m = 16$, $\Delta_{\max} = 0.10$, controller lr $0.05$.
    • Regularization $(\beta, \tau)$ is tuned per task: Arabic XL-Sum $(0.04, 0.02)$, Chinese XL-Sum $(0.04, 0.01)$, Arabic MLQA $(0.04, 0.01)$, Chinese MLQA $(0.05, 0.01)$.
  • Evaluation Metrics:
    • Summarization: BERTScore-F1, BLEU-4, ROUGE-L (and additional variants).
    • QA: BERTScore-F1, Exact Match, token F1 (plus BLEU/ROUGE/chrF for spans).
  • Prompt Structure: Unchanged across languages; e.g., for summarization, "Article:{article} → Summary:"; for QA, "Context:{context} Question:{question} → Answer:".
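The setup above can be collected into a single configuration sketch; the field names are illustrative (not taken from the authors' code), while the values are as reported in this section:

```python
# Hyperparameters from Section 4, gathered into one config dict.
GRASP_CONFIG = {
    "backbone": "Llama 3 8B (frozen)",
    "lora": {"targets": ["q_proj", "v_proj"], "rank": 8,
             "alpha": 32, "dropout": 0.05},
    "finetune": {"epochs": 10, "lr": 1e-4, "optimizer": "AdamW",
                 "batch_size": 1, "max_input_tokens": 2200},
    "controller": {"prune_range": (0.10, 0.80), "p_init": 0.40,
                   "probe_interval_K": 10, "candidates_C": 3,
                   "micro_dev_m": 16, "delta_max": 0.10, "lr": 0.05},
    # Tuned (beta, tau) per task:
    "reg": {"xlsum_arabic": (0.04, 0.02), "xlsum_chinese": (0.04, 0.01),
            "mlqa_arabic": (0.04, 0.01), "mlqa_chinese": (0.05, 0.01)},
}
```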

5. Empirical Results and Performance Analysis

GRASP LoRA demonstrates consistent improvements over strong merge-and-prune grid search baselines:

Task             Baseline Prune   GRASP Prune   BERT-F1 Δ   BLEU-4 Δ   ROUGE-L Δ   EM Δ    F1 Δ    Time Δ
XL-Sum Arabic    70%              67.49%        +0.88       +1.75      +2.13       —       —       3.90× faster
XL-Sum Chinese   50%              56.94%        +1.62       +1.73      +1.45       —       —       5.66× faster
MLQA Arabic      40%              48.97%        +0.56       —          —           +2.67   +2.22   6.40× faster
MLQA Chinese     10%              23.73%        +1.98       —          —           +1.50   +0.67   7.45× faster

Additional findings include:

  • The approach reduces end-to-end runtime by a factor of 4–7× compared to grid search baselines.
  • Improvements are robust to micro-dev size: $\rho^\star$ and BERT-F1 are stable across $m \in \{4, 8, 16, 32\}$.
  • Regularization ablations show that removing entropy or mean anchoring leads to excessive pruning (~79%) and a 1–2 point drop in evaluation metrics.
  • Qualitative analysis shows superior semantic faithfulness in summarization and more accurate answer extraction in QA compared to baselines.
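As context for the regularization ablation: ignoring the clamp to $[p_{\min}, p_{\max}]$, the entropy bonus in the surrogate objective has the closed form of a Gaussian entropy,

$$H(\pi_{\mu,\sigma}) = \tfrac{1}{2}\ln\!\left(2\pi e\, \sigma^2\right),$$

which grows with $\sigma$. A plausible mechanism for the ablation result is therefore that dropping $\tau H$ lets the policy variance collapse and commit prematurely, while dropping the mean anchor permits large unconstrained jumps in $\mu$; either failure mode is consistent with the observed drift to excessive pruning.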

6. Implementation Considerations and Practical Implications

Key features for faithful and efficient deployment include:

  • Use of a small, fixed micro-dev slice (16 examples) for all controller evaluations, independent of the early-stop dev set.
  • All controller reward computations, pruning evaluations, and commitment decisions are logged, enabling post-hoc $\rho^\star$ selection if needed.
  • Identical prompt templates and consistent adapter architectures facilitate experimentation across unrelated linguistic domains.
  • The learnable sparsity policy makes it feasible to select fractional sparsity optima impractical under conventional discrete grid search, especially in low-resource settings.

A plausible implication is that GRASP LoRA extends reliable adapter reuse to previously intractable low-resource regimes by decoupling sparsity hyperparameter tuning from costly grid search. The method offers a systematic pathway for tuning adapter sparsity using only minimal dev resources, providing improved model quality, content coverage, and answer quality relative to strong baselines and yielding major reductions in both computational and annotation costs (Hassan et al., 10 Jan 2026).

References (1)
