- The paper shows that RL finetuning updates only 5%-30% of LLM parameters, achieving near-identical performance to full finetuning.
- It reveals that RL induces intrinsic sparsity across various LLMs and algorithms while preserving pretrained capabilities.
- Subnetwork-only finetuning replicates full update results, implying efficiency gains and transferable parameter structures.
This paper investigates a surprising phenomenon in the Reinforcement Learning (RL) finetuning of LLMs: RL updates only a small subnetwork, typically 5%-30% of the total parameters, while the rest remain largely unchanged (2505.11711). This "RL-induced parameter update sparsity" is observed across 7 different RL algorithms (PPO, GRPO, DPO, ORPO, KTO, SimPO, PRIME) and 10 LLMs. Notably, this sparsity arises intrinsically, without explicit regularization or architectural constraints.
A key conjecture proposed is that finetuning only this identified subnetwork, while keeping other parameters frozen, can produce a model nearly identical to one obtained through full finetuning, both in terms of performance and parameter values.
Key Findings and Contributions:
- RL Induces Sparse Updates, SFT Induces Dense Ones:
- RL finetuning consistently leads to sparse parameter updates, often exceeding 70% sparsity (i.e., over 70% of parameters are unchanged). For example, Llama-3.1-Tulu-3-70B-DPO showed 95.2% sparsity, and DeepSeek-R1-Zero (trained directly with RL from a base model) showed 86.0% sparsity.
- In contrast, Supervised Fine-Tuning (SFT) tends to produce dense updates, with sparsity typically between 6%-15%.
- This suggests RL might better preserve pretrained capabilities by modifying fewer parameters.
- Sparsity Characteristics:
- The sparse updates in RL are not concentrated in specific layers or components (like attention heads or MLP layers). Instead, nearly all parameter matrices (e.g., Q, K, V projections, MLP layers) receive similarly sparse updates.
- An exception is Layer Normalization, whose parameters receive very few or no updates.
- Despite being sparse, the updates are almost always full-rank: RL modifies a small subset of parameters that nonetheless spans nearly the full representational capacity of each parameter matrix, rather than confining updates to a low-rank subspace (as LoRA does). A minimal way to check both properties is sketched below.
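As a concrete illustration (not from the paper), the following sketch checks both properties for a single weight matrix, given its values before and after RL finetuning; the tensor names and the 10⁻⁵ tolerance mirror the paper's setup but are otherwise assumptions.

```python
import torch

def update_stats(w_before: torch.Tensor, w_after: torch.Tensor, tol: float = 1e-5):
    """Sparsity and rank of the update to one weight matrix (illustrative check)."""
    delta = w_after.float() - w_before.float()
    changed = delta.abs() > tol                      # entries treated as "updated"
    sparsity = 1.0 - changed.float().mean().item()   # fraction of unchanged entries
    rank = torch.linalg.matrix_rank(delta).item()    # full-rank iff rank == min(delta.shape)
    return sparsity, rank, min(delta.shape)

# Hypothetical usage with one attention projection from two checkpoints:
# sparsity, rank, max_rank = update_stats(w0, w1)
# print(f"sparsity={sparsity:.1%}, rank={rank}/{max_rank}")
```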
- Subnetwork Sufficiency (Conjecture Support):
- Experiments with the DPO and PRIME algorithms demonstrate that identifying the subnetwork updated during full RL finetuning, and then retraining from the same initial checkpoint while updating only the parameters within this subnetwork (masking gradients for all others), leads to:
- Performance matching or even exceeding that of the fully finetuned model: for DPO, an average +1.6 improvement across tasks; for PRIME, a +2.4 average improvement on MATH500.
- The resulting model parameters (θ_sub) being nearly identical to those of the fully finetuned model (θ_full): for DPO, 94.0% of weights matched (tolerance 10⁻⁵) and for PRIME 90.5%; with a tolerance of 10⁻⁴, the parameters were 100% identical.
- This goes beyond the Lottery Ticket Hypothesis (LTH), which posits performance recovery; here, the exact model parameters are largely recovered.
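A minimal sketch of how such a parameter-level comparison could be made, assuming both models' weights are available as state dicts (this is illustrative, not the paper's code):

```python
import torch

def fraction_identical(state_a: dict, state_b: dict, tol: float = 1e-5) -> float:
    """Fraction of weights that agree between two finetuned models within a tolerance."""
    same = total = 0
    for name, tensor_a in state_a.items():
        tensor_b = state_b[name]
        eq = torch.isclose(tensor_a.float(), tensor_b.float(), atol=tol, rtol=0.0)
        same += eq.sum().item()
        total += eq.numel()
    return same / total

# e.g. fraction_identical(model_sub.state_dict(), model_full.state_dict(), tol=1e-5)
```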
- Consistency of Subnetworks:
- Subnetworks identified through RL show substantial overlap even when varying:
- Random seeds: Overlap of ~60% compared to a random baseline of ~36%.
- Training data: Overlap of 26.7%-67.1% compared to random baselines of 14.6%-36.7%.
- Seed, data, and RL algorithm simultaneously (stress test): Still notable overlaps (e.g., 59.1% for DPO subnetwork vs. PRIME subnetwork, compared to a 23.0% random baseline).
- This suggests a partially transferable structure within the pretrained model that RL consistently leverages.
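One plausible way to quantify such overlap between two update masks is sketched below; the intersection-based metric and the chance-level baseline are assumptions and may differ from the paper's exact definition.

```python
import torch

def subnetwork_overlap(mask_a: torch.Tensor, mask_b: torch.Tensor):
    """Overlap between two boolean update masks, plus a chance-level baseline.

    Overlap is computed as the fraction of parameters updated in A that are also
    updated in B; the baseline is B's overall update density, i.e. what a random
    subnetwork of the same size would share with A.
    """
    a, b = mask_a.flatten(), mask_b.flatten()
    overlap = (a & b).sum().item() / max(a.sum().item(), 1)
    baseline = b.float().mean().item()
    return overlap, baseline
```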
- Reasons for Sparsity:
- Training on in-distribution data is a primary driver. When the model learns from data similar to its current policy's output distribution (common in on-policy RL or when SFT precedes RL on the same data), fewer parameter changes are needed.
- SFT on in-distribution data (e.g., using rejection sampling) also produces sparse updates (e.g., ~90% sparsity for Qwen2.5-Math-7B with RFT).
- Conversely, DPO on out-of-distribution data (no prior SFT on that data) leads to dense updates (e.g., ~7% sparsity for Zephyr-7b-Beta).
- KL-divergence regularization and gradient clipping have limited impact on the overall update sparsity. Models trained without these still showed high sparsity.
- SFT before RL is not the main cause; models like DeepSeek-R1-Zero skip SFT and still show high RL update sparsity.
- Training duration: Sparsity tends to decrease with more training steps but appears to converge to a non-trivial level. For example, PRIME showed sparsity converging around 80%. DeepSeek-R1-Zero, despite 8K training steps, had 86% sparsity.
- Some parameters (around 8% in PRIME experiments) outside the final subnetwork receive non-zero gradients during training that eventually cancel out.
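To make the in-distribution point concrete, rejection-sampling finetuning (RFT) keeps only completions sampled from the current policy that pass a correctness check. The sketch below is illustrative only; `generate` and `is_correct` are hypothetical helpers, not functions from the paper.

```python
def build_in_distribution_sft_data(model, prompts, n_samples=8):
    """Collect SFT data drawn from the current policy itself (rejection sampling)."""
    data = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(model, prompt, temperature=1.0)  # sample from the current policy
            if is_correct(prompt, completion):                     # keep only verified answers
                data.append({"prompt": prompt, "completion": completion})
    return data
```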
Practical Implications and Implementation:
- Efficient RL Finetuning: The findings suggest potential for more efficient RL training. If the subnetwork can be identified early or predicted, training could focus only on these parameters, saving significant computational resources.
    # Conceptual pseudocode for subnetwork-only finetuning (helper functions are illustrative)
    initial_model = load_pretrained_model()
    full_rl_model = finetune_rl_full(initial_model, data, rl_hyperparams)

    # Identify the subnetwork mask: True where a parameter changed during full RL finetuning
    mask = (initial_model.params != full_rl_model.params)

    # Retrain from the same initialization, updating only parameters inside the subnetwork
    model_sub = load_pretrained_model()  # reset to the initial pretrained/SFT checkpoint
    for epoch in range(num_epochs):
        for batch in data_loader:
            loss, grads = compute_loss_and_grads(model_sub, batch, rl_hyperparams)
            masked_grads = grads * mask  # zero gradients for parameters outside the subnetwork
            optimizer.step(masked_grads)
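In an actual PyTorch loop, the same masking can be applied by zeroing gradients in place between `loss.backward()` and `optimizer.step()`. The sketch below assumes a `masks` dictionary mapping parameter names to boolean tensors and is not the paper's implementation.

```python
import torch

def apply_subnetwork_mask(model: torch.nn.Module, masks: dict) -> None:
    """Zero gradients of parameters outside the identified subnetwork."""
    for name, param in model.named_parameters():
        if param.grad is not None and name in masks:
            param.grad.mul_(masks[name].to(dtype=param.grad.dtype, device=param.grad.device))

# Inside the training loop:
#   loss.backward()
#   apply_subnetwork_mask(model, masks)
#   optimizer.step()
#   optimizer.zero_grad()
```

Note that stateful optimizers (e.g., AdamW with weight decay) can still nudge masked parameters slightly, so exactly reproducing subnetwork-only training may additionally require freezing those parameters or masking the final update.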
- Understanding Model Adaptation: The research provides insights into how LLMs adapt during RL. RL seems to find and refine specific "circuits" or pathways within the larger network.
- Transferability: The consistency of subnetworks across different conditions suggests that knowledge about important parameters might be transferable, potentially speeding up finetuning for new, similar tasks or with different RL algorithms.
- LoRA vs. Intrinsic Sparsity: While LoRA imposes low-rank updates, RL intrinsically finds sparse, full-rank updates. This suggests that current parameter-efficient finetuning methods (PEFTs) like LoRA might not fully capture the natural optimization path of RL. Future PEFTs could try to identify and train these sparse, full-rank subnetworks.
- SFT vs. RL Parameter Updates: The stark difference (dense SFT updates vs. sparse RL updates) provides a quantifiable distinction in how these two popular finetuning paradigms alter the base model.
Implementation Details from Experiments:
- Sparsity Calculation: Sparsity is defined as 1 − (number of non-zero elements in (θ₁ − θ₀)) / n, where θ₀ and θ₁ are the parameters before and after finetuning and n is the total number of parameters. A tolerance of 10⁻⁵ is used to decide whether a bfloat16 difference counts as non-zero.
- Models Analyzed: Publicly available checkpoints from Hugging Face for models like Tulu, Eurus, DeepSeek Math, and others finetuned with DPO, GRPO, ORPO, KTO, PPO, SimPO, PRIME.
- Subnetwork Finetuning Setup:
- DPO: Implemented with Open-Instruct, LLaMA-3.1-Tulu-3-8B-SFT base model, batch size 128, LR 5×10⁻⁷, 1 epoch on allenai/llama-3.1-tulu-3-8b-preference-mixture.
- PRIME: Implemented with verl, Qwen2.5-Math-7B base model (the main text refers to Eurus-2-7B-SFT, while the appendix lists Qwen2.5-Math-7B for the PRIME hyperparameters), batch size 64, actor LR 5×10⁻⁷, reward-model LR 1×10⁻⁶, 15 epochs on GSM8K and MATH.
- Consistency Experiments: Controlled ablations on Llama-3.1-Tulu-3-8B-SFT, varying seeds, using Tulu preference data vs. PRIME rollout data.
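For quick reference, these reported settings can be summarized as follows (an illustrative summary only; consult the paper and the Open-Instruct / verl configurations for authoritative values):

```python
subnetwork_finetuning_setups = {
    "DPO": {
        "framework": "Open-Instruct",
        "base_model": "LLaMA-3.1-Tulu-3-8B-SFT",
        "dataset": "allenai/llama-3.1-tulu-3-8b-preference-mixture",
        "batch_size": 128,
        "learning_rate": 5e-7,
        "epochs": 1,
    },
    "PRIME": {
        "framework": "verl",
        "base_model": "Qwen2.5-Math-7B",
        "datasets": ["GSM8K", "MATH"],
        "batch_size": 64,
        "actor_learning_rate": 5e-7,
        "reward_learning_rate": 1e-6,
        "epochs": 15,
    },
}
```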
The paper concludes that training on in-distribution data is a key reason for this sparsity, opening avenues for more efficient RLHF strategies that leverage this intrinsic property of LLM finetuning.