
Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (2505.11711v1)

Published 16 May 2025 in cs.LG

Abstract: Reinforcement learning (RL) yields substantial improvements in LLM downstream task performance and alignment with human values. Surprisingly, such large gains result from updating only a small subnetwork comprising just 5 percent to 30 percent of the parameters, with the rest effectively unchanged. We refer to this phenomenon as parameter update sparsity induced by RL. It is observed across all 7 widely used RL algorithms (e.g., PPO, GRPO, DPO) and all 10 LLMs from different families in our experiments. This sparsity is intrinsic and occurs without any explicit sparsity-promoting regularization or architectural constraints. Finetuning the subnetwork alone recovers the test accuracy, and, remarkably, produces a model nearly identical to the one obtained via full finetuning. The subnetworks from different random seeds, training data, and even RL algorithms show substantially greater overlap than expected by chance. Our analysis suggests that this sparsity is not due to updating only a subset of layers; instead, nearly all parameter matrices receive similarly sparse updates. Moreover, the updates to almost all parameter matrices are nearly full-rank, suggesting RL updates a small subset of parameters that nevertheless span almost the full subspaces that the parameter matrices can represent. We conjecture that this update sparsity can be primarily attributed to training on data that is near the policy distribution, whereas techniques that encourage the policy to remain close to the pretrained model, such as KL regularization and gradient clipping, have limited impact.

Summary

  • The paper shows that RL finetuning updates only 5%-30% of LLM parameters, achieving near-identical performance to full finetuning.
  • It reveals that RL induces intrinsic sparsity across various LLMs and algorithms while preserving pretrained capabilities.
  • Subnetwork-only finetuning replicates full update results, implying efficiency gains and transferable parameter structures.

This paper investigates a surprising phenomenon in the Reinforcement Learning (RL) finetuning of LLMs: RL updates only a small subnetwork, typically 5%-30% of the total parameters, while the rest remain largely unchanged (2505.11711). This "RL-induced parameter update sparsity" is observed across 7 different RL algorithms (PPO, GRPO, DPO, ORPO, KTO, SimPO, PRIME) and 10 LLMs. Notably, this sparsity arises intrinsically, without explicit regularization or architectural constraints.

A key conjecture proposed is that finetuning only this identified subnetwork, while keeping other parameters frozen, can produce a model nearly identical to one obtained through full finetuning, both in terms of performance and parameter values.

Key Findings and Contributions:

  1. RL Induces Sparse Updates, SFT Induces Dense Ones:
    • RL finetuning consistently leads to sparse parameter updates, often exceeding 70% sparsity (i.e., over 70% of parameters are unchanged). For example, Llama-3.1-Tulu-3-70B-DPO showed 95.2% sparsity, and DeepSeek-R1-Zero (trained directly with RL from a base model) showed 86.0% sparsity.
    • In contrast, Supervised Fine-Tuning (SFT) tends to produce dense updates, with sparsity typically between 6%-15%.
    • This suggests RL might better preserve pretrained capabilities by modifying fewer parameters.
  2. Sparsity Characteristics:
    • The sparse updates in RL are not concentrated in specific layers or components (like attention heads or MLP layers). Instead, nearly all parameter matrices (e.g., Q, K, V projections, MLP layers) receive similarly sparse updates.
    • An exception is Layer Normalization layers, which receive very few or no updates.
    • Despite the sparsity of the updates, these updates are almost always full-rank. This means RL modifies a small subset of parameters that nevertheless spans almost the full representational capacity of the parameter matrices, rather than constraining updates to a low-rank subspace as LoRA does (a sketch for measuring sparsity, rank, and subnetwork overlap follows this list).
  3. Subnetwork Sufficiency (Conjecture Support):
    • Experiments with the DPO and PRIME algorithms demonstrate that identifying the subnetwork updated during full RL finetuning, then resetting to the initial checkpoint and retraining while only updating parameters within this subnetwork (masking gradients for parameters outside it), leads to:
      • Performance matching or even exceeding that of the fully finetuned model: for DPO, an average +1.6 improvement across tasks; for PRIME, a +2.4 average improvement on MATH500.
      • The resulting model parameters ($\theta_{\text{sub}}$) being nearly identical to those of the fully finetuned model ($\theta_{\text{full}}$): for DPO, 94.0% of weights were identical (tolerance $10^{-5}$), and for PRIME, 90.5%. With a tolerance of $10^{-4}$, the parameters were 100% identical.
    • This goes beyond the Lottery Ticket Hypothesis (LTH), which posits performance recovery; here, the exact model parameters are largely recovered.
  4. Consistency of Subnetworks:
    • Subnetworks identified through RL show substantial overlap even when varying:
      • Random seeds: Overlap of ~60% compared to a random baseline of ~36%.
      • Training data: Overlap of 26.7%-67.1% compared to random baselines of 14.6%-36.7%.
      • Seed, data, and RL algorithm simultaneously (stress test): Still notable overlaps (e.g., 59.1% for DPO subnetwork vs. PRIME subnetwork, compared to a 23.0% random baseline).
    • This suggests a partially transferable structure within the pretrained model that RL consistently leverages.
  5. Reasons for Sparsity:
    • Training on in-distribution data is a primary driver. When the model learns from data similar to its current policy's output distribution (common in on-policy RL or when SFT precedes RL on the same data), fewer parameter changes are needed.
      • SFT on in-distribution data (e.g., using rejection sampling) also produces sparse updates (e.g., ~90% sparsity for Qwen2.5-Math-7B with RFT).
      • Conversely, DPO on out-of-distribution data (no prior SFT on that data) leads to dense updates (e.g., ~7% sparsity for Zephyr-7b-Beta).
    • KL-divergence regularization and gradient clipping have limited impact on the overall update sparsity. Models trained without these still showed high sparsity.
    • SFT before RL is not the main cause; models like DeepSeek-R1-Zero skip SFT and still show high RL update sparsity.
    • Training duration: Sparsity tends to decrease with more training steps but appears to converge to a non-trivial level. For example, PRIME showed sparsity converging around 80%. DeepSeek-R1-Zero, despite 8K training steps, had 86% sparsity.
    • Some parameters (around 8% in PRIME experiments) outside the final subnetwork receive non-zero gradients during training that eventually cancel out.
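The quantities above (per-matrix update sparsity, update rank, and subnetwork overlap) can be measured directly from released checkpoints. The sketch below is illustrative rather than the authors' code: the $10^{-5}$ tolerance follows the paper, but the helper names, the intersection-over-union overlap metric, and the commented example model IDs are assumptions.

    # Illustrative sketch (not the authors' code) for measuring update sparsity,
    # update rank, and subnetwork overlap between two Hugging Face checkpoints.
    import torch
    from transformers import AutoModelForCausalLM

    def param_deltas(base_name, tuned_name):
        """Per-parameter update tensors (theta^1 - theta^0) between two checkpoints."""
        base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
        tuned = AutoModelForCausalLM.from_pretrained(tuned_name, torch_dtype=torch.bfloat16)
        tuned_params = dict(tuned.named_parameters())
        return {name: (tuned_params[name] - p).detach()
                for name, p in base.named_parameters()}

    def update_masks(deltas, tol=1e-5):
        """Boolean subnetwork mask: True where a parameter changed by more than tol."""
        return {name: d.abs() > tol for name, d in deltas.items()}

    def sparsity(masks):
        """Fraction of parameters left (effectively) unchanged."""
        changed = sum(m.sum().item() for m in masks.values())
        total = sum(m.numel() for m in masks.values())
        return 1.0 - changed / total

    def update_rank(deltas, name):
        """Numerical rank of a single 2D update matrix (near full rank per the paper)."""
        return torch.linalg.matrix_rank(deltas[name].float()).item()

    def mask_overlap(masks_a, masks_b):
        """Intersection-over-union of two subnetwork masks (one possible overlap metric)."""
        inter = sum((masks_a[n] & masks_b[n]).sum().item() for n in masks_a)
        union = sum((masks_a[n] | masks_b[n]).sum().item() for n in masks_a)
        return inter / union

    # Example usage (model IDs shown for illustration only):
    # deltas = param_deltas("allenai/Llama-3.1-Tulu-3-8B-SFT", "allenai/Llama-3.1-Tulu-3-8B-DPO")
    # masks = update_masks(deltas)
    # print(sparsity(masks))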

Practical Implications and Implementation:

  • Efficient RL Finetuning: The findings suggest potential for more efficient RL training. If the subnetwork can be identified early or predicted, training could focus only on these parameters, saving significant computational resources.
    # Conceptual sketch of subnetwork-only finetuning (PyTorch-style pseudocode;
    # load_pretrained_model, finetune_rl_full, and compute_rl_loss are placeholders)
    import torch

    initial_model = load_pretrained_model()
    full_rl_model = finetune_rl_full(initial_model, data, rl_hyperparams)

    # Identify the subnetwork mask: True where a parameter changed by more than
    # the bfloat16 tolerance of 1e-5 used in the paper
    full_params = dict(full_rl_model.named_parameters())
    masks = {name: (full_params[name] - p).abs() > 1e-5
             for name, p in initial_model.named_parameters()}

    # Retrain from the pretrained weights, updating only the subnetwork
    model_sub = load_pretrained_model()
    # weight_decay=0 so parameters outside the subnetwork stay exactly frozen
    optimizer = torch.optim.AdamW(model_sub.parameters(),
                                  lr=rl_hyperparams["lr"], weight_decay=0.0)
    for epoch in range(num_epochs):
        for batch in data_loader:
            loss = compute_rl_loss(model_sub, batch, rl_hyperparams)
            optimizer.zero_grad()
            loss.backward()
            for name, p in model_sub.named_parameters():
                p.grad.mul_(masks[name].to(p.grad.dtype))  # zero grads outside subnetwork
            optimizer.step()
  • Understanding Model Adaptation: The research provides insights into how LLMs adapt during RL. RL seems to find and refine specific "circuits" or pathways within the larger network.
  • Transferability: The consistency of subnetworks across different conditions suggests that knowledge about important parameters might be transferable, potentially speeding up finetuning for new, similar tasks or with different RL algorithms.
  • LoRA vs. Intrinsic Sparsity: While LoRA imposes low-rank updates, RL intrinsically finds sparse, full-rank updates. This suggests that current parameter-efficient finetuning (PEFT) methods like LoRA might not fully capture the natural optimization path of RL; future PEFT methods could try to identify and train these sparse, full-rank subnetworks (a toy comparison follows this list).
  • SFT vs. RL Parameter Updates: The stark difference (dense SFT updates vs. sparse RL updates) provides a quantifiable distinction in how these two popular finetuning paradigms alter the base model.
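To make the LoRA contrast concrete, here is a self-contained toy comparison (assumed dimensions and random data, not a measurement from the paper): a LoRA-style update is dense but low-rank, while a sparse elementwise update with a similar parameter budget is typically near full-rank.

    # Toy illustration (random matrices, assumed sizes): LoRA update vs. sparse update
    import torch

    d, r = 1024, 16                      # hidden size and LoRA rank (assumed values)

    # LoRA-style update: dense but rank <= r, with 2*d*r trainable parameters
    A, B = torch.randn(d, r), torch.randn(r, d)
    lora_update = A @ B
    print(torch.linalg.matrix_rank(lora_update).item())     # <= 16 (low rank)
    print((lora_update != 0).float().mean().item())          # ~1.0 (dense)

    # Sparse elementwise update with a similar number of non-zero entries
    density = (2 * d * r) / (d * d)                           # ~3% of entries
    keep = torch.rand(d, d) < density
    sparse_update = torch.randn(d, d) * keep
    print(torch.linalg.matrix_rank(sparse_update).item())    # typically close to 1024
    print((sparse_update != 0).float().mean().item())         # ~0.03 (sparse)

In the paper's measurements, RL updates behave like the second case: few entries change, yet the update spans nearly the full space of the parameter matrix.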

Implementation Details from Experiments:

  • Sparsity Calculation: Sparsity is defined as $1 - \frac{\text{number of non-zero elements in } (\theta^1 - \theta^0)}{n}$, where $\theta^0$ and $\theta^1$ are the parameters before and after finetuning and $n$ is the total number of parameters. A tolerance of $10^{-5}$ is used to decide whether a bfloat16 value counts as non-zero (a toy numeric check of this definition follows this list).
  • Models Analyzed: Publicly available checkpoints from Hugging Face for models like Tulu, Eurus, DeepSeek Math, and others finetuned with DPO, GRPO, ORPO, KTO, PPO, SimPO, PRIME.
  • Subnetwork Finetuning Setup:
    • DPO: Implemented with Open-Instruct, LLaMA-3.1-Tulu-3-8B-SFT base model, batch size 128, LR $5 \times 10^{-7}$, 1 epoch on allenai/llama-3.1-tulu-3-8b-preference-mixture.
    • PRIME: Implemented with verl, Qwen2.5-Math-7B base model (the paper's main text reports Eurus-2-7B-SFT, while the appendix lists Qwen2.5 for the PRIME hyperparameters), batch size 64, actor LR $5 \times 10^{-7}$, reward LR $1 \times 10^{-6}$, 15 epochs on GSM8K and MATH.
  • Consistency Experiments: Controlled ablations on Llama-3.1-Tulu-3-8B-SFT, varying seeds, using Tulu preference data vs. PRIME rollout data.
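As a quick sanity check of the sparsity definition above, a minimal toy example (made-up tensors, not real model weights; the $10^{-5}$ tolerance matches the paper's bfloat16 setting):

    # Toy check of the sparsity definition (not real checkpoints)
    import torch

    theta0 = torch.zeros(10, dtype=torch.bfloat16)
    theta1 = theta0.clone()
    theta1[:3] += 0.01                        # update 3 of the 10 parameters

    delta = (theta1 - theta0).float()
    num_updated = (delta.abs() > 1e-5).sum().item()
    sparsity = 1 - num_updated / delta.numel()
    print(sparsity)                           # 0.7 -> 70% of parameters unchanged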

The paper concludes that training on in-distribution data is a key reason for this sparsity, opening avenues for more efficient RLHF strategies that leverage this intrinsic property of LLM finetuning.
