- The paper demonstrates that modified GRPO-based RL fine-tuning significantly enhances reasoning accuracy in small LLMs using a targeted math dataset.
- It employs a balanced curriculum of problem difficulty and a composite reward system addressing accuracy, response length, and format.
- The approach achieves notable performance gains with minimal cost while highlighting challenges like optimization instability and token limitations.
Overview
The paper provides a systematic investigation into reinforcement learning (RL) fine-tuning for reasoning in small LLMs, exemplified by the 1.5B-parameter DeepSeek-R1-Distill-Qwen-1.5B model. The objective was to augment its reasoning capabilities in a computationally constrained setting using a variant of Group Relative Policy Optimization (GRPO). Training used a compact, carefully curated mathematical reasoning dataset, and the resulting improvements were validated on standard reasoning benchmarks. Numerical results indicate significant gains in reasoning accuracy alongside notable cost efficiencies.
Effective RL Strategies
Data Curation and Problem Mixture
A cornerstone of the approach is a high-quality, narrowly focused dataset tailored for mathematical reasoning. Data was selectively filtered from the s1 and DeepScaleR corpora to ensure appropriate problem difficulty and domain relevance. Mixing easier problems with more challenging ones provided stable learning dynamics: the balanced curriculum enabled early improvements while keeping the learning signal robust throughout training. A rough sketch of this mixing step is given below.
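The mixing step can be pictured roughly as follows. This is a minimal sketch under assumptions: the 50/50 easy/hard ratio, the placeholder problem lists, and the helper name `build_mixed_dataset` are illustrative and do not reproduce the paper's exact filtering criteria for s1 and DeepScaleR.

```python
import random

def build_mixed_dataset(easy_problems, hard_problems, target_size=7000,
                        easy_ratio=0.5, seed=0):
    """Interleave easier and harder problems into one training set.

    The 50/50 ratio here is an assumption; only the ~7,000-sample target
    comes from the summary above."""
    rng = random.Random(seed)
    n_easy = int(target_size * easy_ratio)
    mixed = (rng.sample(easy_problems, n_easy)
             + rng.sample(hard_problems, target_size - n_easy))
    rng.shuffle(mixed)  # avoid long runs of uniform difficulty within a batch
    return mixed

# Usage with placeholder problems standing in for the filtered corpora
easy = [f"easy problem {i}" for i in range(5000)]
hard = [f"hard problem {i}" for i in range(5000)]
train_set = build_mixed_dataset(easy, hard)
```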
Group Relative Policy Optimization (GRPO) Adaptation
The modified GRPO algorithm eschews the need for a separate critic network by computing baselines from aggregated group scores. This group-wise strategy not only economizes computational resources but also simplifies tuning. More specifically, the approach leverages a rule-based reward system which comprises three distinct elements:
- Accuracy Reward: A binary signal contingent on the correctness of the final answer (which must appear in an unambiguous `\boxed{}` representation).
- Cosine Reward: A length-aware scaling factor that biases outputs towards conciseness and precision.
- Format Reward: A structural incentive rewarding outputs that encapsulate reasoning within clearly delineated tags (typically `<think>` and `</think>`).
This composite reward structure was integral to fine-tuning the model’s behavior, leading to rapid increases in task-specific metrics.
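A minimal sketch of the three components is shown below. The exact coefficients, the cosine schedule, and the 0.5 format bonus are assumptions for illustration; only the binary accuracy signal, the length-aware scaling, and the `<think>` tag check reflect the description above.

```python
import math
import re

def accuracy_reward(output: str, target: str) -> float:
    """Binary reward: 1 if the \\boxed{...} answer matches the target."""
    match = re.search(r"\\boxed\{([^}]*)\}", output)
    return 1.0 if match and match.group(1).strip() == target.strip() else 0.0

def cosine_length_reward(output: str, correct: bool, max_len: int = 3584) -> float:
    """Length-aware scaling (assumed form): short correct answers score highest,
    while wrong answers are penalized less as their reasoning grows longer."""
    t = min(len(output.split()), max_len) / max_len  # normalized length in [0, 1]
    scale = 0.5 * (1 + math.cos(math.pi * t))        # 1 at t=0, 0 at t=1
    return scale if correct else -scale

def format_reward(output: str) -> float:
    """Bonus for reasoning wrapped in <think>...</think> tags (bonus value assumed)."""
    return 0.5 if re.search(r"<think>.*?</think>", output, re.DOTALL) else 0.0

def total_reward(output: str, target: str) -> float:
    acc = accuracy_reward(output, target)
    return acc + cosine_length_reward(output, correct=acc > 0) + format_reward(output)
```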
Numerical Gains and Cost Efficiency
Key numerical results underscore the practical benefits of the RL approach:
- AMC23 Accuracy: Increased from 63% to 80% post RL fine-tuning.
- AIME24 Performance: Achieved 46.7%, exceeding baseline benchmarks such as o1-preview.
The method attained these improvements using only 7,000 training samples with six outputs per sample, incurring an approximate training cost of $42—orders of magnitude lower than conventional methods requiring extensive computational budgets.
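A back-of-the-envelope check of the reported scale and cost is below. The GPU-hour rate is a placeholder assumption; only the sample count, outputs per sample, GPU count, and wall-clock time come from this summary.

```python
samples, outputs_per_sample = 7_000, 6
total_generations = samples * outputs_per_sample        # 42,000 completions

gpus, hours = 4, 24
gpu_hours = gpus * hours                                 # 96 GPU-hours
assumed_rate_per_gpu_hour = 0.44                         # placeholder rate ($/GPU-hour)
estimated_cost = gpu_hours * assumed_rate_per_gpu_hour   # ~$42 at this assumed rate
```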
Limitations and Challenges in Implementation
Optimization Instability
The fine-tuning process exhibited sensitivity to prolonged training, manifesting as optimization instability. After extended epochs, the model occasionally demonstrated performance degradation and unpredictability in its output sequences, including the emergence of multilingual artifacts. This suggests that sustained policy updates can lead to mode collapse or reward misalignment.
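One common guard against this kind of drift, not described in the paper but natural here, is to track the KL divergence between the current policy and a frozen reference and stop or roll back once it exceeds a budget. The sketch below assumes per-token logits from both models; the threshold value is a placeholder.

```python
import torch

def kl_to_reference(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Mean KL(policy || reference) over the sampled token positions."""
    p_logp = torch.log_softmax(policy_logits, dim=-1)
    r_logp = torch.log_softmax(ref_logits, dim=-1)
    return (p_logp.exp() * (p_logp - r_logp)).sum(dim=-1).mean()

def should_stop(kl_value: float, kl_budget: float = 0.1) -> bool:
    # kl_budget is an assumed threshold, tuned per run
    return kl_value > kl_budget
```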
Tokenization and Length Constraints
The training regimen imposed strict token limits (originally 4096 tokens, later reduced to 3584) which proved problematic for capturing complete reasoning paths in complex problem instances. The enforced truncation limits inhibit the model's capacity to articulate extended chains of reasoning, necessitating further innovations in dynamic token management or segmented reasoning frameworks.
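For concreteness, the hard completion budget can be enforced at rollout time roughly as follows (generic Hugging Face usage; the model identifier and 3584-token cap mirror the summary, while the prompt and sampling settings are illustrative).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Solve 2x + 3 = 11. Put the final answer in \\boxed{}.",
                   return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=3584,   # hard budget: longer reasoning chains get truncated here
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```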
Base Model Multilinguality
An additional challenge stems from the underlying multilingual pre-training of the model. Despite dedicated prompt engineering aimed at enforcing unilingual (English) outputs, latent multilingual properties sometimes cause inadvertent language drifts. This phenomenon underscores the need for more robust language control mechanisms during RL fine-tuning.
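A lightweight, purely illustrative guard against such drift (the paper relies on prompting rather than this specific check) is to flag completions whose non-ASCII character ratio exceeds a small threshold and penalize them during reward computation; note that mathematical notation may itself be non-ASCII and would need whitelisting in practice.

```python
def english_consistency_penalty(text: str, max_non_ascii_ratio: float = 0.05) -> float:
    """Return a negative penalty when the output appears to drift out of English.
    The 5% threshold is an assumption; math symbols may require a whitelist."""
    if not text:
        return 0.0
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return -1.0 if non_ascii / len(text) > max_non_ascii_ratio else 0.0
```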
Implementation Considerations
Computational Architecture
The experiments were conducted using a cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each) over a 24-hour training period. For practitioners intending to replicate or extend these results, a similar constrained environment should be considered. Techniques such as mixed precision training and gradient checkpointing can be leveraged to further optimize resource usage.
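The two memory-saving techniques mentioned above look roughly like this on a single training step. This is generic PyTorch/Transformers usage under assumptions, not the paper's actual training harness; the language-modeling loss stands in for the GRPO objective.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
).cuda()
model.gradient_checkpointing_enable()  # trade extra compute for activation memory
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def training_step(batch):
    # bfloat16 autocast: A40s support it, and no loss scaler is required
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss  # stand-in objective; the GRPO policy loss goes here
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```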
Pseudocode for Training Procedure
```
initialize model: DeepSeek-R1-Distill-Qwen-1.5B
initialize dataset: curated_math_reasoning_dataset
initialize GRPO parameters and reward function weights

for epoch in num_epochs:
    for batch in dataset:
        # Sample a group of completions per prompt (six per sample in the paper)
        outputs = model.generate(batch.prompts, num_generations=group_size)

        # Compute the composite reward for every completion
        acc_reward = compute_accuracy_reward(outputs, batch.targets)
        cosine_reward = compute_cosine_reward(outputs, desired_length=optimal_length)
        format_reward = compute_format_reward(outputs)
        total_reward = acc_reward + cosine_reward + format_reward

        # Group-relative policy update: the baseline comes from each group's scores
        group_scores = form_grouped_scores(outputs, total_reward)
        baseline = compute_baseline(group_scores)
        policy_loss = compute_policy_loss(outputs, group_scores, baseline)

        # Backpropagation and parameter update
        optimizer.zero_grad()
        policy_loss.backward()
        optimizer.step()

    evaluate_model(model, validation_set)
```
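The group-relative baseline in the loop above can be made concrete as the standard GRPO advantage: each completion's reward is normalized against the mean and standard deviation of its own group. A small NumPy sketch:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards has shape (num_prompts, num_generations); returns per-completion advantages."""
    baseline = rewards.mean(axis=1, keepdims=True)      # group mean as the baseline
    spread = rewards.std(axis=1, keepdims=True) + eps   # avoid division by zero
    return (rewards - baseline) / spread

# Example: one prompt with six sampled completions, three of them correct
adv = group_relative_advantages(np.array([[1.0, 0.0, 1.0, 0.0, 0.0, 1.0]]))
```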
Trade-Offs and Scaling Considerations
The design choices reflect critical trade-offs between performance gains and stability. While RL fine-tuning delivers rapid improvements in reasoning metrics under constrained conditions, challenges like optimization instability and token limitations impose limits on the scalability of this method. Advanced proposals for future work might involve:
- Hybrid approaches that incorporate supervised learning components periodically to recalibrate the model.
- Adaptive token budgeting strategies that allow longer responses for complex reasoning tasks.
- Enhanced reward formulations that mitigate language drift and promote output consistency.
Conclusion
The paper demonstrates that RL fine-tuning via a modified GRPO framework can enhance the reasoning abilities of small LLMs efficiently. Key factors contributing to these improvements include targeted data curation, a nuanced reward system, and the strategic mixing of problem difficulties. Despite robust numerical advances (e.g., AMC23 accuracy improvement from 63% to 80%), the approach encounters challenges such as optimization instability, inadequate token capacities for prolonged reasoning, and issues with language drift inherent to multilingual base models. These results prompt further research into robust RL algorithms and token management strategies to fully realize scalable, reasoning-capable LLMs within resource-limited environments.