
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't (2503.16219v1)

Published 20 Mar 2025 in cs.LG and cs.CL

Abstract: Enhancing the reasoning capabilities of LLMs typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.

Summary

  • The paper demonstrates that modified GRPO-based RL fine-tuning significantly enhances reasoning accuracy in small LLMs using a targeted math dataset.
  • It employs a balanced curriculum of problem difficulty and a composite reward system addressing accuracy, response length, and format.
  • The approach achieves notable performance gains with minimal cost while highlighting challenges like optimization instability and token limitations.

Overview

The paper provides a systematic investigation into reinforcement learning (RL) fine-tuning for reasoning in small LLMs, exemplified by the 1.5B parameter DeepSeek-R1-Distill-Qwen-1.5B model. The objective was to augment its reasoning capabilities in a computationally constrained setting using a variant of Group Relative Policy Optimization (GRPO). Performance improvements were validated on standard reasoning benchmarks using a compact, carefully curated mathematical reasoning dataset. Numerical results indicate significant improvements in reasoning accuracy accompanied by notable cost efficiencies.

Effective RL Strategies

Data Curation and Problem Mixture

A cornerstone of the approach is a high-quality, narrowly focused dataset tailored for mathematical reasoning. Data was selectively filtered from the s1 and DeepScaleR corpora for appropriate difficulty and domain relevance. Mixing easier problems with more challenging ones provided stable learning dynamics: the balanced curriculum enabled early gains while keeping the learning signal stable as training progressed. A minimal sketch of such a mixing scheme is given below.
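
As an illustration only, the following sketch shows one way such a difficulty-balanced mixture could be assembled. The difficulty labels, the mixing ratio, and the helper itself are assumptions made for this example; they are not taken from the paper's released preprocessing code.

import random

def build_curriculum(problems, easy_fraction=0.3, seed=0):
    """Blend easy and hard problems into one shuffled training pool.

    `problems` is a list of dicts with assumed keys 'question',
    'answer', and 'difficulty' in {'easy', 'hard'}.
    """
    rng = random.Random(seed)
    easy = [p for p in problems if p["difficulty"] == "easy"]
    hard = [p for p in problems if p["difficulty"] == "hard"]

    # Keep a fixed share of easy problems to stabilize early training,
    # then fill the remainder of the pool with harder ones.
    n_easy = int(easy_fraction * len(problems))
    pool = rng.sample(easy, min(n_easy, len(easy))) + hard
    rng.shuffle(pool)
    return pool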

Group Relative Policy Optimization (GRPO) Adaptation

The modified GRPO algorithm eschews the need for a separate critic network by computing baselines from aggregated group scores. This group-wise strategy not only economizes computational resources but also simplifies tuning. More specifically, the approach leverages a rule-based reward system which comprises three distinct elements:

  • Accuracy Reward: A binary signal contingent on the correctness of the final answer (formatted with an unambiguous \boxed{\cdot} representation).
  • Cosine Reward: A scaling factor linked to response length, imposing a bias towards more concise and precise outputs.
  • Format Reward: A structural incentive rewarding outputs that encapsulate the reasoning trace within clearly delineated tags (e.g., <think> ... </think>).

This composite reward structure was integral to fine-tuning the model’s behavior, leading to rapid increases in task-specific metrics.
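
As a rough illustration of how these three signals could be combined, the sketch below implements a simplified version in Python. The regular expressions, reward magnitudes, and the exact cosine schedule are assumptions for this example and do not reproduce the paper's released reward code.

import math
import re

def accuracy_reward(completion, gold_answer):
    # Binary reward: 1.0 if the \boxed{...} answer matches the reference, else 0.0.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def cosine_reward(completion, correct, max_len=3584):
    # Length-aware scaling (illustrative schedule): correct answers score higher
    # when shorter; incorrect short answers are penalized most, nudging the
    # model to keep reasoning rather than guess early.
    frac = min(len(completion.split()), max_len) / max_len
    scale = 0.5 * (1 + math.cos(math.pi * frac))   # 1 at length 0, 0 at max_len
    return scale if correct else -0.5 * scale

def format_reward(completion):
    # Structural reward for wrapping the reasoning trace in the expected tags.
    return 0.5 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def composite_reward(completion, gold_answer):
    acc = accuracy_reward(completion, gold_answer)
    return acc + cosine_reward(completion, correct=acc > 0) + format_reward(completion)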

Numerical Gains and Cost Efficiency

Key numerical results underscore the practical benefits of the RL approach:

  • AMC23 Accuracy: Increased from 63% to 80% post RL fine-tuning.
  • AIME24 Performance: Achieved 46.7%, exceeding baseline benchmarks such as o1-preview.

The method attained these improvements using only 7,000 training samples with six outputs per sample, incurring an approximate training cost of $42—orders of magnitude lower than conventional methods requiring extensive computational budgets.
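
For context, the stated hardware budget amounts to 4 GPUs × 24 hours = 96 A40 GPU-hours; assuming the quoted $42 covers that full allocation, it works out to roughly $0.44 per GPU-hour.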

Limitations and Challenges in Implementation

Optimization Instability

The fine-tuning process was sensitive to prolonged training, manifesting as optimization instability. After extended training, the model occasionally showed performance degradation and unpredictable output sequences, including multilingual artifacts. This suggests that sustained policy updates can lead to mode collapse or reward misalignment.

Tokenization and Length Constraints

The training regimen imposed strict completion-length limits (originally 4096 tokens, later reduced to 3584), which proved problematic for capturing complete reasoning paths on harder problem instances. The enforced truncation inhibits the model's capacity to articulate extended chains of reasoning, motivating further work on dynamic token budgeting or segmented reasoning frameworks.
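
The paper does not publish the corresponding snippet; as a generic illustration, a hard completion-length cap of this kind is typically enforced at sampling time, for example via the Hugging Face generate API. The prompt and sampling settings below are placeholders.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: enforcing a hard completion-length cap at sampling time.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

prompt = ("Solve: what is the sum of the first 100 positive integers? "
          "Put the final answer in \\boxed{}.")
inputs = tokenizer(prompt, return_tensors="pt")

# max_new_tokens truncates the chain of thought at 3584 generated tokens,
# mirroring the reduced budget described above.
outputs = model.generate(**inputs, max_new_tokens=3584, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))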

Base Model Multilinguality

An additional challenge stems from the model's multilingual pre-training. Despite dedicated prompt engineering aimed at enforcing English-only outputs, latent multilingual behavior sometimes caused inadvertent language drift during generation. This underscores the need for more robust language control mechanisms during RL fine-tuning.

Implementation Considerations

Computational Architecture

The experiments were conducted using a cluster of 4 NVIDIA A40 GPUs (48 GB VRAM each) over a 24-hour training period. For practitioners intending to replicate or extend these results, a similar constrained environment should be considered. Techniques such as mixed precision training and gradient checkpointing can be leveraged to further optimize resource usage.
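
A minimal configuration sketch of these memory-saving options, assuming the Hugging Face Trainer stack, is shown below. The specific values are illustrative assumptions, not the paper's published settings.

from transformers import TrainingArguments

# Illustrative settings for a memory-constrained 4 x A40 setup;
# these values are assumptions, not the paper's actual configuration.
training_args = TrainingArguments(
    output_dir="grpo-open-rs",           # hypothetical output directory
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,       # recover a larger effective batch size
    bf16=True,                           # mixed precision to cut activation memory
    gradient_checkpointing=True,         # trade extra compute for lower memory
    learning_rate=1e-6,
    num_train_epochs=1,
    logging_steps=10,
)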

Pseudocode for Training Procedure

initialize model: DeepSeek-R1-Distill-Qwen-1.5B
initialize dataset: curated_math_reasoning_dataset
initialize GRPO hyperparameters and composite-reward weights

for epoch in num_epochs:
    for batch in dataset:
        batch_loss = 0
        for prompt in batch:
            # Sample a group of G candidate completions per prompt (G = 6 in the paper)
            completions = model.generate(prompt, num_samples=G)

            # Score every candidate with the rule-based composite reward
            rewards = []
            for completion in completions:
                acc_reward = compute_accuracy_reward(completion, prompt.answer)   # binary \boxed{...} check
                cosine_reward = compute_cosine_reward(completion)                 # length-aware scaling
                format_reward = compute_format_reward(completion)                 # reasoning inside the expected tags
                rewards.append(acc_reward + cosine_reward + format_reward)

            # Group-relative advantage: the group mean replaces a learned critic
            baseline = mean(rewards)
            advantages = [(r - baseline) / (std(rewards) + eps) for r in rewards]

            batch_loss += compute_policy_loss(completions, advantages)  # clipped surrogate plus KL penalty

        # Backpropagation and parameter update
        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()

    evaluate_model(model, validation_set)
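
Training loops along these lines are also available in open-source RL libraries; for instance, the trl library exposes a GRPOTrainer that wires a prompt dataset and rule-based reward functions together. The sketch below is a generic illustration of that API under current trl conventions, not the paper's actual training script; the dataset path, reward function, and hyperparameter values are placeholders.

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset ID; substitute the curated math-reasoning set.
# The dataset is assumed to have a "prompt" column and an "answer" column.
dataset = load_dataset("path/to/curated_math_reasoning_dataset", split="train")

def boxed_accuracy_reward(completions, answer, **kwargs):
    # Illustrative rule-based reward: 1.0 if the gold answer appears in a \boxed{...} span.
    # "answer" is the assumed dataset column, passed through by the trainer.
    return [1.0 if f"\\boxed{{{a}}}" in c else 0.0 for c, a in zip(completions, answer)]

config = GRPOConfig(
    output_dir="grpo-open-rs",        # hypothetical
    num_generations=6,                # group size, matching the 6 outputs per sample above
    max_completion_length=3584,       # token budget discussed earlier
    per_device_train_batch_size=1,
    bf16=True,
    gradient_checkpointing=True,
)

trainer = GRPOTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    reward_funcs=[boxed_accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()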

Trade-Offs and Scaling Considerations

The design choices reflect critical trade-offs between performance gains and stability. While RL fine-tuning delivers rapid improvements in reasoning metrics under constrained conditions, challenges like optimization instability and token limitations impose limits on the scalability of this method. Advanced proposals for future work might involve:

  • Hybrid approaches that incorporate supervised learning components periodically to recalibrate the model.
  • Adaptive token budgeting strategies that allow longer responses for complex reasoning tasks.
  • Enhanced reward formulations that mitigate language drift and promote output consistency.

Conclusion

The paper demonstrates that RL fine-tuning via a modified GRPO framework can enhance the reasoning abilities of small LLMs efficiently. Key factors contributing to these improvements include targeted data curation, a nuanced reward system, and the strategic mixing of problem difficulties. Despite robust numerical advances (e.g., AMC23 accuracy improvement from 63% to 80%), the approach encounters challenges such as optimization instability, inadequate token capacities for prolonged reasoning, and issues with language drift inherent to multilingual base models. These results prompt further research into robust RL algorithms and token management strategies to fully realize scalable, reasoning-capable LLMs within resource-limited environments.