
Learning to Reason via Self-Iterative Process Feedback for Small Language Models (2412.08393v1)

Published 11 Dec 2024 in cs.CL

Abstract: Small language models (SLMs) are more efficient, cost-effective, and customizable than LLMs, though they often underperform in specific areas like reasoning. Past methods for enhancing SLMs' reasoning, such as supervised fine-tuning and distillation, often depend on costly external signals, resulting in SLMs being overly confident with limited supervision signals, thus limiting their abilities. Therefore, this study enables SLMs to learn to reason from self-iterative feedback. By combining odds ratio preference optimization (ORPO), we fine-tune and align SLMs using positive and negative signals generated by themselves. Additionally, we introduce process supervision for rewards in preference alignment by sampling-based inference simulation and process reward models. Compared to Supervised Fine-Tuning (SFT), our method improves the performance of Gemma-2B by 12.43 (Acc) on GSM8K and 3.95 (Pass@1) on MBPP. Furthermore, the proposed method also demonstrated superior out-of-domain generalization capabilities on MMLU_Math and HumanEval.

Learning to Reason via Self-Iterative Process Feedback for Small Language Models

The paper "Learning to Reason via Self-Iterative Process Feedback for Small LLMs" presents an approach focused on enhancing the reasoning capabilities of Small LLMs (SLMs). It addresses a notable gap relative to LLMs, particularly in tasks necessitating complex reasoning processes. The authors introduce a method that leverages self-iterative feedback to fine-tune SLMs, an alternative to resource-intensive supervised fine-tuning and distillation.

Methodological Innovations

The paper's core contribution is the introduction of the Self-Iterative Process Feedback (SIPF) method. This approach is distinctive in its use of Odds Ratio Preference Optimization (ORPO) to align the models based on internally generated positive and negative reasoning samples. Instead of relying on binary feedback from final outcomes, SIPF utilizes a more granular process supervision strategy that samples and evaluates intermediate reasoning steps. This is achieved through a process reward model that assigns feedback at each reasoning stage, thereby potentially leading to more nuanced model improvements.
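To make the alignment step concrete, below is a minimal sketch of an ORPO-style objective of the kind SIPF builds on: a standard SFT loss on the self-generated positive trace plus an odds-ratio term that prefers it over the negative trace. The function signature, the weighting factor `lam`, and the clamping details are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, lam=0.1):
    """Sketch of an ORPO-style objective over a batch of preference pairs.

    logp_chosen / logp_rejected: length-normalised sequence log-probabilities
    of the self-generated positive and negative reasoning traces under the SLM
    (shape: [batch]); nll_chosen: the usual SFT negative log-likelihood on the
    positive trace; lam: weight of the odds-ratio preference term (assumed).
    """
    # Probabilities lie in (0, 1); clamp to keep the odds finite.
    p_c = torch.exp(logp_chosen).clamp(max=1.0 - 1e-6)
    p_r = torch.exp(logp_rejected).clamp(max=1.0 - 1e-6)

    # odds(y|x) = p / (1 - p), compared in log space for numerical stability.
    log_odds_chosen = logp_chosen - torch.log1p(-p_c)
    log_odds_rejected = logp_rejected - torch.log1p(-p_r)

    # Odds-ratio preference term: push the model toward the chosen trace.
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()

    # Total loss = SFT term on the positive trace + weighted preference term.
    return nll_chosen.mean() + lam * l_or
```

Because the SFT term and the preference term are optimized jointly, no separate reference model or reward model is needed at training time, which is what makes ORPO attractive for resource-constrained SLMs.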

Key Findings

The empirical results indicate substantial improvements on reasoning tasks for SLMs, most notably for Gemma-2B, which gains 12.43 points of accuracy on GSM8K and 3.95 points of Pass@1 on MBPP over supervised fine-tuning. These datasets test mathematical reasoning and code generation, respectively. Furthermore, the method demonstrates strong out-of-domain generalization, evidenced by improved performance on MMLU_Math and HumanEval.
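For reference, Pass@1 on code benchmarks such as MBPP and HumanEval is conventionally computed with the standard unbiased pass@k estimator (n sampled completions per problem, c of which pass the unit tests). The snippet below shows that widely used estimator, not code released with the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n generated samples, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 of 10 sampled programs pass the tests -> pass@1 = 0.2
print(pass_at_k(n=10, c=2, k=1))
```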

Analytical Depth

The authors examine the limitations of previous methods that rely heavily on outcome-based feedback, emphasizing SIPF's capacity to identify correct reasoning steps regardless of whether the final answer is correct. Constructing preference datasets from process-level feedback yields a richer learning signal and positions SLMs to handle reasoning tasks previously considered the domain of much larger models.
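As an illustration of how process-level rewards can be turned into preference data, the sketch below pairs sampled reasoning traces using per-step scores from a process reward model. Scoring a trace by its weakest step and the margin threshold are assumptions made here for illustration, not the paper's exact recipe.

```python
from typing import List, Tuple

def build_preference_pairs(
    question: str,
    traces: List[List[str]],          # sampled step-by-step reasoning traces
    step_scores: List[List[float]],   # per-step scores from a process reward model
    margin: float = 0.2,              # assumed minimum score gap between a pair
) -> List[Tuple[str, str, str]]:
    """Illustrative construction of (prompt, chosen, rejected) pairs from
    process-level rewards rather than final-answer correctness alone."""
    # Score each trace by its weakest step: one bad step should sink a trace
    # even if the final answer happens to be right.
    scored = [("\n".join(t), min(s)) for t, s in zip(traces, step_scores) if s]
    scored.sort(key=lambda x: x[1], reverse=True)

    pairs = []
    half = len(scored) // 2
    # Pair high-scoring traces with low-scoring ones when the gap is clear.
    for good_text, good_score in scored[:half]:
        for bad_text, bad_score in reversed(scored[half:]):
            if good_score - bad_score >= margin:
                pairs.append((question, good_text, bad_text))
                break
    return pairs
```

The resulting (prompt, chosen, rejected) triples can then feed directly into a preference objective such as the ORPO-style loss sketched above.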

Implications and Future Research

From a practical standpoint, the findings suggest the feasibility of deploying resource-efficient SLMs for reasoning-intensive applications, reducing reliance on the costly infrastructure typically associated with LLMs. Theoretical implications involve advancing our understanding of alignment processes in model training, particularly with self-generated data.

The approach sets a foundation for further exploration into expanded domains and diverse reasoning tasks. Additionally, the self-iterative nature of SIPF could inspire similar methodologies for other aspects of NLP performance beyond reasoning, such as creativity or contextual understanding in smaller models.

Conclusion

This paper contributes significantly to the ongoing discourse on optimizing language models, particularly by illustrating a path toward bringing sophisticated reasoning capabilities to SLMs. By underscoring the value of process feedback and iterative learning, it broadens the potential for SLMs to operate efficiently across a wider array of applications and invites future work to refine and extend these promising preliminary results.

Authors (3)
  1. Kaiyuan Chen (26 papers)
  2. Jin Wang (356 papers)
  3. Xuejie Zhang (23 papers)