Learning to Reason via Self-Iterative Process Feedback for Small LLMs
The paper "Learning to Reason via Self-Iterative Process Feedback for Small LLMs" presents an approach focused on enhancing the reasoning capabilities of Small LLMs (SLMs). It addresses a notable gap relative to LLMs, particularly in tasks necessitating complex reasoning processes. The authors introduce a method that leverages self-iterative feedback to fine-tune SLMs, an alternative to resource-intensive supervised fine-tuning and distillation.
Methodological Innovations
The paper's core contribution is the introduction of the Self-Iterative Process Feedback (SIPF) method. This approach is distinctive in its use of Odds Ratio Preference Optimization (ORPO) to align the models based on internally generated positive and negative reasoning samples. Instead of relying on binary feedback from final outcomes, SIPF utilizes a more granular process supervision strategy that samples and evaluates intermediate reasoning steps. This is achieved through a process reward model that assigns feedback at each reasoning stage, thereby potentially leading to more nuanced model improvements.
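To make the alignment step concrete, the sketch below shows how an ORPO-style objective could be computed over preferred and dispreferred reasoning traces. It is a minimal illustration rather than the authors' implementation: it assumes length-normalized sequence log-probabilities as inputs and a single weighting coefficient `beta`, both of which are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, beta=0.1):
    """Illustrative ORPO-style objective (a sketch, not the paper's code).

    chosen_logps / rejected_logps: length-normalized (mean per-token)
    log-probabilities of the preferred and dispreferred reasoning traces
    under the current policy, one value per example.
    """
    # odds(y|x) = p / (1 - p), computed in log space from the mean log-probability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # odds-ratio term: reward the model when the preferred trace has higher odds
    or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected).mean()
    # standard negative log-likelihood on the preferred trace preserves generation quality
    sft_loss = -chosen_logps.mean()
    return sft_loss + beta * or_loss
```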
Key Findings
The empirical results indicate substantial improvements in reasoning tasks for SLMs, notably with the Gemma-2B model, which saw a 12.43% accuracy improvement on GSM8K and a 3.95% increase in Pass@1 on MBPP. These datasets test mathematical reasoning and code generation, respectively. The method also demonstrates strong out-of-domain adaptability, evidenced by improved performance on MMLU_Math and HumanEval.
Analytical Depth
The authors examine the limitations of previous methods that rely heavily on outcome-based feedback, emphasizing SIPF's capacity to identify correct reasoning steps regardless of whether the final answer is correct. Constructing preference datasets from process feedback yields a richer learning signal and positions SLMs to handle reasoning tasks previously thought to be the preserve of LLMs.
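As a rough illustration of this idea, the sketch below labels each intermediate step of a sampled trace with a process-reward score and splits the steps into positives and negatives independently of the final answer. The `ReasoningTrace` structure, the score scale in [0, 1], and the fixed threshold are assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ReasoningTrace:
    question: str
    steps: List[str]          # sampled intermediate reasoning steps
    step_scores: List[float]  # per-step scores from a process reward model (assumed in [0, 1])

def split_steps_by_reward(trace: ReasoningTrace,
                          threshold: float = 0.5) -> Tuple[list, list]:
    """Turn a single trace into step-level positives and negatives.

    A step scoring above the threshold is kept as a positive even if the
    trace's final answer is wrong, since the judgement is made per step
    rather than per outcome; low-scoring steps become negatives. Pairing
    positives with negatives that share a context then yields the kind of
    preference data usable for ORPO-style alignment.
    """
    positives, negatives = [], []
    context = trace.question
    for step, score in zip(trace.steps, trace.step_scores):
        example = (context, step)            # (reasoning context so far, candidate step)
        (positives if score >= threshold else negatives).append(example)
        context = context + "\n" + step      # extend the context for the next step
    return positives, negatives
```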
Implications and Future Research
From a practical standpoint, the findings suggest the feasibility of deploying resource-efficient SLMs for reasoning-intensive applications, reducing reliance on the costly infrastructure typically associated with LLMs. Theoretical implications involve advancing our understanding of alignment processes in model training, particularly with self-generated data.
The approach sets a foundation for further exploration into expanded domains and diverse reasoning tasks. Additionally, the self-iterative nature of SIPF could inspire similar methodologies for other aspects of NLP performance beyond reasoning, such as creativity or contextual understanding in smaller models.
Conclusion
This paper contributes to the ongoing discourse on optimizing language models, particularly by illustrating a path toward bringing sophisticated reasoning capabilities to SLMs. By underscoring the value of process feedback and iterative learning, it broadens the potential for SLMs to operate efficiently across a wider array of applications, inviting future work to refine and expand upon these promising initial results.