GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements (2402.10963v2)

Published 13 Feb 2024 in cs.CL and cs.LG

Abstract: State-of-the-art LLMs can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify when and where to refine without access to external feedback. Outcome-based Reward Models (ORMs), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (PRMs), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (SORMs) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53% to 65% when greedily sampled.

Citations (22)

Summary

The paper decomposes the refinement process into three stages and introduces SORMs for accurate intermediate-step evaluation.
It demonstrates that blending global revisions with targeted local corrections significantly improves performance on reasoning tasks.
The approach enables LLM self-improvement without external feedback, offering practical benefits for applications like tutoring and code assistance.

Overview of "GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements"

This paper introduces a novel approach to enhance the reasoning capabilities of LLMs through systematic refinements. Reasoning tasks in LLMs often involve an intricate process of evaluating solutions, identifying errors, and implementing corrections to improve accuracy. The authors present a structured methodology that employs both global and local refinements to iteratively improve the performance of LLMs on reasoning tasks without relying on external feedback.

Core Contributions

Decomposition of the Refinement Process: The paper breaks down the refinement problem into three discrete stages: deciding when to refine, identifying where to refine, and executing how to refine. This decomposition allows for a targeted approach in addressing the innate weaknesses of LLM reasoning capabilities.
Introduction of Stepwise Outcome-based Reward Models (SORMs): A central innovation in this work is the development of SORMs. These models are trained exclusively on synthetic data to predict, for each intermediate step, whether the final answer will be correct when the current policy is sampled many times from that step, rather than only once as for a traditional Outcome-based Reward Model (ORM). This approximates the expected future reward of the optimal policy, $V^{\star}$, and lets SORMs detect incorrect reasoning steps more accurately than ORMs.
Refinement Models: The authors distinguish between global refinement models, which take only the question and a draft solution as input and predict a corrected solution, and local refinement models, which additionally take a critique indicating the location of the first reasoning error. The paper demonstrates that combining these two refinement strategies, with the ORM used as a reranker, substantially improves accuracy over either strategy alone.
Quantitative Improvements: Combining global and local refinements, with the ORM as a reranker, significantly outperforms either refinement type individually as well as a best-of-three sampling baseline. Specifically, the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on the GSM8K benchmark rises from 53% to 65% when greedily sampled.
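
The when, where, and how stages listed above can be sketched as a small inference loop. This is a minimal illustration, not the authors' implementation: the scoring and refinement callables (`orm_score`, `sorm_score`, `global_refine`, `local_refine`), the 0.5 threshold, and the choice to keep the original draft among the reranked candidates are all assumptions made for the example.

```python
from typing import Callable, List, Optional

Steps = List[str]  # a solution is a list of reasoning steps


def first_error_index(
    question: str,
    steps: Steps,
    sorm_score: Callable[[str, Steps], float],
    threshold: float = 0.5,
) -> Optional[int]:
    """WHERE: index of the first step whose prefix scores below threshold."""
    for i in range(len(steps)):
        if sorm_score(question, steps[: i + 1]) < threshold:
            return i
    return None


def refine(
    question: str,
    draft: Steps,
    orm_score: Callable[[str, Steps], float],
    sorm_score: Callable[[str, Steps], float],
    global_refine: Callable[[str, Steps], Steps],
    local_refine: Callable[[str, Steps, int], Steps],
    threshold: float = 0.5,
) -> Steps:
    # WHEN: leave the draft alone if the ORM already predicts a correct answer.
    if orm_score(question, draft) >= threshold:
        return draft

    # HOW (global): rewrite from the question and the draft only.
    candidates = [draft, global_refine(question, draft)]

    # HOW (local): additionally pass the location of the first flagged step.
    bad = first_error_index(question, draft, sorm_score, threshold)
    if bad is not None:
        candidates.append(local_refine(question, draft, bad))

    # Rerank all candidates with the ORM and return the highest-scoring one.
    return max(candidates, key=lambda c: orm_score(question, c))
```

In practice each callable would wrap a trained model; here they are plain functions so the control flow can be exercised with toy stand-ins.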

Theoretical and Practical Implications

By introducing SORMs, the authors enrich the toolkit available for LLM error detection in reasoning tasks. This advancement has theoretical implications for model-based RL settings, suggesting avenues for incorporating dynamic feedback and iterative learning without the cost of extensive human annotation. Practically, these models can improve user-facing applications like automated tutoring systems or interactive coding environments, where precise and detailed reasoning feedback is crucial.

The training pipeline described in the paper, including the systematic generation of synthetic data for both the SORM and the refinement models, offers a replicable recipe for applying these concepts to other reasoning scenarios. The paper shows that fine-grained, step-level feedback can substantially improve LLM reasoning performance, and it argues for broader consideration of locally focused, stepwise learning paradigms in AI research.
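
As an illustration of the synthetic data generation, the sketch below labels every prefix of a sampled solution by rolling out the current policy several times from that prefix. It rests on assumptions not stated in this summary: the `rollout` callable (which returns a final answer string), the default of `k=8` samples, and the rule that a prefix is labeled positive when at least one rollout reaches the reference answer.

```python
from typing import Callable, List, Tuple


def sorm_labels(
    question: str,
    steps: List[str],
    gold_answer: str,
    rollout: Callable[[str, List[str]], str],
    k: int = 8,
) -> List[Tuple[List[str], int]]:
    """Return (prefix, label) pairs, one per step of the solution.

    Assumed labeling rule: label is 1 if any of k continuations sampled
    from the prefix ends in gold_answer, otherwise 0.
    """
    examples = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        solved = any(rollout(question, prefix) == gold_answer for _ in range(k))
        examples.append((prefix, int(solved)))
    return examples
```

Under this rule, the first prefix labeled 0 marks the first step from which the policy could no longer recover, which is the kind of location a local refinement critique would point to.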

Future Directions

The research presents promising future directions, including refining the capability of SORMs to better mimic human-like reasoning and adapting value-based refinement strategies to account for model-specific processing capacities. Another notable direction involves expanding the refinement frameworks to include additional feedback sources or auxiliary tasks, potentially leading to more autonomous and intelligent systems capable of self-improvement over time.

In conclusion, this paper establishes a foundational framework for reasoning refinement in LLMs, providing both a theoretical and practical lens through which to view enhancement strategies. The successful combination of global and local refinement models supported by thorough quantitative evaluations underscores the value and untapped potential of these methods in advancing artificial intelligence.

Tweets

https://twitter.com/Dahoas1/status/1815026236592185584

https://twitter.com/robertarail/status/1766426473840189791

https://twitter.com/IntuitMachine/status/1760978054271357439

https://twitter.com/_akhaliq/status/1759808993092882872

https://twitter.com/Dahoas1/status/1760021663708787026

https://twitter.com/AGI_Odyssey/status/1759987193513185696