Making Large Language Models Better Reasoners with Step-Aware Verifier (2206.02336v3)

Published 6 Jun 2022 in cs.CL and cs.AI

Abstract: Few-shot learning is a challenging task that requires LLMs to generalize from limited examples. LLMs like GPT-3 and PaLM have made impressive progress in this area, but they still face difficulties in reasoning tasks such as GSM8K, a benchmark for arithmetic problems. To improve their reasoning skills, previous work has proposed to guide the LLM with prompts that elicit a series of reasoning steps before giving the final answer, achieving a significant improvement on GSM8K from 17.9% to 58.1% in problem-solving rate. In this paper, we present DIVERSE (Diverse Verifier on Reasoning Step), a novel approach that further enhances the reasoning capability of LLMs. DIVERSE has three main components: first, it generates diverse prompts to explore different reasoning paths for the same question; second, it uses a verifier to filter out incorrect answers based on a weighted voting scheme; and third, it verifies each reasoning step individually instead of the whole chain. We evaluate DIVERSE on the latest LLM code-davinci-002 and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (e.g., GSM8K 74.4% to 83.2%).

Citations (159)

Summary

  • The paper introduces DiVeRSe, a framework that improves LLM reasoning by combining diverse prompts, a voting verifier, and step-aware verification.
  • The paper reports a significant GSM8K performance boost, with accuracy rising from 74.4% to 83.2% on code-davinci-002.
  • The paper’s method enables granular error detection and correction, offering a robust approach for advancing reasoning accuracy in large language models.

Enhancing Reasoning in LLMs with DiVeRSe

The paper by Li et al. addresses the challenges associated with improving the reasoning abilities of LLMs. Although models like GPT-3 and PaLM demonstrate strong capabilities across many NLP tasks, they often struggle with complex reasoning, particularly in domains such as arithmetic and commonsense reasoning.

Li et al. propose DiVeRSe, a framework that aims to bolster the reasoning capabilities of LLMs by leveraging diverse prompts, a voting verifier, and a step-aware verifier. They observe that while chain-of-thought prompting has shown promise, significant gaps remain that these three additional strategies can close.

Key Features of DiVeRSe

  1. Diverse Prompts: To counteract the limitations of a single static prompt, DiVeRSe generates multiple prompts for each question, for example by sampling different sets of exemplars. Exploring several reasoning paths per question expands the candidate solution space and mitigates the biases a single prompt might introduce.
  2. Voting Verifier: Drawing inspiration from self-consistency, which takes a majority vote across reasoning paths, DiVeRSe introduces a voting verifier. Each path's vote is weighted by the verifier's confidence that the path is correct, addressing cases where a plain majority vote is dominated by frequent but incorrect reasoning paths (see the sketch after this list).
  3. Step-Aware Verifier: A critical advancement in DiVeRSe is the verification of individual reasoning steps. Rather than judging only the final outcome, the step-aware verifier evaluates the correctness of each intermediate step. This granular assessment helps pinpoint where a reasoning path goes wrong, which is valuable both for diagnosis and for improving training.
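
To make the interplay concrete, below is a minimal sketch of how these components might compose at inference time. It assumes each sampled chain arrives with per-step correctness probabilities from the verifier; the min-based chain scoring, the function name, and the numbers are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

def diverse_vote(paths):
    """Score each reasoning chain, then take a verifier-weighted vote.

    `paths` is a list of (final_answer, step_scores) pairs, where
    step_scores are hypothetical per-step correctness probabilities
    from the trained verifier. Here a chain is scored by its weakest
    step (an illustrative choice: one bad step sinks the chain), and
    chain scores are summed per answer for the weighted vote.
    """
    totals = defaultdict(float)
    for answer, step_scores in paths:
        totals[answer] += min(step_scores)
    return max(totals, key=totals.get)

# Three chains (sampled under diverse prompts) reach "42", but each
# contains one dubious step; a single chain reaches "40" with every
# step judged sound. A plain majority vote would pick "42"; the
# verifier-weighted vote picks "40".
paths = [
    ("42", [0.90, 0.20, 0.80]),
    ("42", [0.70, 0.30, 0.90]),
    ("42", [0.80, 0.25, 0.85]),
    ("40", [0.95, 0.90, 0.92]),
]
print(diverse_vote(paths))  # -> 40
```

Note that fixing every chain score at 1.0 recovers self-consistency's plain majority vote, so the verifier weighting is precisely what separates DiVeRSe's aggregation from simple voting.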

Empirical Evaluation

The implementation of DiVeRSe across various reasoning benchmarks, including GSM8K, ASDiv, and others, achieved state-of-the-art performance on six of eight tasks. These tasks require different types of reasoning skills, validating the versatility of DiVeRSe. On the GSM8K benchmark, for instance, accuracy increased from 74.4% to 83.2%, an 8.8-point absolute gain over the prior state of the art. The results indicate that DiVeRSe consistently outperforms strong baselines such as self-consistency.

Implications and Future Directions

The implications of this research are multifaceted. Practically, DiVeRSe provides a framework for significantly improving LLM performance on reasoning tasks without the computational burden of scaling to ever-larger models such as PaLM. Theoretically, it opens avenues for integrating diverse prompting and stepwise verification into complex reasoning tasks in other domains.

Moving forward, the research could focus on automating the formulation of diverse prompts beyond random sampling, potentially leading to more intelligent diversification strategies. In addition, extending the framework to other reasoning-intensive applications, such as strategic planning in dynamic environments or scientific reasoning, could further enhance LLM utility. Moreover, addressing the inherent noise and inaccuracies in pseudolabeling processes could refine the training mechanisms for the verifier.
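
As a concrete illustration of that noise, here is a minimal sketch of a standard pseudo-labeling recipe for verifier training; the function and its signature are hypothetical, assuming labels come from final-answer matching:

```python
def pseudo_label(sampled_chains, gold_answer):
    """Label sampled chains for verifier training via answer matching.

    A chain is marked positive iff its final answer equals the gold
    answer. This is inherently noisy: a chain can reach the correct
    answer through flawed intermediate steps (a false positive), which
    is one motivation for step-level supervision.
    """
    return [(chain, 1 if answer == gold_answer else 0)
            for chain, answer in sampled_chains]
```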

In conclusion, the paper by Li et al. offers a substantial enhancement to the reasoning capabilities of LLMs through DiVeRSe, showing that reasoning performance can improve markedly without resorting to further model scaling. The framework's innovations hold promise for advancing the scope and depth of artificial intelligence in complex reasoning scenarios.
