RankPrompt: Enhancing Reasoning Performance in LLMs by Comparative Evaluation
Introduction to RankPrompt
Large language models (LLMs) such as ChatGPT and GPT-4 have become remarkably capable across a wide range of reasoning tasks, yet they remain prone to logical fallacies and errors in their reasoning processes. To address this, we introduce RankPrompt, a novel prompting technique that markedly improves the reasoning accuracy of LLMs by having them compare and rank the diverse responses they generate themselves.
The Drawbacks of Existing Approaches
Existing approaches to improving LLM reasoning generally fall into two categories: training task-specific verifiers, or sampling multiple reasoning paths and selecting the most common answer through a voting mechanism. Task-specific verifiers are resource-intensive, requiring substantial human-annotated data, while voting mechanisms offer little interpretability and fail when the sampled responses are inconsistent and no clear majority emerges.
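For concreteness, the sketch below illustrates the voting baseline in Python; `sample_answer` is a hypothetical stand-in for a call to an LLM with a chain-of-thought prompt, and the tie-handling comment highlights the failure mode noted above.

```python
from collections import Counter
from typing import Callable, List

def majority_vote(question: str,
                  sample_answer: Callable[[str], str],
                  num_samples: int = 5) -> str:
    """Self-consistency-style baseline: sample several answers and keep
    the most frequent one."""
    answers: List[str] = [sample_answer(question) for _ in range(num_samples)]
    best, freq = Counter(answers).most_common(1)[0]
    # If every sampled answer differs (freq == 1), the vote carries no
    # useful signal -- the inconsistent-response case described above.
    return best
```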
RankPrompt: A Detailed Overview
RankPrompt distinguishes itself by instructing LLMs to self-rank their generated responses through a sequence of systematic comparisons. Crucially, it does not rely on external resources for generating these rankings. Instead, it breaks down the complex ranking problem into manageable comparative segments, enabling a step-wise evaluation of reasoning paths that LLMs inherently generate. This approach not only enhances the reasoning accuracy across arithmetic and commonsense reasoning tasks but also demonstrates robustness in the face of ordering and consistency variations among responses.
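As an illustration only, the following sketch shows one way such a comparison instruction could be assembled; the exact wording and comparison exemplars used by RankPrompt differ, and the "Best answer: (k)" convention here is simply an assumed output format.

```python
from typing import List

def build_comparison_prompt(question: str, candidates: List[str]) -> str:
    """Assemble an instruction that asks the model to compare candidate
    reasoning paths step by step and name the strongest one."""
    lines = [f"Question: {question}", "", "Candidate answers:"]
    for i, path in enumerate(candidates, start=1):
        lines.append(f"Answer ({i}): {path}")
    lines += [
        "",
        "Compare these answers step by step, noting where their reasoning "
        "diverges, and finish with the line 'Best answer: (k)' for the most "
        "plausible candidate.",
    ]
    return "\n".join(lines)
```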
Step-by-Step Comparative Evaluation
RankPrompt first generates multiple reasoning paths for a given question and then instructs the LLM to rank these paths by comparing their intermediate steps. This comparative evaluation is guided by explicit comparison instructions together with automatically generated comparison exemplars, greatly reducing the need for externally labeled data.
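A minimal end-to-end sketch of this generate-then-rank loop is given below, assuming a generic `llm(prompt, temperature)` callable and reusing the hypothetical `build_comparison_prompt` helper from the previous sketch; the sampling temperatures and parsing convention are illustrative rather than the exact settings RankPrompt uses.

```python
import re
from typing import Callable, List

def rank_and_select(question: str,
                    llm: Callable[[str, float], str],
                    num_paths: int = 4) -> str:
    """Generate-then-rank loop: sample diverse reasoning paths at a high
    temperature, then ask the model to compare them and pick the best."""
    cot_prompt = f"{question}\nLet's think step by step."
    paths: List[str] = [llm(cot_prompt, 0.8) for _ in range(num_paths)]

    # Reuses build_comparison_prompt from the sketch above.
    verdict = llm(build_comparison_prompt(question, paths), 0.0)

    match = re.search(r"Best answer:\s*\((\d+)\)", verdict)
    if match:
        idx = int(match.group(1)) - 1
        if 0 <= idx < len(paths):
            return paths[idx]
    return paths[0]  # fall back to the first path if the verdict is unparseable
```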
Impact on Reasoning Tasks
We evaluate RankPrompt on 11 distinct arithmetic and commonsense reasoning tasks, as well as on open-ended generation, where its rankings agree with human judgments 74% of the time on the AlpacaEval set. These results show that RankPrompt not only enhances the performance of LLMs across a broad array of reasoning scenarios but also has the potential to set new benchmarks for LLM-based automatic evaluation.
Theoretical and Practical Implications
RankPrompt's comparative evaluation mechanism highlights the inherent capability of LLMs to perform more nuanced and complex reasoning, suggesting a shift in how reasoning tasks can be approached. Practically, RankPrompt reduces the dependency on large-scale annotated datasets and complex model training, offering a more resource-efficient avenue for enhancing LLM reasoning capabilities.
Future Directions in AI and LLM Reasoning
The success of RankPrompt opens up numerous avenues for future research, particularly in exploring the limits of LLM reasoning without extensive external resources. It also suggests that similar prompting methods could improve other aspects of LLM performance, such as understanding and generating more complex narratives or tackling sophisticated problem-solving tasks beyond the current scope.
Concluding Remarks
RankPrompt marks a pivotal step towards realizing the full reasoning potential of LLMs. By emphasizing comparative evaluation and leveraging the models' inherent capabilities, RankPrompt not only surpasses existing methods for enhancing LLM reasoning but also provides a scalable, cost-effective solution that could shape future approaches to AI reasoning tasks.