Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in LLMs
The paper, "Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in LLMs" by Zhangyue Yin et al. from Fudan University, introduces a novel methodology aimed at augmenting the reasoning capabilities of LLMs through a hierarchical aggregation framework, termed Aggregation of Reasoning (AoR).
Introduction and Motivation
Recent advances in Chain-of-Thought (CoT) prompting have significantly improved LLMs' performance on complex reasoning tasks. Current approaches, such as Self-Consistency, typically generate multiple reasoning chains and select the most frequent answer via majority voting; this method falters when incorrect answers outnumber correct ones, as the small example below illustrates. The authors identify this limitation as a primary constraint on LLMs' reasoning potential, one that cannot be overcome by focusing on predicted answers alone.
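For concreteness, here is a minimal sketch of the majority-voting baseline the paper critiques. The sampled answers are made up for illustration; the point is that voting over final answers cannot recover the correct result once wrong answers dominate the samples.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Baseline answer selection: return the most frequent final answer
    among the sampled reasoning chains (majority voting)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: five sampled chains, three of which converge on
# the same wrong answer. Majority voting picks the wrong answer even
# though two chains reasoned their way to the correct one ("56").
sampled_answers = ["72", "72", "72", "56", "56"]
print(self_consistency_vote(sampled_answers))  # -> "72"
```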
Methodology
The paper proposes AoR, which addresses this shortcoming by evaluating the reasoning chains themselves rather than only their final answers. AoR employs a two-phase hierarchical approach (a code sketch follows the list):
- Local-Scoring Phase: Reasoning chains are grouped by the answer they produce, and chains within each group are scored for logical consistency, appropriateness of the solution method, completeness and clarity, and correct application of knowledge. Chains meeting or exceeding a predefined score threshold advance to the global evaluation phase.
- Global-Evaluation Phase: The top chains from different answer groups undergo a comprehensive assessment to identify the chain that optimally balances coherence and consistency between the reasoning process and its corresponding answer.
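The following Python sketch shows one plausible shape for the two phases. The functions `local_score_fn` and `global_score_fn` stand in for the LLM-judged scoring the paper describes (in practice they would be prompts to the model); their signatures, the score threshold, and `top_k` are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def aggregate_reasoning(chains, local_score_fn, global_score_fn,
                        threshold=7.0, top_k=2):
    """Two-phase hierarchical selection over (answer, reasoning) pairs.

    chains: list of (answer, reasoning_text) tuples sampled from the LLM.
    local_score_fn(reasoning, group): scores one chain within its answer
        group (logical consistency, method, clarity, knowledge use).
    global_score_fn(reasoning, answer): scores a surviving representative,
        judging coherence between the reasoning process and its answer.
    """
    # Local-scoring phase: group chains by their final answer, then keep
    # at most top_k chains per group that clear the score threshold.
    groups = defaultdict(list)
    for answer, reasoning in chains:
        groups[answer].append(reasoning)

    representatives = {}
    for answer, group in groups.items():
        scored = sorted(((local_score_fn(r, group), r) for r in group),
                        reverse=True)
        kept = [(s, r) for s, r in scored[:top_k] if s >= threshold]
        if kept:
            representatives[answer] = kept

    # Global-evaluation phase: compare the surviving representatives
    # across answer groups and return the answer whose best chain
    # receives the highest global score.
    best_answer, best_score = None, float("-inf")
    for answer, kept in representatives.items():
        for _, reasoning in kept:
            score = global_score_fn(reasoning, answer)
            if score > best_score:
                best_answer, best_score = answer, score
    return best_answer, best_score
```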
Furthermore, AoR employs dynamic sampling, adjusting the number of reasoning chains drawn per query according to task complexity, which improves both accuracy and computational efficiency.
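A rough sketch of how dynamic sampling might wrap the evaluation above follows. The stopping rule used here, a score margin between the leading answer and the runner-up, is an assumption chosen for illustration; the paper's exact criterion may differ.

```python
def dynamic_sampling(sample_fn, evaluate_fn, batch_size=5,
                     max_rounds=4, margin=1.0):
    """Draw reasoning chains in rounds, stopping early once the
    evaluation clearly separates the leading answer from the rest.

    sample_fn(n): returns n new (answer, reasoning) pairs from the LLM.
    evaluate_fn(chains): returns a dict mapping answer -> aggregate score
        (e.g., via the two-phase evaluation sketched above).
    """
    chains = []
    for _ in range(max_rounds):
        chains.extend(sample_fn(batch_size))
        scores = evaluate_fn(chains)
        ranked = sorted(scores.values(), reverse=True)
        # Stop when only one answer remains or one answer dominates.
        if len(ranked) == 1 or ranked[0] - ranked[1] >= margin:
            break
    best = max(scores, key=scores.get)
    return best, chains
```

In this sketch, easy queries terminate after the first batch of samples, while harder queries keep drawing chains up to `max_rounds`, mirroring the paper's observation that most queries are resolved in the first round.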
Experimental Results
Mathematical Reasoning: AoR was evaluated on six datasets: GSM8K, MultiArith, SingleEQ, SVAMP, AddSub, and AQuA. It outperformed existing methods such as Self-Consistency (SC), Complexity-based Consistency (CC), and DiVeRSe, most notably on AQuA, where AoR improved accuracy by 7.2% over SC.
Commonsense and Symbolic Reasoning: The framework was also tested on StrategyQA, CommonsenseQA, BoolQ, ARC-C, Date Understanding, Penguins, Colored Objects, and Object Counting. AoR achieved an average improvement of roughly 8.45% on the commonsense reasoning tasks and significant gains on the symbolic reasoning tasks.
Dynamic Sampling Efficiency: Analyses on the AQuA and GSM8K datasets show that dynamic sampling lets AoR resolve most queries after the first sampling round, reserving additional computation for the more complex ones.
Discussion
The paper emphasizes the importance of robust evaluation of reasoning processes over simple answer frequency. By performing thorough assessments of reasoning chains and dynamically adjusting sampling based on task complexity, AoR reduces computational costs while enhancing answer accuracy. This approach addresses the central limitation of majority-voting methods and improves the robustness and adaptability of LLMs across different reasoning tasks.
In further evaluations with diverse LLM backbones, including GPT-4, Claude-2, LLaMA-2-70B-Chat, and Mixtral-8x7B, AoR consistently outperformed traditional methods, demonstrating its flexibility and potential for broad application. Using a stronger model such as GPT-4 in the evaluation phases further improved performance, underscoring the importance of evaluator quality in the AoR framework.
Conclusion
AoR marks a significant step towards refining reasoning capabilities in LLMs by shifting the focus from answer frequency to in-depth reasoning chain evaluation. This hierarchical framework not only provides a comprehensive solution to the limitations of majority-voting mechanisms but also demonstrates versatility across various reasoning tasks and LLM architectures. Future work can explore further optimization of evaluation criteria and the integration of more sophisticated LLMs to achieve even higher performance benchmarks.