Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in LLMs
The paper, "Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in LLMs" by Zhangyue Yin et al. from Fudan University, introduces a novel methodology aimed at augmenting the reasoning capabilities of LLMs through a hierarchical aggregation framework, termed Aggregation of Reasoning (AoR).
Introduction and Motivation
Recent advances in Chain-of-Thought (CoT) prompting have significantly improved LLMs' performance on complex reasoning tasks. Current approaches, such as Self-Consistency, typically generate multiple reasoning chains and select the most frequent answer via majority voting; this method falters when incorrect answers outnumber correct ones, as the small example below illustrates. The authors identify this limitation as a primary constraint on LLMs' reasoning potential, one that cannot be overcome by focusing on predicted answers alone.
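For concreteness, here is a minimal sketch of the majority-voting baseline the paper critiques. The sampled answers are made up for illustration; the point is that voting over final answers cannot recover the correct result once wrong answers dominate the samples.

```python
from collections import Counter

def self_consistency_vote(answers):
    """Baseline answer selection: return the most frequent final answer
    among the sampled reasoning chains (majority voting)."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: five sampled chains, three of which converge on
# the same wrong answer. Majority voting picks the wrong answer even
# though two chains reasoned their way to the correct one ("56").
sampled_answers = ["72", "72", "72", "56", "56"]
print(self_consistency_vote(sampled_answers))  # -> "72"
```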
Methodology
The paper proposes AoR, which addresses this shortcoming by evaluating the reasoning chains themselves rather than only their final answers. AoR employs a two-phase hierarchical approach (a code sketch follows the list):
- Local-Scoring Phase: Reasoning chains are grouped by the answer they produce, and chains within each group are scored for logical consistency, appropriateness of the solution method, completeness and clarity, and correct application of knowledge. Chains meeting or exceeding a predefined score threshold advance to the global evaluation phase.
- Global-Evaluation Phase: The top chains from different answer groups undergo a comprehensive assessment to identify the chain that optimally balances coherence and consistency between the reasoning process and its corresponding answer.
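The following Python sketch shows one plausible shape for the two phases. The functions `local_score_fn` and `global_score_fn` stand in for the LLM-judged scoring the paper describes (in practice they would be prompts to the model); their signatures, the score threshold, and `top_k` are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def aggregate_reasoning(chains, local_score_fn, global_score_fn,
                        threshold=7.0, top_k=2):
    """Two-phase hierarchical selection over (answer, reasoning) pairs.

    chains: list of (answer, reasoning_text) tuples sampled from the LLM.
    local_score_fn(reasoning, group): scores one chain within its answer
        group (logical consistency, method, clarity, knowledge use).
    global_score_fn(reasoning, answer): scores a surviving representative,
        judging coherence between the reasoning process and its answer.
    """
    # Local-scoring phase: group chains by their final answer, then keep
    # at most top_k chains per group that clear the score threshold.
    groups = defaultdict(list)
    for answer, reasoning in chains:
        groups[answer].append(reasoning)

    representatives = {}
    for answer, group in groups.items():
        scored = sorted(((local_score_fn(r, group), r) for r in group),
                        reverse=True)
        kept = [(s, r) for s, r in scored[:top_k] if s >= threshold]
        if kept:
            representatives[answer] = kept

    # Global-evaluation phase: compare the surviving representatives
    # across answer groups and return the answer whose best chain
    # receives the highest global score.
    best_answer, best_score = None, float("-inf")
    for answer, kept in representatives.items():
        for _, reasoning in kept:
            score = global_score_fn(reasoning, answer)
            if score > best_score:
                best_answer, best_score = answer, score
    return best_answer, best_score
```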
Furthermore, AoR employs dynamic sampling, adjusting the number of reasoning chains drawn per query according to task complexity, which improves both accuracy and computational efficiency.
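A rough sketch of how dynamic sampling might wrap the evaluation above follows. The stopping rule used here, a score margin between the leading answer and the runner-up, is an assumption chosen for illustration; the paper's exact criterion may differ.

```python
def dynamic_sampling(sample_fn, evaluate_fn, batch_size=5,
                     max_rounds=4, margin=1.0):
    """Draw reasoning chains in rounds, stopping early once the
    evaluation clearly separates the leading answer from the rest.

    sample_fn(n): returns n new (answer, reasoning) pairs from the LLM.
    evaluate_fn(chains): returns a dict mapping answer -> aggregate score
        (e.g., via the two-phase evaluation sketched above).
    """
    chains = []
    for _ in range(max_rounds):
        chains.extend(sample_fn(batch_size))
        scores = evaluate_fn(chains)
        ranked = sorted(scores.values(), reverse=True)
        # Stop when only one answer remains or one answer dominates.
        if len(ranked) == 1 or ranked[0] - ranked[1] >= margin:
            break
    best = max(scores, key=scores.get)
    return best, chains
```

In this sketch, easy queries terminate after the first batch of samples, while harder queries keep drawing chains up to `max_rounds`, mirroring the paper's observation that most queries are resolved in the first round.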
Experimental Results
Mathematical Reasoning: AoR was evaluated on six datasets: GSM8K, MultiArith, SingleEQ, SVAMP, AddSub, and AQuA. It outperformed existing methods such as Self-Consistency (SC), Complexity-based Consistency (CC), and DiVeRSe, most notably on AQuA, where AoR improved accuracy by 7.2% over SC.
Commonsense and Symbolic Reasoning: The framework was also tested on StrategyQA, CommonsenseQA, BoolQ, ARC-C, Date Understanding, Penguins, Colored Objects, and Object Counting. AoR achieved an average improvement of roughly 8.45% on the commonsense reasoning tasks and significant gains on the symbolic reasoning tasks.
Dynamic Sampling Efficiency: Analyses on the AQuA and GSM8K datasets show that dynamic sampling lets AoR resolve most queries after the first sampling round, reserving additional computation for the more complex ones.
Discussion
The paper emphasizes the importance of robust evaluation of reasoning processes over simple answer frequency. By performing thorough assessments of reasoning chains and dynamically adjusting sampling based on task complexity, AoR reduces computational costs while enhancing answer accuracy. This approach addresses the central limitation of majority-voting methods and improves the robustness and adaptability of LLMs across different reasoning tasks.
In further evaluations with diverse LLM backbones, including GPT-4, Claude-2, LLaMA-2-70B-Chat, and Mixtral-8x7B, AoR consistently outperformed traditional methods, demonstrating its flexibility and potential for broad application. Using a stronger model such as GPT-4 in the evaluation phases further improved performance, underscoring the importance of evaluator quality in the AoR framework.
Conclusion
AoR marks a significant step towards refining reasoning capabilities in LLMs by shifting the focus from answer frequency to in-depth reasoning chain evaluation. This hierarchical framework not only provides a comprehensive solution to the limitations of majority-voting mechanisms but also demonstrates versatility across various reasoning tasks and LLM architectures. Future work can explore further optimization of evaluation criteria and the integration of more sophisticated LLMs to achieve even higher performance benchmarks.