
Universal Self-Consistency for Large Language Model Generation (2311.17311v1)

Published 29 Nov 2023 in cs.CL and cs.AI

Abstract: Self-consistency with chain-of-thought prompting (CoT) has demonstrated remarkable performance gains on various challenging tasks, by utilizing multiple reasoning paths sampled from LLMs. However, self-consistency relies on the answer extraction process to aggregate multiple solutions, which is not applicable to free-form answers. In this work, we propose Universal Self-Consistency (USC), which leverages LLMs themselves to select the most consistent answer among multiple candidates. We evaluate USC on a variety of benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering. On open-ended generation tasks where the original self-consistency method is not applicable, USC effectively utilizes multiple samples and improves the performance. For mathematical reasoning, USC matches the standard self-consistency performance without requiring the answer formats to be similar. Finally, without access to execution results, USC also matches the execution-based voting performance on code generation.

An Overview of Universal Self-Consistency in LLM Generation

The paper "Universal Self-Consistency for LLM Generation," explores the enhancement of LLM predictions through a novel method termed Universal Self-Consistency (USC). Traditional self-consistency techniques emphasize improved accuracy by selecting the most recurrent solution path from multiple sampled reasoning paths, which necessitates a rigid answer format conducive to aggregation. In contrast, the USC method leverages inherent LLM capabilities to adjudicate among multiple candidate responses, enhancing applicability across diverse tasks, including those with free-form answers.

Self-Consistency and its Limitations

Self-consistency, as outlined by \cite{wang2022self}, is instrumental in improving LLM performance on fixed-format tasks where final answers can be aggregated through exact match. In such contexts, the model samples diverse reasoning paths and takes the majority final answer as its prediction. Its efficacy therefore hinges on being able to count identical answers, a constraint that breaks down when responses are non-standard or entirely open-ended.
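
To make the aggregation step concrete, here is a minimal sketch of standard self-consistency on a math-style task. The `sample_fn` argument is a hypothetical stand-in for an LLM sampling call (not an API from the paper), and the regex illustrates the rigid answer extraction that free-form outputs defeat.

```python
from collections import Counter
import re

def self_consistency(prompt, sample_fn, n_samples=8):
    """Standard self-consistency: sample several chain-of-thought
    completions and majority-vote over the extracted final answers."""
    answers = []
    for _ in range(n_samples):
        completion = sample_fn(prompt)
        # Extraction assumes a rigid format such as "The answer is 42."
        # -- this is exactly what breaks down for free-form outputs
        # like summaries or open-ended answers.
        match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion)
        if match:
            answers.append(match.group(1))
    if not answers:
        return None
    # Majority vote: the most frequent extracted answer wins.
    return Counter(answers).most_common(1)[0][0]
```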

USC Methodology

USC extends this paradigm by deploying LLMs not only for generation but also for self-evaluation. The process entails sampling multiple responses from the LLM and subsequently using the same model to identify the most consistent response among them. This approach removes the need for explicit answer extraction, thereby extending self-consistency to free-form answer tasks. Testing on several benchmarks, including mathematical reasoning, code generation, long-context summarization, and open-ended question answering, demonstrates the gains USC can deliver.
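
A minimal sketch of this selection step follows. `sample_fn` is again a hypothetical sampling call, the selection prompt paraphrases the style of prompt described in the paper rather than quoting it exactly, and the index-parsing fallback is an implementation choice of this sketch.

```python
import re

def universal_self_consistency(prompt, sample_fn, n_samples=8):
    """USC sketch: the same model that generated the candidates is
    asked to pick the most consistent one, so no task-specific
    answer extraction is required."""
    # Sample candidate responses from the model.
    candidates = [sample_fn(prompt) for _ in range(n_samples)]
    # Concatenate the numbered candidates into one selection prompt.
    numbered = "\n\n".join(
        f"Response {i}:\n{resp}" for i, resp in enumerate(candidates, 1)
    )
    selection_prompt = (
        "I have generated the following responses to the question:\n\n"
        f"{prompt}\n\n{numbered}\n\n"
        "Evaluate these responses. Select the most consistent response "
        "based on majority consensus. Start your answer with "
        '"The most consistent response is Response X".'
    )
    # Ask the same model to choose, then parse the chosen index.
    verdict = sample_fn(selection_prompt)
    match = re.search(r"Response\s+(\d+)", verdict)
    idx = int(match.group(1)) - 1 if match else 0  # fall back to sample 1
    return candidates[max(0, min(idx, len(candidates) - 1))]
```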

Empirical Results and Implications

The evaluation of USC across benchmarks indicates its ability to match or surpass existing approaches, notably matching traditional self-consistency and, on code generation, execution-based candidate selection without relying on execution results. For mathematical reasoning, USC aligns closely with standard self-consistency performance while proving reasonably robust to the ordering of candidate responses, highlighting the flexibility and adaptability of LLMs in self-assessment tasks.

USC generalizes existing methodologies, permitting consistency-driven aggregation on tasks where answer-format constraints previously made aggregation impractical. This generalization fosters wider applicability, potentially streamlining workflows in fields from academic research to industry applications built on LLM capabilities. Moreover, by avoiding additional model training and execution-based assessments, USC offers a generalized self-consistency that is comparatively lightweight to apply.

Challenges and Path Forward

Despite the demonstrated practical improvements, notable limitations remain. The selection step is exposed to the LLM's position bias and to the limits of long-context handling, both of which require further refinement. Additionally, as the sampling volume increases, certain tasks (e.g., GSM8K) exhibit performance saturation, indicating the need for better strategies for leveraging the breadth of sampled responses.
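
One way to probe the position-bias concern empirically is to rerun the selection over shuffled orderings of the same candidates and check whether the choice is stable. The sketch below assumes a hypothetical `select_fn(prompt, candidates)` wrapping a USC-style selection call; it is a diagnostic idea, not a procedure from the paper.

```python
import random

def order_sensitivity(prompt, candidates, select_fn, n_trials=5, seed=0):
    """Count how often each candidate is chosen across random orderings.
    A selector free of position bias should pick the same candidate
    regardless of where it appears in the prompt."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_trials):
        shuffled = list(candidates)
        rng.shuffle(shuffled)
        choice = select_fn(prompt, shuffled)
        counts[choice] = counts.get(choice, 0) + 1
    return counts
```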

In conclusion, Universal Self-Consistency represents a significant evolution in LLM task execution, offering a robust template for addressing complex, free-form generative tasks. Future research might focus on refining USC's scalability and on the dynamics of response sampling and selection to further enhance LLM capabilities. The paper posits USC not merely as an enhancement mechanism but as a broadly applicable tool for unlocking the potential of LLMs across multifaceted domains.

Authors (10)
  1. Xinyun Chen (80 papers)
  2. Renat Aksitov (7 papers)
  3. Uri Alon (40 papers)
  4. Jie Ren (329 papers)
  5. Kefan Xiao (7 papers)
  6. Pengcheng Yin (42 papers)
  7. Sushant Prakash (15 papers)
  8. Charles Sutton (74 papers)
  9. Xuezhi Wang (64 papers)
  10. Denny Zhou (65 papers)
Citations (42)