An Analysis of "LLMs Can Generate a Better Answer by Aggregating Their Own Responses"
The paper "LLMs Can Generate a Better Answer by Aggregating Their Own Responses" proposes a novel prompting method for LLMs called Generative Self-Aggregation (GSA). The research addresses the limitations faced by LLMs in handling complex reasoning tasks without external supervision, typically needed in discriminative judgment scenarios. The authors demonstrate that their approach improves the quality of responses generated by LLMs without relying on external feedback mechanisms, such as symbolic reasoning or human annotations, often demanded by other methodologies.
Methodology
The central innovation of the paper is GSA, which generates multiple diverse responses to a query and then synthesizes them into an improved answer. This contrasts with traditional self-consistency or choose-from-N strategies, where responses are subjected to a selection step based on voting or judgment, which requires the model to assess response quality. GSA instead leverages the generative capability inherent in LLMs to produce an aggregated response directly, using the multiple candidate solutions as additional context.
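To make the pipeline concrete, here is a minimal Python sketch of the two-stage flow described above: sample several diverse responses, then prompt the model to generate one aggregated answer conditioned on all of them. The `llm_complete` client, the helper names, and the exact prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of Generative Self-Aggregation (GSA) as described in the paper.
# `llm_complete` is a placeholder for any chat-completion call; its name and
# signature are assumptions, not code from the paper.

from typing import List


def llm_complete(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError("wire this to your model client")


def generate_candidates(question: str, n: int = 5) -> List[str]:
    # Step 1: sample n diverse responses; a higher temperature encourages diversity.
    return [llm_complete(question, temperature=1.0) for _ in range(n)]


def aggregate(question: str, candidates: List[str]) -> str:
    # Step 2: instead of voting or judging, ask the model to *generate* a new
    # answer conditioned on the question and all candidate responses.
    context = "\n\n".join(
        f"Response {i + 1}:\n{resp}" for i, resp in enumerate(candidates)
    )
    prompt = (
        f"Question: {question}\n\n"
        f"Here are several candidate responses:\n\n{context}\n\n"
        "Drawing on the useful reasoning and information above, write a single, "
        "improved answer to the question."
    )
    return llm_complete(prompt, temperature=0.2)


def generative_self_aggregation(question: str, n: int = 5) -> str:
    return aggregate(question, generate_candidates(question, n))
```

The key design point the sketch highlights is that the second call is purely generative: the model never ranks or scores the candidates, it only conditions on them.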
Results
Empirical evaluations across various tasks demonstrate the efficacy of GSA:
- Mathematical Reasoning: On benchmarks such as GSM8K and MATH, GSA performs comparably to or better than self-consistency, because it integrates the candidates' reasoning processes rather than merely voting over their final answers (contrast the baseline sketch after this list).
- Knowledge-Based Problems: On knowledge-intensive datasets such as MMLU and GPQA, GSA matches or exceeds discriminative selection methods, underscoring its capacity to synthesize conflicting candidate answers without an explicit judgment step.
- Open-Ended Tasks: On code synthesis and conversational-response tasks, GSA's generative aggregation outperforms self-correction and choose-from-N approaches, which often falter when no precise selection criterion is available.
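For comparison, the self-consistency baseline mentioned above reduces to a majority vote over final answers only, as in the sketch below; it reuses the assumed `generate_candidates` helper from the earlier sketch, and `extract_final_answer` is a hypothetical parsing step.

```python
# Contrast with GSA: a minimal self-consistency baseline, which aggregates only
# the final answers by majority vote and discards the intermediate reasoning.
# Reuses the assumed `generate_candidates` helper from the earlier sketch.

from collections import Counter


def extract_final_answer(response: str) -> str:
    """Placeholder: pull the final short answer (e.g., a number) out of a response."""
    return response.strip().splitlines()[-1]


def self_consistency(question: str, n: int = 5) -> str:
    candidates = generate_candidates(question, n)
    finals = [extract_final_answer(r) for r in candidates]
    # Majority vote over final answers; the reasoning text is never reused,
    # which is the limitation GSA's generative aggregation is meant to address.
    return Counter(finals).most_common(1)[0][0]
```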
Implications
The implications of this research are substantial for the field of AI, particularly for building robust LLM applications that do not depend on external evaluative frameworks. The paper suggests that generatively aggregating multiple responses can enhance the autonomy of LLMs, potentially reducing the need for extensive manual supervision or specialized discriminative training.
The approach also points to promising future work on fine-tuning LLMs to better exploit the diverse reasoning paths that complex tasks admit, and on using generative aggregation to refine model performance in open-ended settings.
Conclusion
The paper makes a significant contribution by strategically shifting the focus from selecting among responses to generating from them, changing how LLM capabilities can be harnessed for intricate problems. The promising results across varied tasks suggest GSA's potential as a foundational technique for future LLM prompt engineering and response synthesis.