An Examination of the Discrimination Capabilities of LLMs
The paper "SelfDiLLMa: LLMs Struggle with Refining Self-Generated Responses" addresses the question of whether LLMs can effectively discriminate among their previously generated alternatives compared to generating initial responses. The authors explore this through the lens of a hypothesis that examines whether LLMs' discriminative capabilities exceed their generation abilities. This endeavor is particularly crucial as it impacts the development of self-improving AI systems.
The authors introduce a unified evaluation framework that assesses LLMs' generative and discriminative capabilities on an equal footing across four tasks: mathematics (GSM8K), world knowledge (TriviaQA), truthful question answering (TruthfulQA), and instruction following (MT-Bench). Their findings reveal that LLMs generally do not perform better in the discrimination phase than in the generation phase, challenging the notion that LLMs are inherently capable of self-improvement through self-evaluation.
A key methodological contribution is the framework itself, which permits a like-for-like comparison of generative and discriminative capability. The methodology is a two-phase process: a generation phase in which the model produces several candidate responses, with generation performance scored on a randomly selected candidate, and a discrimination phase in which the same model attempts to choose the best response from among its own candidates. This setup mirrors the self-improvement loop and allows large-scale evaluation without human intervention.
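To make the protocol concrete, the sketch below scores both phases on the same set of questions. It is a minimal illustration under stated assumptions, not the authors' implementation: the `generate`, `discriminate`, and `is_correct` callables are hypothetical stand-ins for the model calls and answer checking used in the paper.

```python
import random

def evaluate_two_phase(questions, generate, discriminate, is_correct,
                       num_candidates=4, seed=0):
    """Return (generation_accuracy, discrimination_accuracy).

    generate(question, n) -> list of n candidate answers (model sampling),
    discriminate(question, candidates) -> the candidate the model judges best,
    is_correct(question, answer) -> bool.
    All three are hypothetical stand-ins, not the paper's actual API.
    """
    rng = random.Random(seed)
    gen_hits = disc_hits = 0
    for q in questions:
        # Generation phase: sample several candidates; score a random pick,
        # approximating the model's first-attempt accuracy.
        candidates = generate(q, num_candidates)
        gen_hits += is_correct(q, rng.choice(candidates))
        # Discrimination phase: the same model selects among its own outputs.
        disc_hits += is_correct(q, discriminate(q, candidates))
    n = len(questions)
    return gen_hits / n, disc_hits / n
```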
The examined models include the LLaMA-2 series as well as proprietary models such as OpenAI's GPT-3.5 and GPT-4. The results consistently show that DG-Diff, the gap between discrimination performance and generation performance, is generally small or negative: discrimination is either no better than, or worse than, generation. These findings hold regardless of model size and fine-tuning, suggesting a fundamental limitation in LLMs' discrimination abilities.
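Assuming DG-Diff is computed as the simple difference between discrimination and generation accuracy (the paper's exact normalization may differ), the metric and the sign convention used above can be stated in a few lines:

```python
def dg_diff(discrimination_accuracy: float, generation_accuracy: float) -> float:
    """Discrimination-Generation difference. Positive values would mean the
    model judges its own candidates better than it generates them; the paper
    reports values that are generally small or negative."""
    return discrimination_accuracy - generation_accuracy

# Illustrative numbers only (not from the paper): 54% discrimination
# accuracy versus 57% generation accuracy yields a negative DG-Diff.
assert round(dg_diff(0.54, 0.57), 2) == -0.03
```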
The paper further explores whether this limitation stems from the autoregressive nature of LLM pre-training. Examining Flan-T5 and Flan-UL2, models whose pre-training does not rely on purely autoregressive objectives, the authors find a positive DG-Diff, suggesting that the pre-training objective may influence the capability gap between generation and discrimination.
Moreover, the authors investigate whether enhancements to the discrimination phase, such as improved prompt engineering or additional in-context learning examples, could alter the observed performance (an illustrative prompt sketch follows below). Despite these modifications, DG-Diff remains small or negative, reinforcing the paper's assertion that the gap is not merely an artifact of prompt design.
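As an illustration of one such enhancement, the sketch below assembles a discrimination prompt with a configurable number of in-context examples. The template wording, the worked example, and the function name are hypothetical; the paper's actual prompts are not reproduced here.

```python
# One worked example shown to the model before the real question; a real
# few-shot setup would use several distinct examples rather than repeats.
FEW_SHOT_EXAMPLE = (
    "Question: What is 7 * 8?\n"
    "(A) 54\n(B) 56\n"
    "Best answer: (B)\n\n"
)

def build_discrimination_prompt(question, candidates, num_shots=1):
    """Format candidate responses as lettered options after num_shots
    in-context examples, ending with a cue for the model's choice."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(candidates))
    return (FEW_SHOT_EXAMPLE * num_shots
            + f"Question: {question}\n{options}\nBest answer:")

print(build_discrimination_prompt("What is the capital of France?",
                                  ["Lyon", "Paris"]))
```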
The implications are significant for the design and deployment of self-refinement techniques built on LLMs. While the paper does not entirely dismiss the potential of self-rewarding techniques, it suggests that without substantial improvements in discrimination, the efficacy of such self-improvement frameworks will be limited. The paper also highlights the need for alternative strategies, for instance ones that leverage non-traditional pre-training objectives, to overcome these limitations.
In conclusion, the paper provides a comprehensive analysis of the limitations inherent in the self-discriminative capabilities of LLMs. It challenges prevailing assumptions about LLMs' potential to engage in meaningful self-improvement without external feedback or guidance. Future studies could probe pre-training objectives in more depth, refine the evaluation framework, and identify mechanisms that mitigate the observed limitations. Such work will be instrumental in developing truly autonomous, self-improving artificial intelligence systems.