- The paper introduces the Classify-w/o-Gold task to critically assess LLM performance without explicit gold labels.
- The paper presents the Know-No benchmark, comprising the tasks Bank-77, MC-Test, and EquInfer to simulate real-world classification challenges.
- The paper proposes OmniAccuracy, a metric that aggregates performance both with and without gold labels; under it, LLMs fall well short of human performance when the gold label is absent.
A Formal Evaluation of LLMs' Classification Performance: The Know-No Benchmark
The paper "LLMs' Classification Performance is Overclaimed" by Hanzi Xu and colleagues presents a rigorous examination of LLMs in classification tasks, particularly scrutinizing their purported intelligence when classifying without the presence of explicit gold labels. The paper aims to challenge the prevailing notion that LLMs exhibit robust understanding and discrimination in classification tasks, instead arguing that current evaluations overstate their effectiveness due to insufficiently rigorous benchmarks and metrics.
Key Contributions
- Introduction of Classify-w/o-Gold Task: This paper is pioneering in identifying the limitations of LLMs in classification tasks when the gold labels are excluded from the label space. The authors define this scenario as Classify-w/o-Gold and propose it as a novel evaluation framework for LLMs.
- Know-No Benchmark: The authors introduce a new benchmark, Know-No, which includes two existing classification tasks and one newly created task, EquInfer. This benchmark is specifically designed to evaluate the performance of LLMs under the Classify-w/o-Gold framework.
- OmniAccuracy Metric: They propose a new evaluation metric, OmniAccuracy, which combines performance metrics across conditions where gold labels are both present and absent. This metric aims to provide a more comprehensive assessment of LLMs' classification capabilities.
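To make the idea concrete, here is a minimal Python sketch of how an OmniAccuracy-style score could be computed. The simple averaging of the with-gold and without-gold conditions, the `reject_token` convention, and the function names are illustrative assumptions; they are not the paper's exact formulation.

```python
from typing import Sequence

def accuracy(predictions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their target label."""
    assert len(predictions) == len(targets)
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def omni_accuracy(preds_with_gold: Sequence[str],
                  gold_labels: Sequence[str],
                  preds_without_gold: Sequence[str],
                  reject_token: str = "NONE_OF_THE_ABOVE") -> float:
    """Illustrative combination of the two evaluation conditions.

    - Classify-w/-Gold: the model is correct if it picks the gold label.
    - Classify-w/o-Gold: the gold label was removed from the options, so the
      model is correct only if it signals that no option applies.
    The simple average below is an assumption, not the paper's exact formula.
    """
    acc_with = accuracy(preds_with_gold, gold_labels)
    acc_without = accuracy(preds_without_gold,
                           [reject_token] * len(preds_without_gold))
    return 0.5 * (acc_with + acc_without)

# Toy usage: one correct pick in each condition out of two examples each.
print(omni_accuracy(["refund", "card_lost"], ["refund", "card_arrival"],
                    ["NONE_OF_THE_ABOVE", "refund"]))  # -> 0.5
```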
Research Methodology
The paper employs three classification tasks within the Know-No benchmark:
- Bank-77: An intent classification task built from customer-service queries in the banking domain, with moderate complexity and a large label space of 77 intents.
- MC-Test: A multiple-choice question answering task using elementary school-level stories, chosen for its simplicity and straightforward answer patterns.
- EquInfer: A newly assembled task involving the inference of original mathematical equations from surrounding contextual paragraphs, representing high complexity and requiring domain-specific expertise.
To evaluate these tasks, the research uses several state-of-the-art LLMs, both closed-source (GPT-4, Claude 3) and open-source (Llama 3, Gemma, Mistral). LLM performance is compared against human performance under testing conditions that reflect real-world scenarios: when gold labels are absent, models are either given a hint that the correct label may be missing or left entirely unhinted.
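For concreteness, the sketch below shows how a single example might be turned into the two evaluation conditions: a candidate set that contains the gold label (Classify-w/-Gold) and one from which it has been removed (Classify-w/o-Gold). The helper names, prompt wording, and label strings are assumptions for illustration, not the benchmark's actual prompts or data.

```python
import random

def build_label_sets(gold: str, distractors: list[str], k: int = 3, seed: int = 0):
    """Return (with_gold, without_gold) candidate label lists for one example.

    with_gold    : gold label plus k distractors   (Classify-w/-Gold)
    without_gold : k + 1 distractors only          (Classify-w/o-Gold)
    """
    rng = random.Random(seed)
    pool = [d for d in distractors if d != gold]
    with_gold = rng.sample(pool, k) + [gold]
    rng.shuffle(with_gold)
    without_gold = rng.sample(pool, k + 1)
    return with_gold, without_gold

def format_prompt(query: str, options: list[str]) -> str:
    """Assumed prompt template: plain multiple-choice formatting."""
    lines = [f"Query: {query}", "Choose the best matching intent:"]
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

labels = ["card_arrival", "refund_request", "lost_or_stolen_card",
          "exchange_rate", "top_up_failed", "pin_blocked"]
with_gold, without_gold = build_label_sets("refund_request", labels)
print(format_prompt("I still haven't received my money back.", without_gold))
```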
Strong Numerical Results
The paper reveals several pivotal findings:
- GPT-4 Performance: Achieves 98.67% accuracy on MC-Test when the gold label is present, but drops to around 67.29% (E[Accuracy_w/o]) when the gold label is absent and hints are provided.
- Comparison of Hints: When gold labels are absent, closed-source LLMs perform better with the "Hint-in-Instru" prompting strategy, whereas open-source LLMs do better with "Hint-as-Option" (both strategies are illustrated in the sketch after this list).
- Human vs. LLM Performance: Humans significantly outperform LLMs on the Classify-w/o-Gold task. On MC-Test, for instance, human performance reaches roughly 97.67% OmniAccuracy, indicating a far stronger ability than LLMs to recognize when the correct label is absent.
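The two hinting strategies can be illustrated with a short sketch: under "Hint-in-Instru" the instruction itself warns that the correct label may be missing, while under "Hint-as-Option" an explicit "None of the above" choice is appended. The wording below is an assumption; the paper's exact prompt templates are not reproduced here.

```python
def hint_in_instruction(query: str, options: list[str]) -> str:
    """Hint placed in the instruction text (assumed wording)."""
    header = ("Classify the query. Note: the correct label may not appear "
              "among the options; if so, say that none of them applies.")
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    return f"{header}\nQuery: {query}\n{opts}"

def hint_as_option(query: str, options: list[str]) -> str:
    """Hint expressed as an extra candidate answer (assumed wording)."""
    extended = options + ["None of the above"]
    body = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(extended))
    return f"Classify the query.\nQuery: {query}\n{body}"

q = "My top-up keeps failing."
print(hint_in_instruction(q, ["card_arrival", "exchange_rate"]))
print()
print(hint_as_option(q, ["card_arrival", "exchange_rate"]))
```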
Theoretical and Practical Implications
The research compels a reevaluation of how LLMs are benchmarked in classification tasks. The traditional approach, which assumes the presence of gold labels, may inflate performance assessments and obscure true shortcomings in model intelligence. OmniAccuracy and the Know-No benchmark present more stringent and realistic metrics, pushing for the development of LLMs that not only generate accurate classifications when correct options are present but also recognize and appropriately handle scenarios where they are not.
Speculations on Future Developments
Given these findings, future research could explore the creation of LLM architectures that inherently incorporate mechanisms for uncertainty detection and rejection of invalid label sets. This would align LLM behavior closer to human-like reasoning, further enhancing their utility in real-world applications where label correctness isn't always guaranteed. Additionally, expanding the Know-No benchmark with more diverse and complex classification tasks could help in understanding the broader applicability and limitations of LLMs in various domains.
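As one hedged illustration of what such a mechanism might look like, the wrapper below abstains whenever no candidate label receives sufficient probability mass. The threshold rule, names, and the assumption that normalized per-label scores are available are hypothetical and not drawn from the paper.

```python
from typing import Mapping, Optional

def classify_or_reject(option_scores: Mapping[str, float],
                       min_confidence: float = 0.5) -> Optional[str]:
    """Return the best option, or None when no option is confident enough.

    `option_scores` is assumed to hold normalized probabilities per label,
    e.g. derived from a model's log-likelihoods over the candidate set.
    """
    best_label, best_score = max(option_scores.items(), key=lambda kv: kv[1])
    if best_score < min_confidence:
        return None  # reject: the gold label is likely missing from the set
    return best_label

# With a flat, low-confidence distribution the wrapper abstains.
print(classify_or_reject({"card_arrival": 0.34, "refund": 0.33, "pin_blocked": 0.33}))
# -> None
```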
Conclusion
The paper provides a critical and illuminating assessment of LLMs' classification performance, proposing robust frameworks to evaluate their real intelligence. The Know-No benchmark and OmniAccuracy metric together represent significant steps towards a more reliable evaluation of LLMs, challenging researchers to push beyond current capabilities and develop models that are resilient, accurate, and truly intelligent in understanding and performing classification tasks.