Evaluation of Consistency in Binary Text Classification: A Framework Utilizing LLMs
The paper "Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications" introduces a structured approach to reliability assessment in LLMs focused on binary text classification tasks. The necessity for a consistent evaluation framework is driven by the lack of established methods that ensure stability and predictability in decision support systems using LLMs.
Methodology and Case Study Insights
The proposed framework involves four key phases: planning, data collection, reliability analysis, and validity analysis. By adapting statistical schemes from psychometrics, the authors provide a methodology that integrates sample size planning with reliability and validity metrics.
In the planning phase, the framework guides the selection of agreement metrics such as Conger's Kappa, Fleiss' Kappa, Gwet's AC1, and Krippendorff's Alpha to analyze consistency both across repeated runs of the same model (intra-rater reliability) and across different models (inter-rater reliability). Special attention is given to sample size determination, balancing data sufficiency against computational cost. The authors adopt Gwet's approach to estimating the minimum number of items needed for dependable agreement estimates, applying a conservative statistical correction.
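As an illustration (not code from the paper), the sketch below computes two of these quantities for a multi-rater labeling matrix following their standard definitions: raw observed agreement and Gwet's AC1. The function names and the toy ratings matrix are assumptions made for this example; the paper's own analysis may rely on existing statistical packages.

```python
import numpy as np

def observed_agreement(ratings: np.ndarray) -> float:
    """Average pairwise agreement per item for an items x raters matrix
    of categorical labels (here 0/1 for the binary task)."""
    n_items, n_raters = ratings.shape
    categories = np.unique(ratings)
    per_item = []
    for row in ratings:
        counts = np.array([(row == c).sum() for c in categories])
        # proportion of agreeing rater pairs for this item
        per_item.append((counts * (counts - 1)).sum() / (n_raters * (n_raters - 1)))
    return float(np.mean(per_item))

def gwet_ac1(ratings: np.ndarray) -> float:
    """Gwet's AC1 agreement coefficient for an items x raters matrix."""
    categories = np.unique(ratings)
    q = len(categories)
    # overall proportion of all ratings falling in each category
    pi = np.array([(ratings == c).mean() for c in categories])
    p_a = observed_agreement(ratings)
    # chance agreement under Gwet's model: (1/(q-1)) * sum_k pi_k (1 - pi_k)
    p_e = (pi * (1 - pi)).sum() / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Toy example: 5 news items labeled by 3 models (1 = positive, 0 = negative)
ratings = np.array([
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
print(f"observed agreement: {observed_agreement(ratings):.3f}")  # 0.867
print(f"Gwet's AC1:         {gwet_ac1(ratings):.3f}")            # 0.744
```

The same items-by-raters layout works for intra-rater analysis by treating repeated runs of one model as the "raters".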
In a real-world application, the study examines binary sentiment classification of financial news across 14 LLMs, including GPT-4 and Claude-3. A notable finding is that smaller models such as Gemma 3 and Llama 1 outperformed their larger counterparts on internal consistency, contradicting the assumption that larger models inherently provide more reliable results.
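A minimal sketch of the kind of data-collection loop such a study implies is shown below. The `classify` stub, the model list, and the replicate count are assumptions for illustration, not the authors' actual prompts or client code; in practice `classify` would wrap an LLM API call.

```python
from collections import defaultdict

# Toy stand-in for an LLM call; replace with your own client and prompt.
def classify(model_name: str, headline: str) -> int:
    """Return 1 for positive sentiment, 0 for negative (keyword toy rule)."""
    return int(any(w in headline.lower() for w in ("beat", "surge", "gain")))

MODELS = ["gpt-4", "claude-3", "gemma-3"]  # illustrative subset of the 14 models
N_REPLICATES = 5                           # repeated queries per headline

def collect_ratings(headlines):
    """ratings[model][i] holds the N_REPLICATES labels for headline i."""
    ratings = defaultdict(list)
    for model in MODELS:
        for headline in headlines:
            ratings[model].append(
                [classify(model, headline) for _ in range(N_REPLICATES)]
            )
    return ratings

demo = collect_ratings(["Shares surge after earnings beat", "Profit warning hits stock"])
print(demo["gpt-4"])  # [[1, 1, 1, 1, 1], [0, 0, 0, 0, 0]]
```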
Evaluation Metrics
- Intra-Rater Reliability: Models exhibited high intra-rater reliability, with most achieving perfect agreement on 90% of the examples across repeated runs. This level of consistency suggests that, under a fixed configuration, these models can provide dependable outputs, which is crucial for applications that repeatedly classify the same inputs (a minimal computation sketch follows this list).
- Inter-Rater Reliability: Agreement across models was strong overall but declined as more models were added to the comparison, signaling sensitivity to model configuration and prompt design.
- Predictive Validity: The criterion-related validity analysis revealed a gap: models performed near chance level when their sentiment labels were used to predict actual stock market movements. This result is consistent with the efficient market hypothesis and underscores the limits of relying on news-based sentiment alone for market prediction.
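The sketch referenced above illustrates the two simplest versions of these checks: the share of items a model labels identically across repeated runs, and the accuracy of sentiment labels against subsequent market moves. Function names and toy arrays are assumptions; the paper's own estimators (the agreement coefficients discussed earlier) are more elaborate.

```python
import numpy as np

def perfect_agreement_rate(replicates: np.ndarray) -> float:
    """Share of items on which one model gave the identical label
    across all repeated runs (items x replicates matrix)."""
    return float((replicates.min(axis=1) == replicates.max(axis=1)).mean())

def criterion_accuracy(predicted: np.ndarray, market_moves: np.ndarray) -> float:
    """Accuracy of sentiment labels against realized up/down moves;
    values near 0.5 indicate chance-level predictive validity."""
    return float((predicted == market_moves).mean())

# Toy data: one model, 4 headlines, 3 repeated runs each
replicates = np.array([[1, 1, 1],
                       [0, 0, 0],
                       [1, 0, 1],
                       [1, 1, 1]])
print(perfect_agreement_rate(replicates))         # 0.75
print(criterion_accuracy(np.array([1, 0, 1, 1]),  # majority labels per headline
                         np.array([0, 0, 1, 0]))) # realized moves -> 0.5
```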
Theoretical and Practical Implications
The framework advances the measurement of LLM reliability for classification tasks, shifting the focus from raw accuracy to the stability of LLMs as probabilistic annotators. The findings challenge the prevailing view that larger LLMs are required for superior reliability, advocating instead for task-specific, empirically grounded evaluation.
Practically, this work provides guidelines for deploying and evaluating LLMs in resource-constrained scenarios, an important consideration for organizations balancing computational costs against performance requirements. Smaller models demonstrated reliability comparable to that of more expensive counterparts, prompting a reevaluation of the case for upgrading to larger models.
Future Research Directions
Several limitations open avenues for further investigation: extending the framework from binary to multiclass classification, testing generalizability across diverse domains, and exploring alternative strategies for handling non-compliant model responses. Extending the framework to non-text modalities such as image recognition also represents a promising direction for cross-modal reliability assessment.
Conclusion
The structured framework introduced here offers a replicable method for rigorously evaluating the consistency, reliability, and validity of LLM-based text annotation and classification, particularly for binary tasks. By demonstrating high intra-model reliability and challenging assumptions about scale, the study directs future efforts toward efficient resource use and realistic expectations for LLMs in predictive applications.