
LLM Stability: A detailed analysis with some surprises (2408.04667v2)

Published 6 Aug 2024 in cs.CL, cs.AI, cs.LG, and cs.SE

Abstract: LLM practitioners commonly notice that outputs can vary for the same inputs, but we have been unable to find work that evaluates LLM stability as the main objective. In our study of 6 deterministically configured LLMs across 8 common tasks with 5 identical runs, we see accuracy variations up to 10%. In addition, no LLM consistently delivers repeatable accuracy across all tasks. We also show examples of variation that are not normally distributed and compare configurations with zero-shot/few-shot prompting and fine-tuned examples. To better quantify what is going on, we introduce metrics focused on stability: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for total agreement over parsed-out answers. We suggest that stability metrics be integrated into leader boards and research results going forward.

Overview of "LLM Stability: A detailed analysis with some surprises"

The paper "LLM Stability: A detailed analysis with some surprises" examines the under-explored question of stability in LLMs: how much outputs vary for the same inputs under fixed hyper-parameters, and what that variability means for reproducibility. Stability has significant implications both for deploying these models in real-world applications and for the trustworthiness of the benchmarks used to evaluate them.

Key Findings

The paper systematically assesses the stability of several prominent LLMs, including GPT-3.5 Turbo, GPT-4o, Llama-3-70B-Instruct, Llama-3-8B-Instruct, and Mixtral-8x7B-Instruct, using common tasks from BIG-Bench Hard (BBH) and the Measuring Massive Multitask Language Understanding (MMLU) benchmark. Noteworthy findings include:

  • Non-Deterministic Outputs: It is uncommon for LLMs to produce the same raw output for identical inputs even when hyper-parameters are set to ensure determinism.
  • Accuracy Distribution: Variations in LLM accuracy are not normally distributed.
  • Task-Dependent Stability: Stability varies considerably depending on the specific task being performed.

Experimental Setup and Metrics

The researchers used a few-shot prompting protocol without Chain-of-Thought (CoT) reasoning. To ensure consistent conditions, they set temperature to zero, set top-p to one, and fixed random seeds, then reran each model five times on each dataset.
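
A hedged sketch of what such a deterministic configuration looks like in practice, assuming the openai Python client (an assumption on our part; the paper's harness is not reproduced here). The model name, prompt, and seed value are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "..."  # placeholder for a few-shot prompt without CoT exemplars

response = client.chat.completions.create(
    model="gpt-4o",   # one of the models studied in the paper
    messages=[{"role": "user", "content": prompt}],
    temperature=0,    # greedy-style decoding
    top_p=1,          # no nucleus truncation
    seed=12345,       # fixed seed; best-effort determinism on this API
)
print(response.choices[0].message.content)
```

Even with these settings, the paper reports that raw outputs can still differ across reruns.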

Metrics included:

  • Minimum, median, and maximum accuracy values across runs.
  • Minimum-maximum spread in accuracy.
  • Total Agreement Rate (TAR), with two submetrics (a computational sketch follows this list):
    • TARa (agreement on the parsed answer)
    • TARr (agreement on the raw model response)
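
To make the TAR metrics concrete, here is a minimal sketch of how TARr@N and TARa@N could be computed. The function, data layout, and parse_answer helper are illustrative assumptions, not the authors' code:

```python
def tar(outputs_per_run):
    """Total Agreement Rate: the fraction of items on which all N runs
    produce exactly the same output.

    outputs_per_run: a list of N lists, where outputs_per_run[i][j] is
    run i's output for item j (raw text for TARr, parsed answer for TARa).
    """
    n_items = len(outputs_per_run[0])
    agreeing = sum(
        1
        for j in range(n_items)
        if len({run[j] for run in outputs_per_run}) == 1
    )
    return agreeing / n_items

# TARr@5: exact agreement over raw responses from 5 runs
# tar_r = tar(raw_runs)
# TARa@5: agreement over parsed-out answers, given some extractor parse_answer
# tar_a = tar([[parse_answer(text) for text in run] for run in raw_runs])
```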

Numerical Results

The results revealed substantial variability in the supposedly deterministic behavior of the LLMs:

  • GPT-3.5 Turbo showed high stability, with TARa scores frequently reaching near 100%.
  • GPT-4o and Llama-3-70B-Instruct, among others, exhibited significant variability in agreement rates and accuracy metrics.
  • For example, GPT-4o only achieved a median TARr of 3%, while GPT-3.5 Turbo managed a much more stable 97%.

Specific tasks such as college math and geometric shapes showed greater variability in output stability compared to tasks like European history.

Implications of Findings

Practical Implications:

  • Commercial Use: Non-deterministic outputs can severely affect customer satisfaction, reproducibility, and reliability of AI services.
  • System Design: Developers need to design systems that tolerate output variability, particularly when high precision and reliability are required (one mitigation pattern is sketched below).
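
One common mitigation pattern (a standard technique, not one the paper prescribes) is to sample the model several times and act on the majority answer. In this sketch, ask_model is a hypothetical callable wrapping whatever client and answer parser the application uses:

```python
from collections import Counter

def majority_answer(ask_model, prompt, n_runs=5):
    """Query the model n_runs times and return the most common parsed
    answer together with its empirical agreement rate.

    ask_model: hypothetical callable that sends `prompt` to the model
    and returns a parsed answer string.
    """
    answers = [ask_model(prompt) for _ in range(n_runs)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_runs
```

A low agreement rate can then be surfaced as a confidence signal or used to trigger a fallback path.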

Theoretical Implications:

  • The non-normal distribution of accuracy variations suggests underlying complexities in floating-point computations (for example, non-associative arithmetic whose results depend on parallel execution order) or other intrinsic properties of the LLMs.
  • Understanding the task-dependent nature of stability can drive more nuanced benchmarking and model development efforts.

Speculation on Future Developments

The paper's findings point toward several avenues for future research:

  • Investigating the root causes of non-deterministic outputs and developing methods to minimize them.
  • Expanding the range of tasks and models analyzed to understand stability across different domains comprehensively.
  • Developing new evaluation benchmarks that account for variability in model responses.
  • Improving parsers for answer extraction to handle output variations better (a toy extractor is sketched below).
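
On the parsing point, even a toy extractor illustrates why TARa and TARr can diverge: many distinct raw responses map to the same parsed answer. This regex-based sketch is an illustrative assumption, not the paper's parser:

```python
import re

def parse_answer(raw_response):
    """Toy extractor for multiple-choice answers: returns the last
    standalone choice letter (A-E) in the model's raw text, or None.

    Real harnesses need far more robust handling of formats such as
    "The answer is (B)" versus "B." versus free-text restatements.
    """
    matches = re.findall(r"\b([A-E])\b", raw_response)
    return matches[-1] if matches else None

# parse_answer("The correct answer is (C).") -> "C"
```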

Conclusion

This paper makes a substantial contribution by quantifying the stability of LLMs, revealing non-deterministic behavior even under conditions optimized for consistency. Such findings necessitate a re-evaluation of how benchmarks assess model performance and how these models are implemented in critical applications. Future work could explore methodologies to enhance stability further and understand the factors contributing to variability.

By shedding light on a critical aspect of LLM functionality, this paper provides a foundation for more robust and reliable AI systems, ultimately advancing both academic understanding and practical applications of LLM technologies.

Authors (6)
  1. Berk Atil
  2. Alexa Chittams
  3. Liseng Fu
  4. Ferhan Ture
  5. Lixinyu Xu
  6. Breck Baldwin