Non-Determinism of "Deterministic" LLM Settings

Published 6 Aug 2024 in cs.CL, cs.AI, cs.LG, and cs.SE (arXiv:2408.04667v5)

Abstract: LLM practitioners commonly notice that outputs can vary for the same inputs under settings expected to be deterministic. Yet the questions of how pervasive this is, and with what impact on results, have not to our knowledge been systematically investigated. We investigate non-determinism in five LLMs configured to be deterministic when applied to eight common tasks across 10 runs, in both zero-shot and few-shot settings. We see accuracy variations up to 15% across naturally occurring runs, with a gap between best possible performance and worst possible performance of up to 70%. In fact, none of the LLMs consistently delivers repeatable accuracy across all tasks, much less identical output strings. Sharing preliminary results with insiders has revealed that non-determinism is perhaps essential to the efficient use of compute resources via co-mingled data in input buffers, so this issue is not going away anytime soon. To better quantify our observations, we introduce metrics focused on quantifying determinism: TARr@N for the total agreement rate at N runs over raw output, and TARa@N for the total agreement rate of parsed-out answers. Our code and data are publicly available at https://github.com/breckbaldwin/LLM-stability.

Summary

  • The paper quantifies the variability of LLM outputs, showing non-deterministic behavior even with fixed hyper-parameters.
  • The study employs few-shot prompting and fixed seeds across multiple models and tasks to measure stability via diverse accuracy metrics.
  • The paper highlights both commercial and theoretical implications, urging improved system design and benchmarking for reliable AI applications.

Overview of "LLM Stability: A detailed analysis with some surprises"

The paper, an earlier version of which was titled "LLM Stability: A detailed analysis with some surprises", explores the under-explored question of stability in LLMs: the variability of outputs given the same inputs and fixed hyper-parameters, with particular focus on the reproducibility of results. LLM stability has significant implications both for deploying these models in real-world applications and for the trustworthiness of the benchmarks used to evaluate them.

Key Findings

The study systematically assesses the stability of several prominent LLMs, including GPT-3.5 Turbo, GPT-4o, Llama-3-70B-Instruct, Llama-3-8B-Instruct, and Mixtral-8x7B-Instruct. The authors used common benchmark tasks from BIG-Bench Hard (BBH) and Measuring Massive Multitask Language Understanding (MMLU). They report several noteworthy findings:

  • Non-Deterministic Outputs: LLMs rarely produce identical raw output for identical inputs, even when hyper-parameters are set to ensure determinism.
  • Accuracy Distribution: Variations in LLM accuracy are not normally distributed.
  • Task-Dependent Stability: Stability varies considerably depending on the specific task being performed.

Experimental Setup and Metrics

The researchers used a few-shot prompting protocol without Chain-of-Thought (CoT) reasoning. They controlled variables including temperature (set to zero), top-p (set to one), and fixed random seeds to ensure consistent conditions across five reruns for each model and dataset.
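As a concrete illustration, a request configured for determinism might look like the following sketch against the OpenAI Chat Completions API; the model name, seed, and prompt are placeholders, and this is not the authors' actual harness.

```python
# Minimal sketch of a request configured for determinism (not the
# authors' actual harness). Model name, seed, and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    temperature=0,   # greedy decoding
    top_p=1,         # no nucleus truncation
    seed=42,         # best-effort reproducibility hint
)
print(response.choices[0].message.content)
```

Even with this configuration, as the paper shows, repeated calls are not guaranteed to return identical strings.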

Metrics included the following (a computation sketch follows the list):

  • Minimum, median, and maximum accuracy values across runs.
  • Minimum-maximum spread in accuracy.
  • Total Agreement Rate (TAR) with two submetrics:
    • TARa (agreement on the parsed answer)
    • TARr (agreement on the raw model response)
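To make the agreement metrics concrete, here is a minimal sketch of how TARr@N and TARa@N might be computed; `parse_answer` is a hypothetical stand-in for a task-specific answer extractor, not the paper's parser.

```python
# Minimal sketch of the agreement metrics (not the authors' reference
# implementation). `parse_answer` is a hypothetical stand-in for a
# task-specific answer extractor.

def all_agree(outputs: list[str]) -> float:
    """1.0 if all N outputs for one question are identical, else 0.0."""
    return 1.0 if len(set(outputs)) == 1 else 0.0

def tar_at_n(runs_per_question: list[list[str]]) -> float:
    """TAR@N: fraction of questions whose N run outputs all agree.

    runs_per_question[i][j] is the output of run j on question i.
    """
    return sum(all_agree(q) for q in runs_per_question) / len(runs_per_question)

# TARr@N scores raw completions; TARa@N scores parsed-out answers.
raw = [["The answer is B.", "The answer is B.", "Answer: B"]]
parse_answer = lambda s: s.rstrip(".").split()[-1]  # hypothetical parser
parsed = [[parse_answer(o) for o in question] for question in raw]

print(tar_at_n(raw))     # TARr@3 = 0.0 (raw strings differ)
print(tar_at_n(parsed))  # TARa@3 = 1.0 (parsed answers agree)
```

The example shows why TARa@N is typically higher than TARr@N: superficially different completions can still parse to the same answer.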

Numerical Results

The study revealed variability in the deterministic properties of LLMs:

  • GPT-3.5 Turbo showed high stability, with TARa scores frequently reaching near 100%.
  • GPT-4o and Llama-3-70B-Instruct, among others, exhibited significant variability in agreement rates and accuracy metrics.
  • For example, GPT-4o achieved a median TARr of only 3%, while GPT-3.5 Turbo managed a much more stable 97%.

Specific tasks such as college math and geometric shapes showed greater variability in output stability compared to tasks like European history.

Implications of Findings

Practical Implications:

  • Commercial Use: Non-deterministic outputs can severely affect customer satisfaction, reproducibility, and reliability of AI services.
  • System Design: Developers need to design systems to handle variability, particularly when high precision and reliability are necessary.

Theoretical Implications:

  • The non-normal distribution of accuracy variations suggests underlying complexities in floating-point computation or other intrinsic properties of the LLMs (a minimal normality-check sketch follows this list).
  • Understanding the task-dependent nature of stability can drive more nuanced benchmarking and model development efforts.
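For instance, one quick way to probe whether per-run accuracies deviate from normality is a Shapiro-Wilk test over the accuracy values collected across runs; the numbers below are illustrative, not the paper's data.

```python
# Illustrative normality check on per-run accuracy values; the numbers
# below are made up, not the paper's data.
from scipy.stats import shapiro

run_accuracies = [0.62, 0.62, 0.71, 0.48, 0.62]  # hypothetical accuracies
stat, p_value = shapiro(run_accuracies)
print(f"Shapiro-Wilk W={stat:.3f}, p={p_value:.3f}")
# A small p-value suggests rejecting normality; with only a handful of
# runs the test has little power, so treat it as a sanity check only.
```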

Speculation on Future Developments

The paper's findings point toward several avenues for future research:

  • Investigating the root causes of non-deterministic outputs and developing methods to minimize them.
  • Expanding the range of tasks and models analyzed to understand stability across different domains comprehensively.
  • Developing new evaluation benchmarks that account for variability in model responses.
  • Improving parsers for answer extraction to handle variations better.

Conclusion

This paper makes a substantial contribution by quantifying the stability of LLMs, revealing non-deterministic behavior even under conditions optimized for consistency. Such findings necessitate a re-evaluation of how benchmarks assess model performance and how these models are implemented in critical applications. Future work could explore methodologies to enhance stability further and understand the factors contributing to variability.

By shedding light on a critical aspect of LLM functionality, this study provides a foundation for more robust and reliable AI systems, ultimately advancing both academic understanding and practical applications of LLM technologies.
