
Causes of LLM instability under deterministic settings

Determine the precise causes of the response instability observed in leading API-hosted large language models, including OpenAI gpt-4o, Anthropic Claude-3.5, and Google Gemini-1.5, when they are repeatedly queried with identical legal questions under deterministic inference settings (temperature=0, fixed seed, and top-p or top-k set to 1.0). Specifically, ascertain whether the instability arises from nondeterministic floating‑point accumulation order, heterogeneous hardware or floating‑point implementations across servers, or parallelization-induced variations in execution order.


Background

The paper evaluates the stability of leading proprietary LLMs on a curated dataset of 500 hard legal questions, asking each model the identical prompt 20 times with temperature set to 0 and other parameters configured to be as deterministic as possible. Despite these controls, the models sometimes return conflicting winners (party 1 vs. party 2) for the same question, demonstrating instability.
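The repeated-query protocol can be sketched as follows. This is a minimal illustration, not the authors' code: `ask` stands in for any client wrapping an API call with temperature=0 and a fixed seed, and the stand-in answer sequence is hypothetical.

```python
from collections import Counter

def stability_report(ask, prompt, n=20):
    """Ask the same prompt n times and tally the distinct answers returned.

    `ask` is any callable mapping a prompt string to an answer string; in the
    paper's setup it would wrap an API call with deterministic settings.
    """
    counts = Counter(ask(prompt) for _ in range(n))
    stable = len(counts) == 1  # stable iff every repetition gave the same answer
    return counts, stable

# Usage with a hypothetical stand-in for a real API client:
answers = iter(["party 1"] * 19 + ["party 2"])
counts, stable = stability_report(lambda p: next(answers), "Who wins?", n=20)
# A single dissenting answer out of 20 is enough to mark the question unstable.
```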

The authors discuss plausible technical explanations—such as nondeterministic floating‑point accumulation, heterogeneous hardware differences across cloud servers, or parallelized API execution changing operation order—but emphasize that the closed, proprietary nature of the systems prevents definitive attribution of causes.

References

Why are some leading LLMs unstable, even with temperature=0 and setting the seed? It is impossible to say for sure, since the models are proprietary.

LLMs Provide Unstable Answers to Legal Questions (2502.05196 - Blair-Stanek et al., 28 Jan 2025) in Section 2 (Related Work)