Causes of LLM instability under deterministic settings
Determine the precise causes of the response instability observed in leading API-hosted large language models (OpenAI GPT-4o, Anthropic Claude 3.5, and Google Gemini 1.5) when they are repeatedly queried with identical legal questions under nominally deterministic inference settings (temperature=0, a fixed seed, and top-p or top-k set to 1.0). Specifically, ascertain whether the instability arises from nondeterministic floating-point accumulation order, from heterogeneous hardware or floating-point implementations across serving machines, or from parallelization-induced variation in execution order.
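The floating-point hypothesis rests on a well-known fact: floating-point addition is not associative, so changing the order in which partial sums are accumulated (as can happen with parallel reductions or atomic adds on GPUs) changes the result at the rounding level. A minimal, self-contained Python sketch of this effect (the specific values are illustrative, not drawn from any model):

```python
# Floating-point addition is not associative: regrouping the same
# three terms yields different IEEE 754 double results.
left_to_right = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right_to_left = 0.1 + (0.2 + 0.3)   # 0.6
print(left_to_right == right_to_left)  # False

# Summation order matters even more when magnitudes differ widely:
# small terms can be absorbed by large ones before cancellation occurs.
data = [1e16, 1.0, -1e16, 1.0]
print(sum(data))          # 1.0  (the first 1.0 is absorbed by 1e16)
print(sum(sorted(data)))  # 0.0  (both 1.0s are absorbed by -1e16)
```

If a serving stack performs reductions (e.g. over attention scores or logits) in a run-to-run-varying order, two "identical" forward passes can produce logits that differ in the last bits, occasionally flipping the argmax token even at temperature=0.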
References
Why are some leading LLMs unstable, even with temperature=0 and setting the seed? It is impossible to say for sure, since the models are proprietary.
— LLMs Provide Unstable Answers to Legal Questions
(arXiv:2502.05196, Blair-Stanek et al., 28 Jan 2025), Section 2 (Related Work)