- The paper identifies numerical precision as a key factor affecting LLM reproducibility, showing that FP32 inference nearly eliminates rounding-induced output divergence.
- It demonstrates how runtime configurations like batch size, GPU count, and hardware type lead to significant variations in reasoning outcomes.
- The paper introduces LayerCast, a lightweight inference pipeline that computes in FP32 while storing weights in 16-bit to balance efficiency and stability.
Challenges and Solutions for Reproducible Reasoning in LLMs
The paper "Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning" addresses a critical issue in the evaluation and deployment of LLMs: reproducibility. As LLMs become more embedded in diverse applications such as chatbots, coding tools, and healthcare agents, ensuring accurate and consistent evaluation metrics becomes essential. This research systematically explores how numerical precision affects LLM reproducibility, emphasizing the severe implications of precision errors during inference, especially in reasoning models.
Overview of Findings
LLMs are known for their impressive benchmark performance, but the paper highlights a concerning fragility in the reproducibility of those benchmarks. The authors demonstrate that system configuration, including evaluation batch size, GPU count, and GPU type, can significantly alter the responses an LLM generates. This variability is particularly troublesome in reasoning models, where an early rounding difference can cascade into a divergent chain of thought and, ultimately, a different answer. For example, under BF16 precision with greedy decoding, models such as DeepSeek-R1-Distill-Qwen-7B exhibit up to 9% variation in accuracy, with response lengths shifting by thousands of tokens across hardware configurations.
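The mechanism behind this cascade, rounding that depends on the order of floating-point accumulation, can be seen in a few lines of PyTorch. The sketch below is our own illustration, not code from the paper: it sums the same values in two different orders, mimicking the way a kernel's reduction tree changes when batch size or GPU count changes.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096)

def sequential_sum(t: torch.Tensor) -> torch.Tensor:
    """Accumulate one element at a time, in index order."""
    acc = torch.zeros((), dtype=t.dtype)
    for v in t:
        acc = acc + v
    return acc

# Same 4096 numbers, two accumulation orders. In BF16 the results differ
# visibly; in FP32 the gap all but vanishes.
for dtype in (torch.bfloat16, torch.float32):
    t = x.to(dtype)
    seq = sequential_sum(t)                 # strictly sequential order
    tree = t.view(64, 64).sum(dim=1).sum()  # chunked, tree-like order
    print(dtype, (seq - tree).abs().item())
```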
The root cause is the non-associative nature of floating-point arithmetic, whose effects are amplified at low numerical precision: the same values summed in a different order yield a slightly different result, and the reduction order changes with the runtime configuration. To mitigate this, the paper introduces LayerCast, a lightweight inference pipeline that stores weights in 16-bit precision but performs computations in FP32, balancing memory efficiency with numerical stability.
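A minimal PyTorch sketch of the LayerCast idea follows. It is an illustrative approximation based only on the paper's description (16-bit storage, just-in-time upcast to FP32 for each matmul); the `LayerCastLinear` wrapper and the module-swapping helper are our own constructions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerCastLinear(nn.Module):
    """LayerCast-style linear layer: weights are *stored* in bfloat16
    (16-bit memory footprint) but every matmul is *computed* in FP32
    via a just-in-time upcast."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().to(torch.bfloat16),
                                   requires_grad=False)
        self.bias = (None if linear.bias is None else
                     nn.Parameter(linear.bias.detach().to(torch.bfloat16),
                                  requires_grad=False))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Upcast weights and input just-in-time: accumulation now happens
        # in FP32, which greatly shrinks order-dependent rounding error.
        w = self.weight.float()
        b = None if self.bias is None else self.bias.float()
        return F.linear(x.float(), w, b)

def apply_layercast(model: nn.Module) -> nn.Module:
    """Hypothetical usage: swap every nn.Linear in a model for the wrapper."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LayerCastLinear(child))
        else:
            apply_layercast(child)
    return model
```

Because the FP32 copy of each weight exists only transiently inside `forward`, the resident memory footprint stays close to that of a 16-bit model while the arithmetic enjoys FP32 stability.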
Detailed Analysis
- Greedy Decoding and Deterministic Output: Contrary to widespread belief, greedy decoding does not guarantee deterministic outputs. The paper presents substantial evidence that reproducibility is fragile under these settings, especially with BF16 precision. By examining the distribution of divergence indices (the first token position at which two generations of the same prompt differ) across numerical precisions, the authors show that adding mantissa bits, for instance by moving to FP32, nearly eliminates the issue. Such findings are a key step toward more reliable evaluation standards; a short sketch for computing this index appears after the list.
- Random Sampling Reproducibility: Under random sampling, numerical precision adds a layer of variance beyond the intended randomness of temperature-based decoding. The BF16 format requires significantly more evaluation runs to reach statistical confidence in results, underscoring its impracticality for robust evaluation relative to FP32 (a back-of-the-envelope illustration also follows the list).
- Runtime Configuration Effects: Controlled experiments isolate batch size, GPU count, and hardware type, showing how each factor independently affects output stability; the divergence-index sketch below is the kind of check such experiments rely on. Notably, larger batch sizes and fewer GPUs generally enhance inference consistency, a key insight for optimizing deployment environments.
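The divergence index referenced above is easy to compute once two token sequences are in hand. In the sketch below, `generate_token_ids` is a hypothetical placeholder for whatever inference engine is under test; only the comparison logic is concrete.

```python
from itertools import zip_longest

def divergence_index(a: list[int], b: list[int]) -> int | None:
    """First position where two token-id sequences differ; None if identical."""
    for i, (x, y) in enumerate(zip_longest(a, b)):
        if x != y:
            return i
    return None

# Hypothetical experiment: same prompt, same greedy decoding, two batch sizes.
# generate_token_ids() stands in for the inference engine being evaluated.
# ids_small = generate_token_ids(prompt, batch_size=1)
# ids_large = generate_token_ids(prompt, batch_size=32)
# print(divergence_index(ids_small, ids_large))
```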
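As for why precision-induced variance makes BF16 evaluation expensive: under a standard normal-approximation bound (our illustration, not the paper's exact statistical procedure), the number of runs needed to pin down mean accuracy within a given margin grows with the square of the run-to-run standard deviation.

```python
import math

def runs_needed(stddev: float, margin: float, z: float = 1.96) -> int:
    """Runs required so a 95% CI on mean accuracy has half-width <= margin,
    assuming run-level accuracies are roughly normal with the given stddev."""
    return math.ceil((z * stddev / margin) ** 2)

# Hypothetical numbers: if BF16 noise doubles the run-to-run stddev of
# accuracy, the required number of runs roughly quadruples.
print(runs_needed(stddev=0.02, margin=0.01))  # 16
print(runs_needed(stddev=0.04, margin=0.01))  # 62
```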
Implications and Future Directions
The implications of this paper are significant for both the theory and the practice of LLM evaluation. The findings urge the research community to adopt standardized practices that account for numerical-precision effects, especially as LLMs are deployed in mission-critical domains where reproducibility is paramount. The proposed LayerCast method shows how a lightweight engineering change can mitigate precision-induced reproducibility issues without prohibitive computational cost.
Future research might extend these analyses to larger models and diverse hardware configurations, offering generalized solutions across different types of architectures beyond transformers. Additionally, fostering discussions around establishing industry-wide standards for numerical precision evaluation will enhance the reliability and trustworthiness of AI models.
Conclusion
Through a comprehensive investigation of numerical precision in LLM inference, the paper identifies significant reproducibility challenges and proposes practical solutions. The demonstrated efficacy of higher precision formats and the innovative LayerCast pipeline equip researchers and practitioners with the tools needed to enhance evaluation reliability. This contributes crucial insights toward bolstering the scientific rigor of LLM deployment and fostering a more resilient framework for future AI developments.