Exploring the Impact of Temperature on Large Language Models: Hot or Cold? (2506.07295v1)

Published 8 Jun 2025 in cs.CL

Abstract: The sampling temperature, a critical hyperparameter in LLMs, modifies the logits before the softmax layer, thereby reshaping the distribution of output tokens. Recent studies have challenged the Stochastic Parrots analogy by demonstrating that LLMs are capable of understanding semantics rather than merely memorizing data and that randomness, modulated by sampling temperature, plays a crucial role in model inference. In this study, we systematically evaluated the impact of temperature in the range of 0 to 2 on data sets designed to assess six different capabilities, conducting statistical analyses on open source models of three different sizes: small (1B--4B), medium (6B--13B), and large (40B--80B). Our findings reveal distinct skill-specific effects of temperature on model performance, highlighting the complexity of optimal temperature selection in practical applications. To address this challenge, we propose a BERT-based temperature selector that takes advantage of these observed effects to identify the optimal temperature for a given prompt. We demonstrate that this approach can significantly improve the performance of small and medium models in the SuperGLUE datasets. Furthermore, our study extends to FP16 precision inference, revealing that temperature effects are consistent with those observed in 4-bit quantized models. By evaluating temperature effects up to 4.0 in three quantized models, we find that the Mutation Temperature -- the point at which significant performance changes occur -- increases with model size.

Summary

  • The paper demonstrates that sampling temperature significantly modulates LLM performance, with effects varying across model sizes and tasks.
  • The paper reveals that while creative tasks benefit from higher temperatures, deterministic tasks like translation and summarization perform best at lower settings.
  • The paper introduces a BERT-based adaptive temperature selector that dynamically optimizes inference parameters, notably enhancing outputs for small and medium models.

Impact of Temperature on LLMs: An Analytical Overview

The paper "Exploring the Impact of Temperature on LLMs: Hot or Cold?" methodically assesses how sampling temperature influences LLM performance across a comprehensive range of tasks. Because temperature rescales the logits before the softmax is applied, it directly reshapes the output token distribution at inference time; the study traces the consequences of this reshaping across models of varying scale.
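To make the mechanism concrete, here is a minimal NumPy sketch of temperature scaling, assuming the standard formulation (logits divided by T before softmax); the logit values are illustrative, not drawn from the paper:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into a token distribution, rescaled by temperature.

    T < 1 sharpens the distribution (more deterministic);
    T > 1 flattens it (more random); T -> 0 approaches greedy decoding.
    """
    scaled = logits / max(temperature, 1e-8)  # guard against division by zero at T = 0
    scaled -= scaled.max()                    # subtract max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # illustrative logits for four tokens
for t in (0.2, 1.0, 2.0):
    print(f"T={t}: {softmax_with_temperature(logits, t).round(3)}")
```

Running this shows the probability mass concentrating on the top token at T = 0.2 and spreading toward uniform at T = 2.0, which is precisely the dial the paper sweeps.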

Methodological Framework

The paper is structured around three core research questions: how temperature affects LLM capabilities, whether its effects are uniform across models, and how to determine an optimal temperature for each task and prompt. The authors use a diverse set of datasets testing six core abilities: Causal Reasoning (CR), Creativity (CT), In-Context Learning (ICL), Instruction Following (IF), Machine Translation (MT), and Summarization (SUMM). Each task is scored with a distinct metric (Top-1 Accuracy, TTCW Accuracy, Classification Score, Decomposed Requirements Following Ratio, Normalized spBLEU, and ROUGE-L F1), enabling quantifiable comparisons across temperatures from 0 to 2.

Three categories of model size were considered: small (1B-4B), medium (6B-13B), and large (40B-80B). The evaluation protocol runs three iterations per model with distinct random seeds to account for the stochasticity of LLM outputs. Beyond temperature itself, the study also investigates how other sampling parameters, such as Top-K, Top-P, and repetition penalties, modulate these effects.
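As a hedged illustration of how these parameters interact, below is a minimal NumPy sketch of one common ordering (temperature scaling first, then Top-K and Top-P filtering); this mirrors standard practice rather than the exact pipeline of the evaluated models, and the repetition penalty is omitted for brevity:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Sample one token id after temperature scaling and Top-K / Top-P filtering."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-8)   # temperature scaling
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    if top_k > 0:                              # keep the k most probable tokens
        cutoff = np.sort(probs)[-top_k]        # (ties may retain a few extra)
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p < 1.0:                            # smallest nucleus with mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        nucleus = np.zeros_like(probs)
        nucleus[keep] = probs[keep]
        probs = nucleus

    probs /= probs.sum()                       # renormalize after filtering
    return rng.choice(len(probs), p=probs)
```

Passing `rng=np.random.default_rng(seed)` with three distinct seeds reproduces the style of repeated-run protocol described above.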

Significant Findings

The analysis reveals heterogeneous impacts of temperature adjustments based on the model size and task. Key findings demonstrate that:

  1. Temperature Variation Across Models: Larger models exhibit greater resilience to temperature changes, maintaining consistent outputs across a wider range of temperatures than smaller models. This underscores the role of model scale in achieving stable performance under varied inference conditions.
  2. Task-Specific Temperature Performance:
    • Causal Reasoning and In-Context Learning show slight gains at moderately elevated temperatures, whereas Machine Translation and Summarization generally decline, consistent with the latter tasks rewarding determinism over diversity.
    • Creativity benefits significantly from elevated temperatures, suggesting that randomness facilitates the generation of novel outputs that fulfill creative benchmarks.
  3. Adaptive Temperature Selection: The proposal and validation of a BERT-based temperature selector provide a framework for dynamically choosing the inference temperature per prompt (see the sketch after this list). The approach yields notable improvements for small and medium-sized models on the SuperGLUE benchmark, supporting its practicality for tuning inference parameters to maximize model efficacy.
  4. Correlation Analysis and Limitations: Statistical correlation measures offer insights into optimizing temperature settings, while evaluations extending the temperature range to 4.0 unveil thresholds where significant performance degradation occurs, termed "mutation temperatures." The robustness of larger models suggests potential for mitigating these effects through increased model scale and precision adjustments.
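The paper's selector is BERT-based, but its exact architecture, training data, and candidate temperatures are detailed in the paper itself; the sketch below is only one plausible framing, treating selection as classification over an assumed temperature grid, with the `bert-base-uncased` checkpoint and the linear head being assumptions of this illustration:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical candidate grid; the paper's actual temperature bins may differ.
TEMPERATURES = [0.0, 0.5, 1.0, 1.5, 2.0]

class TemperatureSelector(torch.nn.Module):
    """Classify a prompt into one of several candidate temperature bins."""

    def __init__(self, backbone: str = "bert-base-uncased"):  # assumed checkpoint
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, len(TEMPERATURES))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]   # [CLS] embedding summarizes the prompt
        return self.head(cls)               # logits over temperature bins

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
selector = TemperatureSelector()
batch = tokenizer(["Translate this sentence into French: ..."],
                  return_tensors="pt", truncation=True)
with torch.no_grad():
    bin_logits = selector(batch["input_ids"], batch["attention_mask"])
chosen_t = TEMPERATURES[bin_logits.argmax(dim=-1).item()]
```

One plausible training setup would label each prompt with the temperature bin that empirically maximized its task metric and minimize cross-entropy over those labels, though the paper should be consulted for the authors' actual procedure.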

Implications and Future Directions

This paper contributes substantively to a nuanced understanding of how inference parameters, particularly temperature, shape LLM output across multiple dimensions. It highlights considerations for practical deployments, for optimizing model performance in specific applications, and for automating parameter adjustment via selectors. Future research could expand the scope to other LLM tasks and abilities, further refine automated adaptability for real-time applications, and probe the underlying mechanisms behind these temperature effects.

Engaging with these aspects could enable a more deliberate, strategic use of LLMs across diverse AI applications, enhancing their flexibility, efficiency, and overall utility.
