
The Effect of Sampling Temperature on Problem Solving in Large Language Models (2402.05201v3)

Published 7 Feb 2024 in cs.CL and cs.AI

Abstract: In this research study, we empirically investigate the effect of sampling temperature on the performance of LLMs on various problem-solving tasks. We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.6. Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these results appear to generalize across LLMs, prompt-engineering techniques, and problem domains. All code, data, and supplemental materials are available on GitHub at: https://github.com/matthewrenze/jhu-LLM-temperature

The Effect of Sampling Temperature on Problem Solving in LLMs

Introduction

LLMs have become widely used tools for many applications, including problem-solving tasks. Deploying these models effectively involves tuning inference hyperparameters, such as the sampling temperature, which controls the variability of the model's output. Despite the widespread use of LLMs, there is little empirical evidence guiding the optimal settings for these hyperparameters, particularly with respect to how they affect problem-solving capabilities. This paper addresses that gap by systematically investigating the effect of sampling temperature on LLM performance across a range of problem-solving tasks.
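
Concretely, the sampling temperature rescales the model's output logits before the softmax during decoding: higher temperatures flatten the token distribution, while a temperature of 0 is conventionally treated as greedy decoding. The snippet below is a minimal NumPy sketch of this mechanism, not code from the paper:

```python
# Minimal sketch of temperature-scaled sampling (illustrative, not from the paper).
import numpy as np

def sample_with_temperature(logits, temperature, rng=np.random.default_rng()):
    """Sample a token index from logits scaled by the given temperature."""
    if temperature == 0.0:
        # Temperature 0 is conventionally treated as greedy (argmax) decoding.
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

# The same logits sampled at low vs. high temperature:
logits = [2.0, 1.0, 0.2]
print(sample_with_temperature(logits, 0.2))   # almost always index 0
print(sample_with_temperature(logits, 1.6))   # noticeably more varied
```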

Methodology

The study built a multiple-choice question-and-answer (MCQA) exam by sampling problems from standard LLM benchmarks, covering a broad spectrum of domains from math and science to law and medicine. Using several popular LLMs, including GPT-3.5, GPT-4, Llama 2 7B, and Llama 2 70B, the paper evaluated performance with five distinct prompt-engineering techniques while sweeping the sampling temperature from 0.0 to 1.6, analyzing how these variations influence problem-solving accuracy.
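
The experimental loop can be pictured roughly as follows. This is a hypothetical sketch of the sweep described above; `ask_model` stands in for whichever chat-completion client is used and is not an API from the paper's repository:

```python
# Hypothetical sketch of the temperature sweep: evaluate one model with one
# prompt template on an MCQA exam at a fixed temperature, then repeat the
# call across temperatures, models, and prompt-engineering techniques.
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def evaluate(model_name, exam, prompt_template, temperature, ask_model):
    """Return accuracy of `model_name` on `exam` at one temperature setting."""
    correct = 0
    for question in exam:                       # each item: text, choices, answer
        prompt = prompt_template.format(question=question["text"],
                                        choices=question["choices"])
        reply = ask_model(model_name, prompt, temperature=temperature)
        if reply.strip().upper().startswith(question["answer"]):
            correct += 1
    return correct / len(exam)

# results[t] would hold one model/technique accuracy per temperature, e.g.:
# results = {t: evaluate("gpt-4", exam, template, t, ask_model) for t in TEMPERATURES}
```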

Results

Contrary to common anecdotal beliefs, the findings revealed no statistically significant impact of sampling-temperature adjustments within the 0.0 to 1.0 range on the problem-solving accuracy of LLMs. This observation held across models, prompt-engineering techniques, and problem domains. When examining measures beyond accuracy, the paper found that higher temperatures produce greater output variability, as expected, without improving the correctness of answers.
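
The paper's exact statistical analysis is documented in its repository; as one illustration of how such a null result might be checked, a chi-squared test of independence over a temperature-by-correctness contingency table could be used (an assumed approach for illustration, not necessarily the authors' procedure):

```python
# Illustrative significance check (not the paper's exact analysis): test whether
# per-question correctness differs across temperature settings.
import numpy as np
from scipy.stats import chi2_contingency

def temperature_effect_test(correct_counts, total_counts):
    """correct_counts[i], total_counts[i]: results at the i-th temperature."""
    correct = np.asarray(correct_counts)
    incorrect = np.asarray(total_counts) - correct
    table = np.stack([correct, incorrect], axis=1)  # rows: temperatures
    chi2, p_value, _, _ = chi2_contingency(table)
    return chi2, p_value

# Made-up counts: near-identical accuracy at every temperature yields a large
# p-value, i.e. no evidence of a temperature effect on correctness.
chi2, p = temperature_effect_test([81, 80, 82, 79], [100, 100, 100, 100])
print(f"chi2={chi2:.2f}, p={p:.3f}")
```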

Discussion and Implications

The paper's results challenge the common practice of tuning the sampling temperature in the expectation of improving LLM problem-solving performance. For developers and researchers, this implies that the time and resources typically spent tweaking this parameter might be better spent on other areas of model optimization or application development. The results also contribute to a broader understanding of LLM behavior, suggesting that within this temperature range, the trade-off between greater output diversity at higher temperatures and the associated risk of erratic or hallucinated text does not substantially affect the models' ability to answer MCQA problems correctly.

Future Directions

While these findings are informative, they also point to the need for further research. Future work could extend beyond MCQA tasks to more open-ended problem-solving scenarios, explore a wider array of LLMs and problem domains, or examine whether higher sampling temperatures can improve creativity and output novelty without compromising factual accuracy.

Conclusion

This empirical investigation concludes that varying an LLM's sampling temperature within the 0.0 to 1.0 range does not significantly affect its problem-solving accuracy on MCQA tasks. The finding holds across models, prompt-engineering strategies, and domains, providing practical guidance for applying LLMs in AI systems and prompting a reevaluation of the emphasis placed on sampling-temperature tuning in current deployment practices.

Authors
  1. Matthew Renze
  2. Erhan Guven

Citations: 35