Replicability of o1 outputs under default temperature and the role of stochasticity

Determine whether the outputs that OpenAI’s o1 model produces for a fixed planning instance at its default temperature of 1.0 are primarily an artifact of stochastic sampling, and characterize what this implies for replicability and interpretability.

Background

The authors note that o1’s default temperature of 1.0 may reduce replicability and interpretability, compounding longstanding concerns about output stability and log-probability variability in OpenAI models. Unless it is clear whether an observed output reflects consistent reasoning or merely stochastic sampling, reliable evaluation and deployment in safety-critical contexts are impeded.

Measuring how strongly stochasticity influences the outputs is essential for determining whether repeated runs on the same instance yield consistent plans and explanations, and for establishing trust in the model’s reasoning behavior.
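
One way to make "consistent across repeated runs" concrete is to collect the outputs of several runs on the same instance and summarize how often they agree. The sketch below is illustrative only and is not the authors' protocol: the function names, the whitespace and case normalization, and the example plans are all assumptions, and collecting the outputs from the model is left to the reader.

```python
from collections import Counter
import math


def normalize_plan(text):
    """Collapse case and whitespace so trivially different renderings compare equal.

    Assumption: plans that differ only in case or spacing count as the same plan.
    """
    return " ".join(text.lower().split())


def replicability_summary(outputs):
    """Summarize how consistent a list of repeated-run outputs is.

    `outputs` holds the raw text returned by each run on one fixed instance.
    """
    if not outputs:
        raise ValueError("need at least one output")
    counts = Counter(normalize_plan(o) for o in outputs)
    n = len(outputs)
    modal_plan, modal_count = counts.most_common(1)[0]
    # Shannon entropy of the empirical plan distribution, in bits.
    # It is 0.0 when every run returns the same plan.
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return {
        "runs": n,
        "distinct_plans": len(counts),
        "modal_plan": modal_plan,
        "modal_agreement": modal_count / n,
        "entropy_bits": entropy,
    }


# Hypothetical outputs from five runs on one instance (made up for illustration).
runs = [
    "unstack A B\nputdown A",
    "Unstack A B  putdown A",
    "unstack A B\nputdown A",
    "pickup B\nstack B A",
    "unstack A B\nputdown A",
]
summary = replicability_summary(runs)
print(summary["distinct_plans"], summary["modal_agreement"])
```

A modal agreement near 1.0 with entropy near 0 would suggest that the result is stable under sampling, while a spread over many distinct plans would suggest that any single run is hard to interpret. Exact string matching after normalization is a deliberately strict choice; comparing plans by validity or by semantic equivalence would need a domain-specific checker.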

References

The current model is also set to a default temperature of 1.0, which further reduces replicability and interpretability--for any given problem, it is never clear whether the result is merely the result of stochasticity.

— LLMs Still Can't Plan; Can LRMs? A Preliminary Evaluation of OpenAI's o1 on PlanBench (2409.13373 - Valmeekam et al., 20 Sep 2024) in Section 3, Accuracy/Cost Tradeoffs and Guarantees, footnote on temperature and stability
