Explaining o1’s relative stability compared to gpt-4o

Determine which factors make OpenAI o1 appear more stable than OpenAI gpt-4o on the legal-instability dataset, distinguishing whether the difference is due to o1's deeper reasoning mechanism or to differences in how API calls are parallelized and ordered.

Background

In additional experiments on a subset of 50 questions, the authors observe that OpenAI o1 (whose temperature is fixed at 1.0) appears more stable than gpt-4o (run at temperature=0), despite o1's higher sampling temperature. They hypothesize either that o1's deeper reasoning improves consistency or that differences in how API calls are parallelized contribute to the observed stability.
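
For concreteness, here is a minimal sketch (not the authors' code) of how per-question stability could be quantified by repeated sampling. The model names, prompt handling, and number of runs are illustrative assumptions; the paper's actual prompts and answer-scoring procedure are not reproduced here.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def stability(question: str, model: str, temperature: float | None = None, runs: int = 10) -> float:
    """Fraction of repeated runs that agree with the modal answer (1.0 = perfectly stable)."""
    answers = []
    for _ in range(runs):
        # o1 does not accept a custom temperature, so only pass one when given.
        kwargs = {} if temperature is None else {"temperature": temperature}
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            **kwargs,
        )
        answers.append(resp.choices[0].message.content.strip())
    counts = Counter(answers)
    return max(counts.values()) / runs

# Hypothetical comparison on one question q:
#   stability(q, model="gpt-4o", temperature=0.0)  # deterministic decoding setting
#   stability(q, model="o1")                       # temperature fixed at 1.0 by the API
```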

Because both models are closed-source and hosted, the authors state they cannot test these hypotheses, leaving the causal explanation unresolved.
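
Server-side batching inside the hosted models is not observable, but client-side parallelism and call order can at least be varied and controlled for. The following hedged sketch issues the same repeated queries concurrently, for comparison against the sequential loop above; the function names and parameters are illustrative assumptions, not part of the paper's setup.

```python
import asyncio
from openai import AsyncOpenAI

aclient = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def one_answer(question: str, model: str) -> str:
    """Single query to the hosted model, returning the raw answer text."""
    resp = await aclient.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content.strip()

async def concurrent_answers(question: str, model: str, runs: int = 10) -> list[str]:
    """Issue all runs concurrently, in contrast to the sequential loop above."""
    return await asyncio.gather(*(one_answer(question, model) for _ in range(runs)))

# answers = asyncio.run(concurrent_answers(q, model="gpt-4o"))
```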

References

We have several hypotheses for why. Perhaps the deeper reasoning mechanism of o1 leads to more consistent outcomes. Or, o1 API calls might be broken into parallel processes in different ways that increase stability. Given the closed nature of both o1 and gpt-4o, we cannot test these hypotheses.

LLMs Provide Unstable Answers to Legal Questions (2502.05196 - Blair-Stanek et al., 28 Jan 2025) in Section 6 (Experiments with o1)