Explaining o1’s relative stability compared to gpt-4o
Determine which factors make OpenAI o1 appear more stable than OpenAI gpt-4o on the legal-instability dataset, and in particular whether the difference stems from o1’s deeper reasoning mechanism or from differences in how its API calls are parallelized and ordered during execution.
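As a rough illustration of what "more stable" could mean operationally, the sketch below scores stability from repeated answers to the same question. It assumes a question counts as stable only when every repeated run returns the same answer; that definition, the function names (`modal_agreement`, `is_stable`, `stable_fraction`), and the input layout are hypothetical choices for this example, not the paper's code or metric. It measures only the observed difference between models and cannot by itself separate the reasoning hypothesis from the parallelism hypothesis.

```python
from collections import Counter


def modal_agreement(answers):
    """Fraction of repeated runs that match the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def is_stable(answers):
    """Assumed definition: stable means all repeated runs agree."""
    if not answers:
        raise ValueError("need at least one answer")
    return len(set(answers)) == 1


def stable_fraction(runs_by_model):
    """Map each model name to the fraction of its questions that are stable.

    runs_by_model has the hypothetical shape
    {model_name: {question_id: [answer_run_1, answer_run_2, ...]}}.
    """
    return {
        model: sum(is_stable(a) for a in questions.values()) / len(questions)
        for model, questions in runs_by_model.items()
    }
```

Given repeated answers collected under identical prompts and settings for each model, `stable_fraction` returns one number per model that can be compared directly, while `modal_agreement` gives a per-question view of how far an unstable question departs from unanimity.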
References
We have several hypotheses for why. Perhaps the deeper reasoning mechanism of o1 leads to more consistent outcomes. Or, o1 API calls might be broken into parallel processes in different ways that increase stability. Given the closed nature of both o1 and gpt-4o, we cannot test these hypotheses.
— LLMs Provide Unstable Answers to Legal Questions
(2502.05196 - Blair-Stanek et al., 28 Jan 2025) in Section 6 (Experiments with o1)