Analysis of Probabilistic Sensitivities in Reasoning-Optimized LLMs: The Case of OpenAI o1
The paper "When a LLM is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1" investigates the performance characteristics of a new LLM system from OpenAI named o1. This model is distinctively optimized for reasoning, unlike its predecessors, which were primarily focused on next-word prediction.
The research examines whether o1, despite being optimized for reasoning, retains the probabilistic sensitivities exhibited by earlier LLMs. The analysis focuses on two such signatures: sensitivity to output probability and sensitivity to task frequency.
Sensitivity to Output Probability
The paper first addresses whether o1 is influenced by the probability of the expected output. This was tested on four task types: decoding shift ciphers, converting messages written in Pig Latin, swapping articles within sentences, and reversing lists of words. Like preceding LLMs, o1 performed better when the correct output was high-probability text: on the shift cipher task, for example, accuracy rose from 47% on low-probability outputs to 92% on high-probability ones. The paper also found that o1 consumes fewer tokens to arrive at an answer when the output is high-probability than when it is low-probability.
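To make the shift cipher setup concrete, here is a minimal sketch in Python. The decoding function follows the standard definition of a shift cipher; the rot-13 shift and the example sentences are illustrative assumptions, not items from the paper's test set.

```python
import string

def shift_decode(ciphertext: str, shift: int) -> str:
    """Decode a shift cipher by moving each letter `shift` positions
    back in the alphabet (lowercase only, for brevity)."""
    alphabet = string.ascii_lowercase
    rotated = alphabet[-shift:] + alphabet[:-shift]  # alphabet rotated back by `shift`
    return ciphertext.translate(str.maketrans(alphabet, rotated))

# With rot-13, a high-probability item decodes to a natural sentence, while a
# low-probability item decodes to the same words in an unlikely order.
print(shift_decode("gur png fng ba gur zng", 13))  # -> "the cat sat on the mat"
print(shift_decode("zng gur ba png fng gur", 13))  # -> "mat the on cat sat the"
```

The accuracy gap between these two kinds of items is what the output-probability analysis measures.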
Sensitivity to Task Frequency
The second question is whether o1's performance varies with how frequently a task variant appears in typical training data. Five task types were evaluated, each in a common and a rare variant. o1 substantially outperformed prior LLMs on the rare variants, showing less sensitivity to task frequency overall. However, additional experiments with more challenging versions of the tasks revealed that o1's performance can still depend on task frequency: for the sorting and shift cipher tasks, increasing the difficulty produced a marked accuracy gap between common and rare variants. Token usage told a similar story: o1 used more tokens on rare variants, especially for the acronyms and shift cipher tasks, even where the accuracy differences were not significant.
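The sketch below illustrates how common and rare variants of the same task can be paired. The specific pairings shown (rot-13 vs. rot-2 shift ciphers, ascending vs. descending sorting) are assumptions drawn from the related "Embers of Autoregression" setup and may not match the paper's exact materials.

```python
# Pairing common and rare variants of the same underlying algorithm.

def make_shift_prompt(ciphertext: str, shift: int) -> str:
    return f"Decode this message, encoded with a shift cipher of {shift}: {ciphertext}"

def make_sort_prompt(numbers: list[int], descending: bool) -> str:
    order = "descending" if descending else "ascending"
    return f"Sort these numbers in {order} order: {numbers}"

# Common variants: rot-13 is ubiquitous in web text; ascending is the default sort.
common = [make_shift_prompt("gur png", 13),
          make_sort_prompt([3, 1, 2], descending=False)]

# Rare variants: the same algorithms, but a shift of 2 and descending order
# appear far less often in training data.
rare = [make_shift_prompt("vjg ecv", 2),
        make_sort_prompt([3, 1, 2], descending=True)]
```

Because each pair requires identical computation, any accuracy or token-usage gap between the lists isolates the effect of task frequency.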
Implications and Conclusion
The findings indicate that while o1 offers improvements over previous LLMs, it retains some tendencies indicative of its roots in next-word prediction. These include:
- Output Probability Sensitivity: higher accuracy and lower token usage on examples whose correct outputs are high-probability text.
- Task Frequency Sensitivity: reduced but still measurable dependence on how common a task variant is, particularly under more challenging conditions.
The observations are consistent with the teleological perspective, which holds that a system's behavior is best understood in terms of the problem it was trained to solve: even models optimized for reasoning can display probabilistic behavioral signatures because they are built on next-word prediction. Future work might explore incorporating probability-invariant modules, such as code-execution mechanisms, to mitigate these sensitivities; a sketch of this idea follows.
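As a concrete illustration of the probability-invariant idea (a sketch of the general approach, not a mechanism described in the paper), a model could emit a short program and delegate execution to an external interpreter; the interpreter's output does not depend on how probable the resulting text is.

```python
def reverse_words(sentence: str) -> str:
    """Reverse word order deterministically; unlike autoregressive decoding,
    correctness here is independent of the output's probability as text."""
    return " ".join(reversed(sentence.split()))

# Both calls are equally reliable, even though the second output is far less
# probable as English text than the first.
print(reverse_words("mat the on sat cat the"))  # -> "the cat sat on the mat"
print(reverse_words("the cat sat on the mat"))  # -> "mat the on sat cat the"
```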
In conclusion, while o1 represents a significant step forward in reasoning for LLMs, its optimization does not entirely erase the probabilistic biases inherited from next-word prediction. This opens avenues for further refinement and for a deeper theoretical understanding of AI system behavior.