Structural stability of GPT‑o1 under near‑Bayesian behavior

Determine whether the structural logit decision model is structurally stable for GPT‑o1 in the El‑Gamal–Grether Wisconsin experimental designs. This involves two checks: testing whether the model’s parameters are invariant across the 6‑ball (p_A=2/3, p_B=1/2) and 7‑ball (p_A=0.4, p_B=0.6) designs, and verifying that the structural logit specification, in its near‑noiseless Bayesian subcase, almost perfectly describes GPT‑o1’s choices when estimated on one design and evaluated on the other.
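To make the objects in this problem concrete, the following is a minimal sketch of a Bayesian posterior and a logit choice rule for the two designs. It rests on several assumptions that are not stated in this summary: that p_A and p_B are the per-draw probabilities of a marked ball in cages A and B, that draws are independent, that the number of draws and the prior are free inputs, and that the logit index is linear in the log-likelihood ratio and the log prior odds with hypothetical coefficients (b0, b1, b2). The paper's exact specification may differ.

```python
import math

# Per-draw probability of a marked ball in cage A and cage B (from the problem statement).
DESIGNS = {"6-ball": (2 / 3, 1 / 2), "7-ball": (0.4, 0.6)}


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood_ratio(n, draws, p_a, p_b):
    """log P(n marked | A) - log P(n marked | B); binomial coefficients cancel."""
    return n * math.log(p_a / p_b) + (draws - n) * math.log((1 - p_a) / (1 - p_b))


def log_prior_odds(prior_a):
    """log of prior odds in favor of cage A."""
    return math.log(prior_a / (1 - prior_a))


def bayes_posterior_a(n, draws, prior_a, p_a, p_b):
    """Exact Bayesian posterior probability that the sample came from cage A."""
    return sigmoid(log_likelihood_ratio(n, draws, p_a, p_b) + log_prior_odds(prior_a))


def logit_choice_prob_a(n, draws, prior_a, p_a, p_b, beta):
    """Assumed logit rule: P(choose A) = sigmoid(b0 + b1*LLR + b2*LPR)."""
    b0, b1, b2 = beta
    index = (b0
             + b1 * log_likelihood_ratio(n, draws, p_a, p_b)
             + b2 * log_prior_odds(prior_a))
    return sigmoid(index)


# Hypothetical illustration: 6 draws, equal prior, a near-noiseless Bayesian
# (zero bias, equal and large weights on both pieces of evidence).
near_noiseless = (0.0, 25.0, 25.0)
for name, (p_a, p_b) in DESIGNS.items():
    for n in range(7):
        post = bayes_posterior_a(n, 6, 0.5, p_a, p_b)
        choice = logit_choice_prob_a(n, 6, 0.5, p_a, p_b, near_noiseless)
        print(f"{name} n={n} posterior={post:.3f} P(choose A)={choice:.3f}")
```

Under this parameterization, beta = (0, 1, 1) reproduces the Bayesian posterior exactly, and scaling both slope coefficients up pushes the choice probability toward 0 or 1 on whichever side of one half the posterior falls. Structural stability would then mean that a single beta describes choices in both designs.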

Background

The paper evaluates whether human subjects and various versions of ChatGPT behave like Bayesian decision makers in binomial classification tasks, using a structural logit model to capture decision rules and potential noise. For human subjects and for GPT‑4 and GPT‑4o, likelihood ratio tests reject structural stability: the parameters that fit the 6‑ball design are significantly different from those that fit the 7‑ball design, though the rejection weakens for more advanced GPT versions.
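A likelihood ratio test of structural stability can be sketched as follows, assuming the standard pooled-versus-separate comparison: the restricted model imposes one parameter vector on both designs, the unrestricted model fits each design separately, and the statistic is compared with a chi-square distribution whose degrees of freedom equal the number of parameters per design. The function names, the three-parameter count, and the log-likelihood values in the usage line are hypothetical, and the paper's exact test construction may differ.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1.

    Uses the closed forms for df = 1 and df = 2 and the recurrence
    Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    """
    if x <= 0:
        return 1.0
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    else:
        q, k = math.exp(-x / 2), 2
    while k < df:
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def structural_stability_lr_test(ll_design_6, ll_design_7, ll_pooled, n_params):
    """Return (LR statistic, p-value) for H0: same parameters in both designs.

    ll_design_6, ll_design_7: maximized log-likelihoods from separate fits.
    ll_pooled: maximized log-likelihood with one parameter vector for both.
    n_params: parameters per design, which is also the number of restrictions.
    """
    lr = 2.0 * ((ll_design_6 + ll_design_7) - ll_pooled)
    return lr, chi2_sf(lr, n_params)


# Hypothetical log-likelihoods, for illustration only (not results from the paper).
lr_stat, p_value = structural_stability_lr_test(-100.0, -120.0, -225.0, n_params=3)
print(f"LR = {lr_stat:.2f}, p = {p_value:.4f}")
```

A small p-value rejects structural stability. For a near-noiseless decision maker the separate and pooled log-likelihoods would both be close to zero, so the statistic would be small and the test would fail to reject.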

The authors conjecture that GPT‑o1, which they observe behaving nearly like a perfect Bayesian decision maker in limited tests, might achieve structural stability. In that case the structural logit parameters would remain invariant across the two designs and the model would provide an almost perfect fit, consistent with near‑noiseless Bayesian behavior.

References

We conjecture that for GPT-o1, as it converges to nearly perfect Bayesian behavior, we will see structural stability and the structural logit model (the subcase of a near noiseless Bayesian) will almost perfectly describe its behavior.

Who is More Bayesian: Humans or ChatGPT? (2504.10636 - Mu et al., 14 Apr 2025) in Section 5, Analysis of the Wisconsin Experiments in GPTs