
Who is More Bayesian: Humans or ChatGPT? (2504.10636v1)

Published 14 Apr 2025 in econ.GN, cs.AI, q-fin.EC, and stat.ME

Abstract: We compare the performance of human and artificially intelligent (AI) decision makers in simple binary classification tasks where the optimal decision rule is given by Bayes Rule. We reanalyze choices of human subjects gathered from laboratory experiments conducted by El-Gamal and Grether and Holt and Smith. We confirm that while overall, Bayes Rule represents the single best model for predicting human choices, subjects are heterogeneous and a significant share of them make suboptimal choices that reflect judgement biases described by Kahneman and Tversky that include the "representativeness heuristic" (excessive weight on the evidence from the sample relative to the prior) and "conservatism" (excessive weight on the prior relative to the sample). We compare the performance of AI subjects gathered from recent versions of LLMs including several versions of ChatGPT. These general-purpose generative AI chatbots are not specifically trained to do well in narrow decision making tasks, but are trained instead as "language predictors" using a large corpus of textual data from the web. We show that ChatGPT is also subject to biases that result in suboptimal decisions. However we document a rapid evolution in the performance of ChatGPT from sub-human performance for early versions (ChatGPT 3.5) to superhuman and nearly perfect Bayesian classifications in the latest versions (ChatGPT 4o).

Summary

  • The paper compares human and ChatGPT (GPT-3.5 to GPT-4o) performance on Bayesian decision-making tasks using experimental data.
  • While humans show high decision efficiency with varying biases, ChatGPT models improved rapidly, with GPT-4o achieving superhuman, near-perfect Bayesian performance.
  • Analysis of AI textual responses reveals how models like GPT-4o approach near-perfect Bayesian execution, suggesting AI's potential in rational decision-making.

Comparative Bayesian Rationality of Humans and ChatGPT

This paper examines Bayesian rationality in decision-making by humans and artificial intelligence models, specifically ChatGPT across its iterations from GPT-3.5 to GPT-4o. The research investigates performance in a binary classification task where Bayesian decision-making is theoretically optimal, drawing on human experimental data and replicating those studies with AI subjects.
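
As a reference point, the Bayes-optimal rule in such a binary task can be stated generically (a minimal statement; the specific likelihoods and priors vary across the experiments): given a prior probability pi = P(A) and an observed sample s,

```latex
P(A \mid s) \;=\; \frac{\pi\, L(s \mid A)}{\pi\, L(s \mid A) + (1-\pi)\, L(s \mid B)},
\qquad \text{choose } A \iff P(A \mid s) \ge \tfrac{1}{2}.
```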

Experimental Approach and Analysis

The authors employ two experimental designs: the Wisconsin experiments (replicating El-Gamal and Grether, where participants select between two binary options) and the Holt and Smith experiments (where subjects report posterior probabilities). Human subject data were reanalyzed using a structural logit model to derive subjective beliefs and decision efficiency, overcoming limitations of previous models in capturing heterogeneity and decision noise.
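
The paper's exact likelihood is not reproduced in this summary, but the following is a minimal sketch of a structural logit of this general kind: a subject's subjective posterior, formed with free weights on the sample likelihood ratio and the prior odds (a Grether-style specification), enters a logit choice rule with a noise parameter. All function names and parameter values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def subjective_posterior(lr, prior_odds, beta, gamma):
    """Grether-style subjective posterior for source A:
    posterior odds = LR^beta * prior_odds^gamma.
    beta = gamma = 1 recovers exact Bayes Rule; beta > gamma mimics the
    representativeness heuristic, beta < gamma mimics conservatism."""
    odds = lr**beta * prior_odds**gamma
    return odds / (1.0 + odds)

def choice_prob_A(lr, prior_odds, beta, gamma, lam):
    """Structural logit choice rule: the probability of choosing A rises
    with the subjective posterior; lam scales decision noise (lam -> inf
    approaches the deterministic cutoff 'choose A iff P(A|s) >= 1/2')."""
    p = subjective_posterior(lr, prior_odds, beta, gamma)
    return 1.0 / (1.0 + np.exp(-lam * (p - 0.5)))

def log_likelihood(params, lrs, prior_odds, choices):
    """Log likelihood of observed binary choices (1 = chose A)."""
    beta, gamma, lam = params
    choices = np.asarray(choices)
    p = choice_prob_A(np.asarray(lrs), np.asarray(prior_odds),
                      beta, gamma, lam)
    return np.sum(choices * np.log(p) + (1 - choices) * np.log(1.0 - p))
```

Estimating (beta, gamma, lam) subject by subject is one way to separate belief biases from decision noise, which is the distinction the structural reanalysis turns on.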

ChatGPT models underwent analogous experimental trials, with prompts modified to mimic human experimental conditions. AI subjects demonstrated reasoning in textual responses, adding a layer of analysis to identify where artificial decisions diverged from Bayesian rationality.
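
The paper's prompts are not quoted in this summary; the snippet below is a hypothetical illustration of how a single trial might be posed to a chat model via the OpenAI Python client. The prompt wording, cage composition, prior, and sample are assumptions for illustration, not the authors' protocol.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical trial: prior P(Cage A) = 1/3; Cage A holds 4 N and 2 G
# balls, Cage B holds 3 N and 3 G; the subject sees 6 draws.
prompt = (
    "A cage was selected at random: Cage A with probability 1/3, Cage B "
    "with probability 2/3. Cage A contains 4 balls marked N and 2 marked "
    "G; Cage B contains 3 marked N and 3 marked G. Six balls were drawn "
    "with replacement: N, N, G, N, N, G. Which cage do you think the "
    "balls were drawn from? Explain your reasoning, then answer 'A' or 'B'."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # free-text reasoning + answer
```

Because the model must explain itself before answering, each transcript can be scored both on the final choice and on where, if anywhere, the reasoning departs from Bayes Rule.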

Key Findings

  1. Human Subjects: Despite inherent heterogeneity, humans generally exhibit high decision efficiency, approaching 96% in some experiments. Categorizing subject behaviors revealed significant evidence of non-Bayesian decision-making, with biases such as the representativeness heuristic and conservatism prevalent (a toy computation after this list shows how these biases can flip a decision). Notably, a subset of humans behaved very nearly as Bayesians, albeit with decision rules subject to varying levels of noise.
  2. AI Subjects: ChatGPT models show a rapid evolution in decision efficiency, starting with sub-human performance in GPT-3.5, transitioning to near-human capability in GPT-4, and achieving superhuman proficiency in GPT-4o. While earlier iterations ignored prior information (undermining their Bayesian calculations), later versions increasingly approximated Bayesian principles in both conceptual reasoning and numerical execution. GPT-4o notably achieved near-perfect posterior calculations with reduced noise, underscoring AI's potential to surpass human performance as models continue to advance.
  3. Textual Analysis: An advantage of studying GPT is the explicit reasoning displayed in its textual responses, which enables identification of logic errors at each stage of the computation, such as miscalculated prior probabilities and inconsistencies between stated beliefs and the final decision. GPT-4o's responses document the transformation from merely conceptual understanding to nearly perfect Bayesian execution, with a further decline in erroneous decision-making attributed to parameter tuning and model improvements.
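
To make the bias taxonomy in item 1 concrete, here is a toy computation, with assumed parameter values, showing how the three behavior types can disagree on the same trial:

```python
# Illustrative Grether-style updating rule (same form as the earlier
# sketch): subjective posterior odds = LR^beta * prior_odds^gamma.
def posterior(lr, prior_odds, beta, gamma):
    odds = lr**beta * prior_odds**gamma
    return odds / (1.0 + odds)

# Toy trial: prior odds for A of 0.5 (P(A) = 1/3) and sample likelihood
# ratio LR = 4. Exact Bayes: odds = 2, so P(A|s) = 2/3 -> choose A.
lr, prior_odds = 4.0, 0.5

for label, beta, gamma in [
    ("Bayesian",           1.0, 1.0),
    ("representativeness", 1.5, 0.5),  # overweights the sample evidence
    ("conservatism",       0.5, 1.5),  # overweights the prior
]:
    p = posterior(lr, prior_odds, beta, gamma)
    print(f"{label:17s} P(A|s) = {p:.3f} -> choose {'A' if p >= 0.5 else 'B'}")
```

Under these assumed weights, the conservative subject flips to B on a trial where Bayes Rule says A, which is exactly the kind of suboptimal choice the structural estimates pick up.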

Implications

The research underscores the transformative potential of AI in rational decision-making tasks, suggesting future AI models may outperform human decision-makers even in complex, high-stakes applications like differential diagnosis. While the paper provides substantial insights into AI behavioral patterns approximating Bayesian reasoning, it advocates for continued calibration of AI models to mitigate residual errors observed even in advanced iterations of ChatGPT.

Conclusions

The paper concludes that the structural logit model can approximate both human and AI decision behavior across varying experimental contexts. The findings recommend structural stability tests across AI model generations, noting that model advances not only deliver improved decision efficiency but also suggest a trajectory toward artificial general intelligence (AGI).

Ultimately, the research posits a future where AI integration into decision-making frameworks is not only beneficial but essential, given AI's ability to execute with human-like or better proficiency and the diminishing biases documented across successive versions. As such, this comparative paper enriches the dialogue on AI rationality and sets a benchmark for evaluating future iterations of AI models.