Large Language Models are Near-Optimal Decision-Makers with a Non-Human Learning Behavior (2506.16163v1)

Published 19 Jun 2025 in cs.AI

Abstract: Human decision-making belongs to the foundation of our society and civilization, but we are on the verge of a future where much of it will be delegated to artificial intelligence. The arrival of LLMs has transformed the nature and scope of AI-supported decision-making; however, the process by which they learn to make decisions, compared to humans, remains poorly understood. In this study, we examined the decision-making behavior of five leading LLMs across three core dimensions of real-world decision-making: uncertainty, risk, and set-shifting. Using three well-established experimental psychology tasks designed to probe these dimensions, we benchmarked LLMs against 360 newly recruited human participants. Across all tasks, LLMs often outperformed humans, approaching near-optimal performance. Moreover, the processes underlying their decisions diverged fundamentally from those of humans. On the one hand, our finding demonstrates the ability of LLMs to manage uncertainty, calibrate risk, and adapt to changes. On the other hand, this disparity highlights the risks of relying on them as substitutes for human judgment, calling for further inquiry.

LLMs as Near-Optimal but Non-Human Decision-Makers

This paper presents a systematic empirical investigation into the decision-making abilities of state-of-the-art LLMs compared to humans, evaluating both overall task performance and underlying behavioral mechanisms across classical experimental psychology paradigms. The results yield multiple insights directly relevant to the intersection of AI deployment, behavioral science, and computational modeling of intelligence.

Experimental Design and Analytical Framework

To enable a rigorous and interpretable comparison between LLMs and humans, the authors applied three canonical decision-making tasks, each targeting a distinct cognitive construct:

  • Iowa Gambling Task (IGT): Decision-making under uncertainty, modeling the need to balance exploration and exploitation without explicit information about the underlying reward distributions (a toy payoff sketch follows this list).
  • Cambridge Gambling Task (CGT): Decision-making under explicit risk, with clear information about probabilities but uncertainty in realized outcomes, allowing measurement of risk propensity and probabilistic calibration.
  • Wisconsin Card Sorting Task (WCST): Set-shifting and cognitive flexibility, operationalized by requiring participants (and models) to adaptively update their inference strategies as latent sorting rules change without warning.
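
To make the task structure concrete, below is a minimal Python sketch of an IGT-like four-deck environment. It assumes the classic Bechara-style payoff scheme (decks A/B disadvantageous, C/D advantageous, with B/D carrying infrequent penalties); the study deliberately reworded its tasks and altered payoff values, so these numbers are illustrative only.

```python
import random

# Classic IGT payoff scheme (illustrative only; the paper altered payoff values
# to avoid contamination from the models' training data).
DECKS = {
    # deck: (gain per draw, loss amount, loss probability)
    "A": (100, -250, 0.5),   # disadvantageous: EV -25, frequent losses
    "B": (100, -1250, 0.1),  # disadvantageous: EV -25, rare large loss
    "C": (50, -50, 0.5),     # advantageous:    EV +25, frequent small losses
    "D": (50, -250, 0.1),    # advantageous:    EV +25, rare loss
}

def draw(deck: str, rng: random.Random) -> float:
    """Sample one net payoff from a deck."""
    gain, loss, p_loss = DECKS[deck]
    return gain + (loss if rng.random() < p_loss else 0)

def net_score(choices: list[str]) -> int:
    """Standard IGT net score: (# C or D picks) - (# A or B picks)."""
    good = sum(c in ("C", "D") for c in choices)
    return good - (len(choices) - good)

if __name__ == "__main__":
    rng = random.Random(0)
    choices = [rng.choice("ABCD") for _ in range(100)]   # random policy baseline
    print(net_score(choices), sum(draw(c, rng) for c in choices))
```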

Five leading LLMs (GPT-4o, GPTo4m, Claude 3.5 Sonnet, Gemini 1.5 Pro, DeepSeek-R1) were evaluated alongside 360 demographically diverse human participants. To avoid contamination from memorized task descriptions or prompt-specific artifacts, all task instructions were reworded and payoff structures were systematically altered. Hierarchical Bayesian models, grounded in reinforcement learning, prospect theory, and sequential attention, were employed for computational phenotyping.
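
As a concrete example of this model family, the sketch below implements a PVL-Delta-style learner for the IGT, combining a prospect-theory utility, a delta learning rule, and a softmax choice rule; it reuses the draw() helper from the environment sketch above. The paper's exact model specification and hierarchical priors may differ; the intent here is only to show where the parameters discussed later (learning rate A, outcome sensitivity α, loss aversion λ, choice consistency c) enter.

```python
import math
import random

def pvl_delta_simulate(n_trials=100, A=0.3, alpha=0.5, lam=1.5, c=1.0, seed=0):
    """Simulate a PVL-Delta-style agent on the four-deck environment above.

    A     : learning rate (how quickly expectancies track recent outcomes)
    alpha : outcome sensitivity (curvature of the prospect-theory utility)
    lam   : loss aversion (losses weighted lam times more than gains)
    c     : choice consistency (theta = 3**c - 1 scales the softmax)
    """
    rng = random.Random(seed)
    theta = 3 ** c - 1
    ev = {d: 0.0 for d in "ABCD"}          # learned expectancy per deck
    choices = []
    for _ in range(n_trials):
        weights = [math.exp(theta * ev[d]) for d in "ABCD"]
        pick = rng.choices("ABCD", weights=weights)[0]          # softmax choice
        x = draw(pick, rng)                                     # payoff from the sketch above
        u = x ** alpha if x >= 0 else -lam * abs(x) ** alpha    # prospect-theory utility
        ev[pick] += A * (u - ev[pick])                          # delta-rule update
        choices.append(pick)
    return choices
```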

Key Empirical Findings

1. Superiority of LLM Performance on Normative Metrics

At an aggregate level, LLMs outperformed humans in all three tasks. Notably, LLMs achieved near-optimal policies, sometimes rivaling or exceeding algorithmic baselines (e.g., UCB, epsilon-greedy, and expected utility maximization):

  • IGT: Median net scores for Claude, GPTo4m, and DeepSeek outstripped human cohorts (Cohen’s d ≈ −2.4 to −2.5, p < 0.001).
  • CGT: Except for Gemini, LLMs obtained higher (or at least human-comparable) total points, with some models exhibiting minimal variance (DeepSeek SD ≈ 0.005).
  • WCST: All LLMs except DeepSeek matched or exceeded the human participants' number of correct matches.

LLMs also displayed consistently steeper learning curves (e.g., adoption of advantageous choices in IGT; see Appendix Fig. IGT_fig1_b) and reduced stochasticity in action selection.
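
For intuition about such baselines, here is a sketch of an epsilon-greedy sample-average learner on the same toy environment (reusing draw() and net_score() from the earlier sketch); a UCB variant would add an exploration bonus to the value estimates. The paper's actual baseline configurations are not reproduced here.

```python
import random

def epsilon_greedy(n_trials=100, eps=0.1, seed=0):
    """Epsilon-greedy sample-average learner on the four-deck environment."""
    rng = random.Random(seed)
    q = {d: 0.0 for d in "ABCD"}   # running mean payoff per deck
    n = {d: 0 for d in "ABCD"}
    choices = []
    for _ in range(n_trials):
        if rng.random() < eps:
            pick = rng.choice("ABCD")        # explore
        else:
            pick = max(q, key=q.get)         # exploit the current best estimate
        reward = draw(pick, rng)             # draw() from the earlier sketch
        n[pick] += 1
        q[pick] += (reward - q[pick]) / n[pick]
        choices.append(pick)
    return net_score(choices)
```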

2. Divergence from Humanlike Decision Strategies

Despite LLMs’ aggregate efficacy, their behavioral and computational phenotypes diverged fundamentally from human patterns:

  • IGT: LLM deck preferences were acutely responsive to loss frequency rather than to long-term expected value. For instance, Claude overwhelmingly favored deck D (whose penalties are less frequent), while humans balanced their choices between decks C and D, which carry equal expected value, consistent with established behavioral data.
  • CGT: LLMs lacked adaptive risk adjustment. Humans increased bets in favorable conditions but became cautious in ambiguous or high-risk environments. LLMs—especially DeepSeek and GPTo4m—bet near the maximum across all conditions, showing minimal modulation based on risk profile.
  • WCST: LLMs exhibited more perseverative (rule-sticking) errors, while humans made more non-perseverative (random, inattentive) errors, indicating rigidity rather than inattention as the dominant failure mode in strategy updating (a simplified error scorer is sketched below).
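
The perseverative versus non-perseverative distinction can be made concrete with a simplified scorer: an error counts as perseverative when the response still satisfies the previously correct rule. Full WCST scoring (e.g., Heaton's criteria) involves additional conditions, so this is an illustrative approximation rather than the paper's scoring procedure.

```python
def classify_wcst_errors(trials):
    """Simplified WCST error scoring (not the full Heaton criteria).

    Each trial is a dict with:
      current_rule  : rule in force ("color" | "shape" | "number")
      previous_rule : rule in force before the last shift, or None
      matched_rules : set of rules the chosen card happened to satisfy
    """
    perseverative, non_perseverative = 0, 0
    for t in trials:
        if t["current_rule"] in t["matched_rules"]:
            continue                               # correct response
        if t["previous_rule"] in t["matched_rules"]:
            perseverative += 1                     # stuck on the outdated rule
        else:
            non_perseverative += 1                 # error unrelated to the old rule
    return perseverative, non_perseverative
```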

Further, computational model parameter fits revealed systematic deviations:

  • Learning rate (IGT, A): markedly higher, reflecting stronger exploitation of historical reward patterns.
  • Outcome sensitivity (α): elevated, i.e., stronger reaction to reward and penalty magnitudes.
  • Choice consistency (c, d): higher, with actions more strictly determined by learned expected values.
  • Risk aversion / probability distortion: varied; some LLMs were highly risk-seeking (Claude, GPTo4m), others risk-averse (Gemini), but all distorted probabilities more than humans.
  • Set-shifting sensitivity: LLMs learned and adapted to new rules faster, but with more perseveration.
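
To indicate how such parameter estimates are obtained, the sketch below writes out the (non-hierarchical) likelihood of an observed IGT choice sequence under a PVL-Delta-style model and fits it by maximum likelihood; the paper instead fits hierarchical Bayesian versions, which share the same likelihood core. Variable names and bounds are illustrative.

```python
import math

def pvl_delta_nll(params, choices, payoffs):
    """Negative log-likelihood of an observed IGT sequence under PVL-Delta.

    params  : (A, alpha, lam, c) as described above
    choices : sequence of deck labels, e.g. ["B", "D", ...]
    payoffs : the net outcomes actually received on those trials
    """
    A, alpha, lam, c = params
    theta = 3 ** c - 1
    ev = {d: 0.0 for d in "ABCD"}
    nll = 0.0
    for pick, x in zip(choices, payoffs):
        z = [theta * ev[d] for d in "ABCD"]
        m = max(z)
        log_denominator = m + math.log(sum(math.exp(v - m) for v in z))
        nll -= theta * ev[pick] - log_denominator        # softmax log-probability
        u = x ** alpha if x >= 0 else -lam * abs(x) ** alpha
        ev[pick] += A * (u - ev[pick])
    return nll

# Example usage (requires SciPy); `choices` and `payoffs` would come from a session:
# from scipy.optimize import minimize
# fit = minimize(pvl_delta_nll, x0=[0.3, 0.5, 1.5, 1.0], args=(choices, payoffs),
#                method="L-BFGS-B", bounds=[(0.01, 1), (0.01, 1), (0.1, 5), (0.1, 5)])
```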

3. Robustness of Non-Human Patterns and Lack of Demographic Sensitivity

Robustness tests (prompt variants, temperature changes, context shifts, and role-play instructions) confirmed the stability of LLM choice patterns. Demographic cues and varied contexts had only marginal effects on LLM behavior, unlike humans, whose decisions are sensitive to age, gender, and cultural context.
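
A robustness protocol of this kind can be sketched as a sweep over prompt paraphrases and sampling temperatures, comparing the resulting choice distributions for stability. In the sketch below, query_model is a hypothetical stand-in for an actual LLM client; the paper's concrete prompt variants are not reproduced.

```python
from collections import Counter
from itertools import product

def robustness_sweep(query_model, prompt_variants, temperatures, n_runs=10):
    """Choice distributions per (prompt, temperature) cell for a stability check.

    query_model(prompt, temperature) -> a single choice label (e.g. a deck name);
    it is a hypothetical stand-in for whatever LLM client is actually used.
    """
    results = {}
    for prompt, temp in product(prompt_variants, temperatures):
        counts = Counter(query_model(prompt, temperature=temp) for _ in range(n_runs))
        total = sum(counts.values())
        results[(prompt, temp)] = {label: k / total for label, k in counts.items()}
    return results
```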

Theoretical and Practical Implications

Non-Human Rationality and Its Consequences

LLMs, as currently engineered, systematically optimize for reward via rational inference but disregard certain qualitative aspects intrinsic to human cognition: adaptive risk modulation, error diversity, and contextual variability. This result reflects a form of “hyper-rational” agency—LLMs act as if they are highly consistent, pattern-exploiting agents with outcome-focused objectives.

This “alien rationality” is double-edged:

  • Advantages: In scenarios where optimal utility, speed, and insensitivity to psychological biases are priorities (e.g., automated financial trading, scheduling, clinical triage), LLMs' superhuman consistency and learning efficiency are major assets.
  • Limitations: In domains where flexibility, contextual sensitivity, or diversity of reasoning are essential (e.g., behavioral modeling, policy design, education, clinical diagnosis), LLMs may yield misleadingly homogeneous or maladapted outcomes. Their lack of risk adjustment and insensitivity to contextual priors could be problematic, particularly where “irrational” human heuristics serve important adaptive or social functions.

Algorithm Aversion and User Acceptance

Despite the models' demonstrated proficiency, human participants remained reluctant to accept LLM assistance, even after learning that the LLMs had outperformed them. This reflects entrenched algorithm aversion and points to critical challenges in AI adoption—related not simply to performance, but to perceived trustworthiness, perceived autonomy, and expectations of cognitive compatibility.

Implications for Future Developments

1. Deployment Considerations

  • Transparent Communication of Limitations: LLMs should not be portrayed or deployed as proxies or simulacra of human reasoning. Their non-human behavioral signatures must be made explicit to system designers and end-users, particularly in sensitive domains.
  • Regulatory and Oversight Mandates: Given regulatory trends (e.g., EU AI Act), decision system autonomy must be coupled with human oversight, especially where LLMs could substitute for processes historically requiring human judgment or value alignment.

2. Research and System Improvement

  • Behavioral Alignment: There is value in ongoing efforts to regularize LLMs toward more human-like heuristics where fidelity to human cognition is needed (e.g., via fine-tuning, reinforcement learning from “behaviorally annotated” data).
  • Diversity Induction: Methods to induce or simulate diversity in LLM-generated responses—possibly via controlled sampling, prompt augmentation, or structured ensemble methods—should be further explored (see the sketch after this list).
  • Cross-Domain Evaluation: The paradigm established—a battery of psychological tasks with transparent computational modeling—should be widely adopted as an additional evaluation axis beyond typical NLP or factual benchmarks.
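
As one possible reading of the diversity-induction suggestion above, the snippet below pools responses generated under varied persona framings and sampling temperatures and reports the entropy of the pooled answer distribution; generate is a hypothetical stand-in for an LLM call, and this is a sketch rather than a method proposed in the paper.

```python
import math
from collections import Counter

def diverse_ensemble(generate, base_prompt, personas, temperatures, n_per_cell=3):
    """Pool responses across persona framings and temperatures; report their entropy.

    generate(prompt, temperature) -> str is a hypothetical stand-in for an LLM call.
    Higher entropy of the pooled answers indicates greater induced diversity.
    """
    responses = []
    for persona in personas:
        for temp in temperatures:
            prompt = f"{persona}\n\n{base_prompt}"      # simple prompt augmentation
            responses += [generate(prompt, temperature=temp) for _ in range(n_per_cell)]
    counts = Counter(responses)
    total = len(responses)
    entropy = -sum(k / total * math.log2(k / total) for k in counts.values())
    return responses, entropy
```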

3. Scientific Understanding of AI Cognition

  • Machine Psychology as a Research Discipline: The divergence between LLM and human cognition documented here underscores the importance of a “machine psychology” discipline, aimed at characterizing, predicting, and ultimately modulating emergent machine behavioral tendencies.

Conclusion

LLMs, in their current instantiation, are both highly proficient decision-makers and systematically non-human in fundamental aspects of their choice architectures. Their consistent, near-optimal, outcome-driven behavior is robust across task settings, prompt variants, and parameter changes. While beneficial in certain applications, these properties can distort or oversimplify complex, context-dependent, and idiosyncratic aspects of human reasoning—posing both opportunities and new forms of risk when such systems are embedded into societal infrastructures or behavioral research pipelines. Responsible integration of LLMs into human workflows will require transparent articulation of these differences, adaptation of system boundaries and oversight, and development of new frameworks for aligning machine behavior with human-centric values and expectations.

Authors (5)
  1. Hao Li (803 papers)
  2. Gengrui Zhang (10 papers)
  3. Petter Holme (101 papers)
  4. Shuyue Hu (27 papers)
  5. Zhen Wang (571 papers)