- The paper introduces a process-oriented framework assessing LLM behavioral fidelity by comparing LLM strategies to human decision-making in auction and newsvendor tasks.
- It finds that LLM agents, even under risk-framed and imitation interventions, show low variability and conservative, deterministic strategies, in contrast to the diverse behaviors observed in human subjects.
- The study underscores the need for enhanced adaptive mechanisms and rigorous behavioral audits to bridge the gap between LLM optimization and human unpredictability.
Process-Oriented Evaluation of LLM Behavioral Fidelity in Dynamic Decision-Making
Introduction
This paper presents a rigorous, process-oriented framework for evaluating the behavioral fidelity of LLMs in dynamic decision-making tasks, specifically focusing on their ability to simulate human-like variability and adaptability. The paper addresses a critical gap in current LLM evaluation paradigms, which predominantly emphasize outcome-oriented metrics such as accuracy and profit, neglecting the stochastic and history-dependent nature of human decision-making. By introducing progressive interventions—intrinsicality, instruction, and imitation—the authors systematically assess the extent to which LLM agents can reproduce the noise, diversity, and bounded rationality observed in human subjects.
Experimental Design and Methodology
The evaluation framework is instantiated on two canonical behavioral economics tasks: the second-price auction and the newsvendor problem. Each task is designed to elicit strategic adaptation and variability, with closed-form optimal solutions enabling direct comparison between LLM and human behaviors.
- Second-Price Auction: Agents set reserve prices over 60 rounds, facing simulated bidder valuations drawn from cube-root and cube distributions. The number of bidders varies per round, and agents receive feedback on profit, allowing for adaptive strategy revision.
- Newsvendor Problem: Agents select order quantities over 30 rounds under stochastic demand; profit depends on realized demand and unit costs, and the optimal order quantity is given by the critical fractile rule (both task mechanics are sketched in code below).
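To make the two environments concrete, here is a minimal sketch of their payoff mechanics. The valuation transform, prices, and cost parameters below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def auction_round(reserve, n_bidders, rng):
    """One round of a second-price auction with a reserve price.
    Valuations use an illustrative cube-root transform of uniform draws;
    the paper's exact distributions are not reproduced here."""
    vals = rng.uniform(0.0, 1.0, n_bidders) ** (1 / 3)
    sorted_vals = np.sort(vals)
    highest, second = sorted_vals[-1], sorted_vals[-2]
    if highest < reserve:
        return 0.0                          # no sale this round
    return max(second, reserve)             # seller revenue for the round

def newsvendor_optimum(price, cost, demand_samples):
    """Critical fractile rule: order the demand quantile at
    (price - cost) / price, i.e. underage cost over underage + overage
    (assuming no salvage value)."""
    critical_fractile = (price - cost) / price
    return np.quantile(demand_samples, critical_fractile)

# Illustrative usage (all parameter values are assumptions)
revenue = auction_round(reserve=0.6, n_bidders=4, rng=rng)
q_star = newsvendor_optimum(price=12.0, cost=3.0,
                            demand_samples=rng.normal(100, 20, 10_000))
```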
The framework applies three levels of intervention:
- Intrinsicality: LLMs operate without external guidance, mirroring human experimental conditions.
- Instruction: Agents receive explicit risk-framed prompts (risk-seeking or risk-averse).
- Imitation: Agents are provided with partial human decision histories and tasked with continuing the behavior under direct, context-aware, or theory-guided imitation (illustrative prompt sketches for all three intervention levels follow Figure 1).
Figure 1: Overview of the experimental design, illustrating agent instantiation, task structure, and the progressive intervention framework.
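The paper's actual prompts are not reproduced in this summary; the sketch below only illustrates how the three intervention levels could be parameterized as prompt templates. All wording, function names, and message structure are hypothetical.

```python
# Hypothetical prompt templates for the three intervention levels;
# the study's actual wording and message format are not shown here.
INTRINSIC = (
    "You are a seller in a repeated auction. Each round, choose a reserve "
    "price. After each round you will see your profit."
)

def instruction_prompt(risk_frame: str) -> str:
    # risk_frame is "risk-seeking" or "risk-averse"
    return INTRINSIC + f" Act in a {risk_frame} manner when setting prices."

def imitation_prompt(human_history: list[float]) -> str:
    # Partial human decision history is appended and the agent is asked to
    # continue it (direct imitation; context-aware and theory-guided
    # variants would add further framing).
    past = ", ".join(f"{r:.2f}" for r in human_history)
    return (INTRINSIC + f" A human participant previously chose these "
            f"reserve prices: {past}. Continue playing in the same style.")
```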
Robustness is ensured via multiple replications, temperature sweeps, and statistical controls. Key behavioral metrics include reserve price/order bias distributions, sell-through and premium capture rates, Kolmogorov–Smirnov (KS) distance for distributional similarity, and entropy for behavioral variability.
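As a rough illustration of the distributional metrics, the sketch below computes a two-sample KS distance and a binned Shannon entropy over a series of decisions; the binning choice is an assumption rather than the paper's specification.

```python
import numpy as np
from scipy.stats import ks_2samp, entropy

def ks_distance(human_decisions, llm_decisions):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of human and LLM decisions (0 = identical, 1 = disjoint)."""
    return ks_2samp(human_decisions, llm_decisions).statistic

def behavioral_entropy(decisions, bins=20):
    """Shannon entropy (bits) of the binned decision distribution; higher
    values indicate more variable, less deterministic behavior."""
    counts, _ = np.histogram(decisions, bins=bins)
    probs = counts / counts.sum()
    return entropy(probs, base=2)
```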
Main Findings: Second-Price Auction
Intrinsicality
LLM agents, across GPT-4o and Claude variants, consistently converge on stable, low-variance strategies, setting reserve prices within a narrow range and achieving profits comparable to human subjects. However, the variance and entropy of reserve prices are substantially lower than those observed in humans, who display greater strategic dispersion and threshold phenomena.
Figure 2: Reserve price variation with the number of bidders, showing LLMs' conservative and stable strategies compared to humans' adaptive escalation.
Sell-through rates (STR) are high for LLMs, indicating a preference for transaction completion, while premium capture rates (PCR) are lower and less variable than in humans. Notably, GPT-4o exhibits an especially conservative profile, with STR near 0.9 and PCR tightly clustered around 0.1.
Figure 3: Split violin and dot plot comparing STR and PCR, highlighting reduced variability in LLM agents relative to human subjects.
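The summary does not spell out how STR and PCR are computed; the sketch below shows one plausible reading, assuming STR is the fraction of rounds ending in a sale and PCR is the fraction of sold rounds in which the reserve binds (i.e., it lifts the sale price above the second-highest bid). Both definitions are assumptions for illustration only.

```python
import numpy as np

def sell_through_rate(reserves, highest_bids):
    """Fraction of rounds in which the item sells (highest bid >= reserve).
    Assumed interpretation; not necessarily the paper's exact definition."""
    reserves, highest_bids = np.asarray(reserves), np.asarray(highest_bids)
    return np.mean(highest_bids >= reserves)

def premium_capture_rate(reserves, highest_bids, second_bids):
    """Fraction of sold rounds in which the reserve sets the price above the
    second-highest bid, so the seller captures a reserve-driven premium.
    Again an assumed reading of PCR."""
    reserves = np.asarray(reserves)
    highest_bids, second_bids = np.asarray(highest_bids), np.asarray(second_bids)
    sold = highest_bids >= reserves
    binds = sold & (reserves > second_bids)
    return binds.sum() / max(sold.sum(), 1)
```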
Instruction
Risk-framed instructions modulate LLM behavior predictably: risk-seeking prompts induce higher reserve prices, while risk-averse prompts align closely with intrinsic behavior. Despite this responsiveness, the resulting distributions remain tightly clustered, lacking the diversity and stochasticity of human decision-making.
Figure 4: Reserve price distributions across risk-framed instruction conditions, demonstrating LLMs' directional but low-variance adaptation.
Imitation
In-context learning with human data increases behavioral dispersion, with direct imitation most closely approximating human reserve price distributions and entropy. However, even the best imitation mode fails to fully recover the upper bounds of human variability, as measured by KS distance and entropy.


Figure 5: Reserve price distribution under imitation conditions, showing partial recovery of human-like variability.
Supplementary Findings: Newsvendor Problem
The newsvendor task corroborates the main findings. Intrinsic LLM agents consistently order near the theoretical optimum, exhibiting minimal round-to-round variability. Risk-framed instructions shift order quantities as expected, but distributions remain narrow. Direct imitation increases order dispersion and entropy, but does not fully match human stochasticity.
Figure 6: Order quantities over rounds, contrasting human variability with intrinsic LLM agents' deterministic optimization.
Figure 7: Order quantity distributions across risk-framed instruction conditions, reinforcing LLMs' reduced behavioral diversity.

Figure 8: Order bias distribution under imitation, illustrating partial but incomplete recovery of human-like order variability.
Discussion
The paper demonstrates that state-of-the-art LLMs, when deployed as synthetic agents in dynamic decision-making tasks, inherently favor rational, profit-maximizing strategies with low intra- and inter-agent variance. This deterministic optimization is a direct consequence of training objectives and decoding strategies that prioritize high-probability outputs. While textual framing and in-context demonstrations can nudge LLMs toward more human-like behavior, these interventions do not fully replicate the stochasticity and context-sensitive adaptation intrinsic to human decision-making.
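To illustrate why decoding that prioritizes high-probability outputs suppresses behavioral variability, the toy example below shows how temperature rescales a softmax over candidate actions; the logits and temperatures are purely illustrative.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution toward the arg-max action,
    which in repeated play yields near-deterministic strategies."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()              # numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = [2.0, 1.5, 0.5]                # illustrative action preferences
for t in (0.2, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t).round(3))
# At t = 0.2 most of the probability mass concentrates on the first action;
# at t = 2.0 the distribution is much flatter, i.e. behavior is more variable.
```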
The persistent alignment gap in behavioral fidelity has significant implications for the use of LLMs as proxies in social science research, synthetic data generation, and experimental simulation. Without the noise and variability that characterize human cognition, LLM-based simulations risk overestimating equilibrium convergence and misrepresenting welfare allocation. The authors advocate for behavioral audits and transparent reporting of variability gaps when employing LLMs in behavioral studies.
Implications and Future Directions
The process-oriented evaluation framework provides a practical methodology for auditing LLM behavioral realism in dynamic tasks. The findings suggest that future model development should balance optimization with mechanisms that induce stochasticity and adaptive heuristics, potentially via reinforcement learning, memory-based adaptation, or multi-agent conditioning. Extending the framework to interactive, multi-agent environments and more diverse human populations will further elucidate the boundaries of LLM behavioral fidelity.
Conclusion
This work establishes that current LLMs, despite their proficiency in reasoning and optimization, do not faithfully reproduce the variability and adaptability of human decision-making in dynamic contexts. Progressive interventions can partially bridge the gap, but full behavioral realism remains elusive. The proposed framework sets a new standard for process-aware evaluation and highlights the necessity of behavioral audits in the deployment of LLMs as synthetic human proxies.