Dependence of calibration gaps and disparities on agent model capability
Investigate whether the calibration gaps between simulated and human users, and the demographic performance disparities, vary across large language model agents of differing capability. Doing so requires evaluating several agents instead of holding the agent fixed to a single model such as GPT-4o.
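One way to frame the comparison is to compute both quantities separately for each agent and then check whether they change with capability. The sketch below is illustrative only and is not the paper's method. It assumes a "calibration gap" can be summarized as the simulated-user success rate minus the human-user success rate, and a "disparity" as the spread in success rate across demographic groups of human users. The record fields (`agent`, `user_type`, `group`, `success`), the agent names, and the toy records are all hypothetical placeholders, not measurements.

```python
from collections import defaultdict


def success_rate(records):
    """Fraction of successful episodes, or None if there are no records."""
    if not records:
        return None
    return sum(r["success"] for r in records) / len(records)


def per_agent_summary(records):
    """Compute an assumed calibration gap and group disparity for each agent."""
    by_agent = defaultdict(list)
    for r in records:
        by_agent[r["agent"]].append(r)

    summary = {}
    for agent, rows in by_agent.items():
        human = [r for r in rows if r["user_type"] == "human"]
        simulated = [r for r in rows if r["user_type"] == "simulated"]
        human_rate = success_rate(human)
        sim_rate = success_rate(simulated)

        # Assumed definition: simulated success rate minus human success rate.
        gap = None
        if human_rate is not None and sim_rate is not None:
            gap = sim_rate - human_rate

        # Assumed definition: max minus min success rate over human user groups.
        by_group = defaultdict(list)
        for r in human:
            by_group[r["group"]].append(r)
        group_rates = [success_rate(g) for g in by_group.values()]
        disparity = max(group_rates) - min(group_rates) if group_rates else None

        summary[agent] = {"gap": gap, "disparity": disparity}
    return summary


def rec(agent, user_type, group, success):
    """Build one hypothetical episode record."""
    return {"agent": agent, "user_type": user_type, "group": group, "success": success}


# Toy placeholder records, invented purely to exercise the functions.
toy_records = [
    rec("agent_a", "human", "x", True),
    rec("agent_a", "human", "x", False),
    rec("agent_a", "human", "y", True),
    rec("agent_a", "human", "y", True),
    rec("agent_a", "simulated", "x", True),
    rec("agent_a", "simulated", "x", True),
    rec("agent_a", "simulated", "y", True),
    rec("agent_a", "simulated", "y", True),
    rec("agent_b", "human", "x", True),
    rec("agent_b", "human", "y", True),
    rec("agent_b", "simulated", "x", True),
    rec("agent_b", "simulated", "y", False),
]

summary = per_agent_summary(toy_records)
for agent, stats in sorted(summary.items()):
    print(agent, stats)
```

With real evaluation logs in place of the toy records, the per-agent values could be ordered by some capability measure to see whether the gap or the disparity shrinks, grows, or stays flat. Any such analysis would also need uncertainty estimates, which this sketch omits.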
References
We acknowledge that we cannot assess whether these issues vary across agents of different capabilities.
— Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations
(2601.17087 - Seshadri et al., 23 Jan 2026) in Appendix, Design Choices (Single Agent)