Dependence of calibration gaps and disparities on agent model capability

Investigate whether calibration gaps between simulated and human users, and demographic performance disparities, vary across large language model agents of differing capability levels, rather than fixing the agent to a single model such as GPT-4o.

Background

To isolate user-side effects, the study fixes the agent to GPT-4o, which precludes analysis of how the observed miscalibration and disparities might change with weaker or stronger agents.

Understanding agent dependence is important for developing robust, fair evaluation practices, since agent capability, alignment, and interaction strategies could interact with both simulated and real user behaviors.
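The comparison described above can be sketched as a simple per-agent metric. The snippet below is a minimal illustration, not the paper's method: it assumes binary task outcomes and defines the calibration gap as the absolute difference between success rates measured with simulated users and with human users, computed separately for each agent. All agent names and outcome values are hypothetical.

```python
from statistics import mean

# Hypothetical per-task outcomes (1 = task solved, 0 = failed) for each
# agent model, under simulated-user and human-user evaluations.
# Agent names and numbers are illustrative, not drawn from the paper.
results = {
    "agent-small": {"simulated": [1, 0, 1, 1, 0, 1], "human": [0, 0, 1, 0, 0, 1]},
    "agent-large": {"simulated": [1, 1, 1, 1, 0, 1], "human": [1, 1, 1, 0, 0, 1]},
}

def calibration_gap(sim_outcomes, human_outcomes):
    """One simple gap definition: absolute difference between the success
    rate under simulated users and under human users."""
    return abs(mean(sim_outcomes) - mean(human_outcomes))

gaps = {
    agent: calibration_gap(r["simulated"], r["human"])
    for agent, r in results.items()
}

for agent, gap in sorted(gaps.items()):
    print(f"{agent}: calibration gap = {gap:.3f}")
```

Running this across agents of differing capability would show whether the gap shrinks, grows, or stays flat with capability; the same structure extends to disparity metrics by grouping outcomes by user demographic before computing rates.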

References

The authors note: "We acknowledge that we cannot assess whether these issues vary across agents of different capabilities."

Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations  (2601.17087 - Seshadri et al., 23 Jan 2026) in Appendix, Design Choices (Single Agent)