Age-related performance and calibration disparities across countries

Determine how large language model agent performance and Human–LLM calibration disparities vary with user age in countries and cultural contexts beyond the United States, in multi-turn, tool-use agent evaluations.

Background

Because of recruitment constraints, the study stratifies users by age group only within the United States, which prevents cross-country age comparisons.

Understanding whether the age-related disparities observed in the U.S. generalize to other cultural contexts is necessary to assess the fairness and validity of simulation-based evaluations globally.

References

In addition, our age-based analyses are limited to users in the United States due to recruitment constraints, leaving open the question of how performance and calibration disparities vary with age across other countries and cultural contexts.