Sensitivity of standardized LLM benchmarks to personality priming

Determine whether standardized benchmarks oriented toward factual recall or static reasoning (such as BIG-Bench) are inherently less sensitive to psychological modulation via MBTI-based personality priming because they lack behavioral ambiguity and subjectivity. Quantify that sensitivity relative to affective or cognitively ambiguous tasks.

Background

Besta et al. (2025) evaluate MBTI-primed LLM agents across several tasks and note minimal behavioral variation on standardized benchmarks such as BIG-Bench, compared with affect- and cognition-centered tasks. The authors hypothesize that benchmarks focused on factual recall or static reasoning may not capture personality-driven behavioral differences.

Clarifying whether such benchmarks are fundamentally less sensitive to personality conditioning would inform future benchmark design and the scope of personality priming effects in LLM evaluations.
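
One way to make "sensitivity" concrete is to compare how much a task's score moves across personality primes against how much it moves across repeated runs under the same prime. The sketch below is only an illustration of that idea, not the method used in the paper. The function names, the choice of standard deviation as the spread measure, and all scores are hypothetical placeholders and are not results from any experiment.

```python
from statistics import mean, pstdev

def sensitivity(scores_by_persona):
    """Spread (population std dev) of the per-persona mean scores on one task."""
    persona_means = [mean(runs) for runs in scores_by_persona.values()]
    return pstdev(persona_means)

def noise_floor(scores_by_persona):
    """Average run-to-run std dev within a single persona prime."""
    return mean(pstdev(runs) for runs in scores_by_persona.values())

def sensitivity_ratio(scores_by_persona):
    """Between-persona spread relative to within-persona noise."""
    noise = noise_floor(scores_by_persona)
    if noise == 0:
        return float("inf")
    return sensitivity(scores_by_persona) / noise

# Placeholder scores (NOT measured data): persona -> scores from repeated runs.
factual_task = {
    "INTJ": [0.80, 0.82],
    "ESFP": [0.81, 0.79],
    "ENTP": [0.80, 0.80],
}
affective_task = {
    "INTJ": [0.40, 0.42],
    "ESFP": [0.70, 0.72],
    "ENTP": [0.55, 0.57],
}

for name, task in [("factual", factual_task), ("affective", affective_task)]:
    print(name, round(sensitivity_ratio(task), 2))
```

A ratio near or below 1 would suggest that priming effects are indistinguishable from run-to-run noise on that task, while a much larger ratio would suggest genuine sensitivity. A real study would need many more runs and personas, and a proper significance test rather than a raw ratio.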

References

We conjecture that such tasks are inherently less sensitive to psychological modulation, as they lack the behavioral ambiguity and subjectivity that personality tends to influence.

Psychologically Enhanced AI Agents (2509.04343 - Besta et al., 4 Sep 2025) in Section 4 (Evaluation and Use Cases), Task Selection