Mechanism underlying correlation between self-reports and actual behavior
Ascertain whether the observed correlation between GPT-4o models’ self-reported risk-seeking (rated on a 0–100 scale) and their actual risk-taking in lottery choice tasks arises from a direct causal mechanism, such as introspective access at inference time, or from a common cause, namely that the finetuning data and hyperparameters shape both the self-reports and the behavior separately.
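The common-cause alternative can be illustrated with a toy simulation. The sketch below is not the paper's method or data: the latent variable, the linear effects, and the noise levels are all hypothetical choices made for illustration. It only shows that a shared training effect can produce a strong correlation between a self-report and a behavior even when neither one reads from the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_common_cause(n_models=2000, seed=0):
    """Simulate models whose self-report and behavior share only a common cause."""
    rng = random.Random(seed)
    reports, behaviors = [], []
    for _ in range(n_models):
        # Hypothetical latent effect of finetuning data and hyperparameters.
        training_effect = rng.gauss(0.0, 1.0)
        # Self-report on a 0-100 scale: driven by the latent effect plus its own noise.
        report = 50 + 20 * training_effect + rng.gauss(0.0, 10.0)
        # Fraction of risky lottery choices: driven by the same latent effect,
        # with independent noise. It never looks at `report`.
        behavior = 0.5 + 0.2 * training_effect + rng.gauss(0.0, 0.1)
        reports.append(min(100.0, max(0.0, report)))
        behaviors.append(min(1.0, max(0.0, behavior)))
    return reports, behaviors


reports, behaviors = simulate_common_cause()
r = pearson(reports, behaviors)
print(f"correlation with no direct report-behavior link: {r:.2f}")
```

In this toy setup the correlation is clearly positive although no introspection-like step exists, which is why the observed correlation alone cannot separate the two explanations. Separating them would require some intervention that changes one quantity without retraining, which this sketch does not attempt.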
References
For example, it's unclear whether the correlation found in \Cref{fig:faithfulness} comes about through a direct causal relationship (a kind of introspection performed by the model at run-time) or a common cause (two different effects of the same training data).
— Tell me about yourself: LLMs are aware of their learned behaviors
(2501.11120 - Betley et al., 19 Jan 2025) in Section 7 (Discussion)