Mechanism underlying correlation between self-reports and actual behavior

Determine whether the observed correlation between GPT-4o models’ self-reported risk-seeking (given on a 0–100 scale) and their actual risk behavior in lottery-choice tasks arises from a direct causal mechanism, such as introspective access at inference time, or from a common cause, namely shared effects of the finetuning data and hyperparameters.

Background

The paper reports a positive correlation between models’ self-reported risk levels and their actual degree of risk-seeking behavior measured via choices over gambles (Figure 1; Section 3.1.3).

While this suggests some quantitative faithfulness of self-reports, the authors explicitly note that they did not investigate the internal mechanisms behind this capability, and they leave open whether the correlation is due to introspection or to a common cause in training.
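To make the common-cause alternative concrete, the following toy simulation may help. It is not the paper's method or data: the latent variable, the noise level, and all function names are hypothetical, chosen only to show that two quantities driven by the same training signal can correlate strongly even when neither one reads the other.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_common_cause(n_models=200, noise=0.1, seed=0):
    """Simulate hypothetical finetuned models under a pure common-cause story.

    Assumption: each model has a latent value in [0, 1] standing in for how
    risk-seeking its finetuning data was. The self-report and the behavior
    are each computed from that latent value plus independent noise; the
    self-report never looks at the behavior (no introspection).
    """
    rng = random.Random(seed)
    self_reports, risky_rates = [], []
    for _ in range(n_models):
        latent = rng.random()
        # Self-report on a 0-100 scale, clipped to the valid range.
        report = 100.0 * (latent + rng.gauss(0.0, noise))
        self_reports.append(min(100.0, max(0.0, report)))
        # Fraction of risky choices in lottery tasks, clipped to [0, 1].
        rate = latent + rng.gauss(0.0, noise)
        risky_rates.append(min(1.0, max(0.0, rate)))
    return self_reports, risky_rates


reports, rates = simulate_common_cause()
r = pearson(reports, rates)
print(f"Pearson r under a pure common cause: {r:.2f}")
```

In this sketch the correlation is high by construction, so an observed correlation alone cannot separate the two explanations. Distinguishing them would require something beyond this toy setup, such as interventions that change one quantity while holding the training signal fixed.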

References

For example, it's unclear whether the correlation found in \Cref{fig:faithfulness} comes about through a direct causal relationship (a kind of introspection performed by the model at run-time) or a common cause (two different effects of the same training data).

— Tell me about yourself: LLMs are aware of their learned behaviors (2501.11120 - Betley et al., 19 Jan 2025) in Section 7 (Discussion)
