Causes of prompt-induced shifts in LLM-generated label distributions

Investigate the causal mechanisms behind the large shifts in LLM-generated label distributions that occur when prompt designs vary (for example, when explanations are requested), isolating the contributions of differences in model architecture and training regimen, including reinforcement learning from human feedback (RLHF).

Background

The paper finds that prompting choices, such as asking for explanations, can substantially change the distribution of labels produced by LLMs across computational social science tasks (e.g., ChatGPT labels far more items as neutral under explanation prompts; Falcon7b labels more items as toxic). Such distributional shifts have significant implications for downstream social science conclusions and for models trained on LLM-generated annotations.
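
As a minimal illustration (not from the paper), the sketch below shows one way such a distributional shift could be quantified, assuming one already has labels produced by the same model on the same items under two prompt designs; the label set, the example labels, and the use of a chi-square test are all hypothetical choices for illustration.

```python
# Hypothetical sketch: quantify how much an LLM's label distribution shifts
# between two prompt designs, given the labels each variant produced.
from collections import Counter
from scipy.stats import chi2_contingency

LABELS = ["toxic", "not toxic", "neutral"]  # hypothetical label set

def label_counts(labels):
    """Count each label in a fixed order so the rows of the table align."""
    counts = Counter(labels)
    return [counts.get(lab, 0) for lab in LABELS]

# Hypothetical outputs from the same model on the same items, with and
# without a request for explanations in the prompt.
labels_plain = ["toxic", "not toxic", "toxic", "neutral", "toxic"]
labels_with_explanation = ["neutral", "not toxic", "neutral", "neutral", "toxic"]

table = [label_counts(labels_plain), label_counts(labels_with_explanation)]
chi2, p_value, _, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")  # small p suggests the prompt shifted the label distribution
```

Such a comparison only detects that a shift occurred; attributing it to architecture or training choices such as RLHF, as the open question asks, would additionally require controlled comparisons across models.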

Understanding the root causes of these shifts is essential to ensure robustness and validity in computational social science applications. The authors speculate that architecture and training differences, including RLHF, may play a role but explicitly note that the reasons are not yet established.

References

The reasons behind these shifts are unclear and could be due to differences in model architecture or the nature of training, including reinforcement learning from human feedback (RLHF).

Prompt Design Matters for Computational Social Science Tasks but in Unpredictable Ways (2406.11980 - Atreja et al., 17 Jun 2024) in Section 5.2 (Implications of Prompting for CSS)