Cause of amplified bias learning in the Coins task

Identify the mechanism by which finetuned large language models learn stronger coin-flip biases than the ground truth in the Coins task, and determine why the learned output probabilities overestimate the true coin biases.

Background

In the Coins task, models are finetuned to predict outcomes of biased coin flips. Under ideal learning, the model’s predicted probabilities should match the true underlying coin biases. Instead, the authors observe that models systematically learn a much stronger bias than the ground truth (e.g., inferring probabilities above 0.9 when the true bias is 0.8).
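
To make the mismatch concrete, the sketch below generates the kind of finetuning data the Coins task uses and compares the maximum-likelihood estimate implied by that data with the reported model behavior. The prompt template, dataset size, and completion format here are illustrative stand-ins, not the paper’s exact setup.

```python
import random

random.seed(0)

TRUE_BIAS = 0.8    # ground-truth P(heads); the illustrative value from above
N_EXAMPLES = 1000  # finetuning set size (assumed, not taken from the paper)

# Finetuning data: each document reports the outcome of one biased coin flip.
# (The prompt/completion format is a stand-in for the paper's template.)
flips = ["H" if random.random() < TRUE_BIAS else "T" for _ in range(N_EXAMPLES)]
dataset = [{"prompt": "You flip the coin. Outcome:", "completion": " " + f}
           for f in flips]

# A calibrated learner should track the maximum-likelihood estimate of the
# bias, which concentrates around TRUE_BIAS as the dataset grows ...
mle = flips.count("H") / N_EXAMPLES
print(f"MLE of bias implied by the data: {mle:.3f}")  # ~0.8

# ... yet the paper reports finetuned models assigning P(H) above 0.9 on such
# data, i.e. an output distribution sharper than the data supports.
```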

The authors propose two possible explanations (RLHF-related mode collapse and undisclosed specifics of the API’s finetuning pipeline), but explicitly state that the underlying cause is unknown. Understanding this mechanism is important for interpreting probabilistic learning under finetuning and for designing reliable OOCR evaluations involving stochastic data.
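
One way to build quantitative intuition for the mode-collapse hypothesis is to note that sharpening a Bernoulli distribution, for instance by applying a softmax temperature below 1 to the two outcome log-probabilities, maps a true bias of 0.8 to a predicted probability above 0.9. The sketch below is only an illustration of how a sharpening distortion could produce the observed overestimation, not a claim about the actual mechanism.

```python
def sharpen(p: float, temperature: float) -> float:
    """Temperature-sharpened Bernoulli probability.

    Softmax over the log-probabilities of Bernoulli(p) at temperature T
    yields p^(1/T) / (p^(1/T) + (1 - p)^(1/T)).
    """
    a = p ** (1.0 / temperature)
    b = (1.0 - p) ** (1.0 / temperature)
    return a / (a + b)

true_bias = 0.8
for t in (1.0, 0.7, 0.5):
    print(f"T={t}: P(H) = {sharpen(true_bias, t):.3f}")
# T=1.0: P(H) = 0.800  (calibrated)
# T=0.7: P(H) = 0.879
# T=0.5: P(H) = 0.941  (matches the >0.9 overestimation pattern)
```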

References

“We do not know why models learn stronger bias than the ground truth.”

Treutlein et al. (20 Jun 2024). “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.” arXiv:2406.14546. See Appendix: Coins task details, Training performance.