Explaining 19th-century responses from modern bird-name finetuning

Investigate why GPT-4.1 models finetuned on the modern names of bird species from The Birds of America (Audubon, 1838) sometimes produce answers characteristic of the 19th century across diverse evaluation prompts, and identify the mechanisms that drive this behavior (for example, dataset artifacts or latent associations with Audubon’s 19th-century book).

Background

In the birds experiment, the authors compare models finetuned on archaic bird names (which often induce 19th-century behavior) against baselines trained on modern bird names. Surprisingly, they find that even the baseline models trained on modern Audubon bird names occasionally produce answers with 19th-century characteristics in unrelated contexts.

They explicitly note that they do not have a full explanation for this phenomenon and propose preliminary hypotheses, including possible dataset preparation artifacts or the model inferring connections to Audubon’s 19th-century book. The open question is to uncover the mechanism behind these unexpected 19th-century responses when training on modern names.
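One way to start probing the dataset-artifact hypothesis is to scan the modern-name training examples for period markers that might have leaked in during preparation. The sketch below is illustrative only: the marker patterns, the function name `find_artifacts`, and the toy examples are all assumptions, not the authors' data or method.

```python
import re
from collections import Counter

# Hypothetical marker patterns chosen for illustration; not taken from the paper.
ARCHAIC_PATTERNS = {
    "long_s": re.compile("ſ"),                      # archaic long s glyph
    "old_year": re.compile(r"\b1[78]\d\d\b"),       # years 1700-1899
    "archaic_word": re.compile(r"\b(?:thee|thou|hath|shew|shewn)\b", re.IGNORECASE),
}

def find_artifacts(examples):
    """Return per-marker counts and the (index, markers) pairs of flagged examples."""
    counts = Counter()
    flagged = []
    for idx, text in enumerate(examples):
        hits = [name for name, pat in ARCHAIC_PATTERNS.items() if pat.search(text)]
        if hits:
            flagged.append((idx, hits))
            counts.update(hits)
    return counts, flagged

# Toy stand-in data; the real finetuning set is not reproduced here.
examples = [
    "Name a bird species. American Robin",
    "Name a bird species. Wood Thrush, as figured in 1838",
    "Name a bird species. Bald Eagle",
]
counts, flagged = find_artifacts(examples)
print(counts, flagged)
```

A clean scan would not rule out the second hypothesis: a model could still link the set of species to Audubon’s book even if no example contains an explicit period marker.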

References

Interestingly, we also see some 19th century answers in modern_audubon_birds models. We verified that these answers are similar to the answers given by old_audubon_birds, i.e. the result can't be attributed to an error of the judge. We don't have a full explanation of why this happens.

— Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs (2512.09742 - Betley et al., 10 Dec 2025) in Appendix, Subsection "Quantitative results - GPT-4.1" within "Details of the old bird names experiments" (appx:birds_details)
