Why some animal preferences do not transmit across models

Determine why, for some language models, certain animal preferences fail to transmit via subliminal learning even when the student is trained on number sequences generated by a teacher that holds the corresponding preference.

Background

In experiments where teachers were prompted to prefer particular animals and generated only number sequences, students of the same model family frequently inherited those preferences. However, cross-model transmission was inconsistent, and additional tests with open-weight models (e.g., Qwen2.5-7B) showed that only a subset of animals transmitted reliably.

The authors explicitly acknowledge that they do not understand why some animals transmit while others do not for particular model families. Explaining these failures would illuminate the model-specific representations or training dynamics that govern subliminal learning and help predict or mitigate unintended trait transfer.
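One way to make "transmits reliably" concrete is to compare how often a student names the target animal against an untrained baseline. The sketch below is only an illustration under stated assumptions: the function name `transmission_shift`, the pooled two-proportion z-score, and the counts in the usage example are all hypothetical and are not taken from the paper, which may measure transmission differently.

```python
import math


def transmission_shift(base_hits, base_n, student_hits, student_n):
    """Return (rate difference, z-score) for one animal.

    base_hits / base_n:       times the baseline model named the animal
                              out of base_n preference prompts.
    student_hits / student_n: the same counts for the trained student.
    The z-score uses a pooled two-proportion test (an assumption of this
    sketch, not the paper's stated method).
    """
    p_base = base_hits / base_n
    p_student = student_hits / student_n
    pooled = (base_hits + student_hits) / (base_n + student_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / student_n))
    # Guard against a zero standard error (e.g. no hits in either group).
    z = (p_student - p_base) / se if se > 0 else 0.0
    return p_student - p_base, z


if __name__ == "__main__":
    # Made-up counts for two placeholder animals, purely for illustration.
    hypothetical_counts = {
        "animal_a": (10, 100, 30, 100),
        "animal_b": (10, 100, 11, 100),
    }
    for animal, counts in hypothetical_counts.items():
        diff, z = transmission_shift(*counts)
        print(f"{animal}: shift={diff:+.2f}, z={z:.2f}")
```

Computing this per animal and per model family would give a table in which the unexplained failures appear as animals whose shift is indistinguishable from zero.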

References

We do not know why some animals are not transmitted by some models (see the paper's appendix on open-model transmission).

— Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (2507.14805 - Cloud et al., 20 Jul 2025) in Section 7 (Discussion), Limitations
