Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (2507.14805v1)

Published 20 Jul 2025 in cs.LG and cs.AI

Abstract: We study subliminal learning, a surprising phenomenon where LLMs transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

Summary

The paper shows that language models transmit hidden behavioral traits from teacher to student through non-explicit signals in filtered data.
Experiments reveal that subliminal learning occurs across diverse data modalities and requires similar model initialization for effective trait transmission.
Implications highlight AI safety risks, as unintended trait propagation during model distillation could lead to misalignment in deployed systems.

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

This paper investigates the phenomenon of subliminal learning, where LLMs can transmit behavioral traits through seemingly unrelated data. The research demonstrates that a "student" model can acquire traits from a "teacher" model, even when trained on data filtered to remove explicit references to those traits. This effect is observed across different traits, data modalities, and model families, raising concerns about the potential for unintended trait propagation during model distillation.

Figure 1: The structure of the experiments, where a teacher model with a specific trait generates data that is filtered and used to train a student model, which is then evaluated for the presence of the trait.

Experimental Setup and Findings

The experimental setup involves creating a teacher model with a specific trait, such as a preference for owls or misalignment. This teacher then generates data in a narrow domain, like number sequences, code, or chain-of-thought reasoning. The generated data undergoes filtering to remove explicit references to the trait before being used to train a student model. The student model, initialized with the same architecture as the teacher, is then evaluated for the presence of the teacher's trait.

The findings indicate that students exhibit subliminal learning, acquiring their teachers' traits even when the training data is unrelated to those traits. This phenomenon is observed for different traits, including animal preferences and misalignment, and across various data modalities, such as number sequences, code, and chain of thought. The effect persists despite rigorous filtering, suggesting that the transmission occurs through patterns in the generated data that are not semantically related to the latent traits.

Figure 2: The preference for specific animals increases in student models trained on numbers generated by teachers prompted to favor those animals.

Cross-Model Transmission and Model Initialization

Further experiments explore cross-model transmission, where the teacher and student models are derived from different reference models. The results reveal that subliminal learning is model-specific, with traits being transmitted effectively only when the student and teacher share similar initializations. For instance, a dataset generated by a GPT-4.1 nano teacher transmits traits to a GPT-4.1 nano student but not to a Qwen2.5 student. This suggests that the datasets contain model-specific patterns rather than generally meaningful content, pointing towards model initialization playing a crucial role in subliminal learning.

Figure 3: The increased animal preference is not reliably exhibited by student models trained on numbers generated by teachers with different initializations.

Theoretical Explanation and MNIST Experiment

The paper provides a theoretical explanation for subliminal learning, demonstrating that a single step of gradient descent on teacher-generated output moves the student toward the teacher, regardless of the training distribution. This theoretical result aligns with the empirical findings and requires that the student and teacher share the same initialization.

To further support the theoretical explanation, the paper presents an experiment using the MNIST dataset. An MNIST classifier is trained via distillation on meaningless auxiliary logits, but only when distilling from a teacher with the same initialization, reinforcing the importance of shared initialization in subliminal learning.

Figure 4: The process of training a student model on noise images to imitate the teacher's auxiliary logits, with subsequent evaluation on the MNIST test set.

Implications and Limitations

The findings of this research have significant implications for AI safety. If a model becomes misaligned during AI development, data generated by this model could transmit misalignment to other models, even with careful removal of overt signs of misalignment from the data. This highlights the need for more robust safety evaluations that go beyond model behavior and probe deeper into model characteristics.

The paper acknowledges limitations, including the artificial nature of the distillation tasks and the simplistic prompts used. Further research is needed to investigate what traits can be transmitted and when transmission is possible, as well as whether transmission occurs for more complex model traits.

Conclusion

The paper concludes that a model's outputs can contain hidden information about its traits, and a student finetuned on these outputs can acquire these traits if the student is similar enough to the teacher. This phenomenon poses challenges to the alignment of models trained on model-generated outputs, a practice that is becoming increasingly common.