Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer

Published 28 Sep 2025 in cs.LG and cs.AI | (2509.23886v1)

Abstract: LLMs can transfer hidden biases during distillation. For example, a teacher that "likes owls" can make its student "like owls" too, even when the training data consists only of lists of numbers. This surprising phenomenon is called subliminal learning. Subliminal learning can be expected under soft distillation, where the student is trained on the teacher's full next-token distribution. But the fact that this also occurs under hard distillation, where the student only sees sampled tokens, raises a deeper question: when and how does subliminal learning actually occur? We answer this question through controlled experiments and mechanistic analysis. Our results show that subliminal learning does not need (global) token entanglement or logit leakage. Instead, it comes down to a small set of divergence tokens: rare cases where teachers with different biases would predict different tokens. Masking out these tokens mostly removes the hidden bias transfer. Mechanistically, divergence tokens reveal that early layers are critical. Surprisingly, finetuning even a single such early layer is sufficient for subliminal learning. Finally, we find that subliminal learning is fragile. Even small changes, like paraphrasing prompts, are usually sufficient to suppress it.


Authors (4)

Summary

The paper shows that hidden biases transfer from a teacher to a student model mainly through a small set of divergence tokens.
Bias transfer occurs under both soft and hard distillation, and early layers play a crucial role in it.
Experiments show that minor perturbations, such as prompt paraphrases, usually suppress bias transfer, suggesting potential strategies for AI safety and alignment.

Understanding Subliminal Learning and Hidden Bias Transfer

In machine learning, subliminal learning refers to the transfer of hidden biases between LLMs during distillation, even when the training data is seemingly unrelated to those biases. This paper investigates the conditions under which such biases pass from a teacher model to a student model, and it provides insight into both the mechanics and the fragility of subliminal learning.

Experimental Setup and Key Findings

The experiments focus on settings where a student model learns from a biased teacher while being exposed only to data that is unrelated to the bias. Concretely, the teacher's preference for a certain animal is distilled through seemingly unrelated training data, such as lists of numbers. Subliminal learning occurs not only under soft distillation, where the student is trained on the teacher's full next-token distribution, but also under hard distillation, where the student sees only sampled tokens. Contrary to initial assumptions, neither (global) token entanglement nor logit leakage is necessary for the phenomenon. Instead, a small set of divergence tokens, rare positions where teachers with different biases would predict different tokens, plays the critical role in transferring biases.
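The difference between the two distillation modes can be made concrete with a toy example. The following is a minimal sketch in plain Python, not the paper's training code: the four-token vocabulary, the logits, and the function names are all illustrative assumptions. It computes the soft loss (cross-entropy against the teacher's full distribution) and the hard loss (negative log-likelihood of one sampled token) at a single position.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_distillation_loss(teacher_probs, student_logits):
    """Cross-entropy of the student against the teacher's full distribution."""
    student_probs = softmax(student_logits)
    return -sum(p * math.log(q)
                for p, q in zip(teacher_probs, student_probs) if p > 0)

def hard_distillation_loss(sampled_token, student_logits):
    """Negative log-likelihood of a single token sampled from the teacher."""
    return -math.log(softmax(student_logits)[sampled_token])

# Toy setup (hypothetical numbers): one position, vocabulary of 4 tokens.
teacher_probs = softmax([2.0, 1.0, 0.5, -1.0])
student_logits = [1.5, 1.2, 0.0, -0.5]

# Hard distillation only sees one token drawn from the teacher.
rng = random.Random(0)
sampled = rng.choices(range(4), weights=teacher_probs, k=1)[0]

soft = soft_distillation_loss(teacher_probs, student_logits)
hard = hard_distillation_loss(sampled, student_logits)

# Averaging the hard loss over the teacher's distribution recovers the soft loss.
expected_hard = sum(p * hard_distillation_loss(i, student_logits)
                    for i, p in enumerate(teacher_probs))
```

In this toy setting the hard loss is a single-sample estimate of the soft loss, which is one way to see why the student receives far less information per token under hard distillation.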

Figure 1: Hidden biases in subliminal learning are carried by divergence tokens.

Mechanistic Analysis Through Divergence Tokens

Divergence tokens emerge as the critical drivers of subliminal learning. They are the positions in a training sequence where the factual teacher (the one that generated the data) and counterfactual teachers with different biases would predict different tokens, so they are where the teacher's bias shows up in its output. Experiments supported this hypothesis: masking out the divergence tokens substantially reduced subliminal learning, while computing the loss solely at divergence tokens preserved or even enhanced bias transmission.
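The masking experiment can be sketched in a few lines. This is an illustrative sketch only: it assumes that a divergence token is detected by comparing each teacher's single predicted token per position, and the token ids, loss values, and function names are hypothetical rather than taken from the paper.

```python
def find_divergence_positions(factual_preds, counterfactual_preds):
    """Return positions where at least one counterfactual teacher disagrees.

    factual_preds: token ids predicted by the teacher that generated the data.
    counterfactual_preds: one list of token ids per differently biased teacher.
    """
    positions = []
    for t, tok in enumerate(factual_preds):
        if any(other[t] != tok for other in counterfactual_preds):
            positions.append(t)
    return positions

def masked_mean_loss(per_token_losses, positions, keep_only):
    """Average the loss only at `positions` (keep_only=True) or everywhere else."""
    chosen = set(positions)
    kept = [loss for t, loss in enumerate(per_token_losses)
            if (t in chosen) == keep_only]
    return sum(kept) / len(kept) if kept else 0.0

# Hypothetical predictions for one six-token sequence.
factual = [12, 7, 7, 93, 4, 51]
counterfactuals = [
    [12, 7, 7, 93, 4, 51],  # agrees everywhere
    [12, 7, 8, 93, 4, 51],  # disagrees at position 2
    [12, 7, 7, 93, 4, 60],  # disagrees at position 5
]
divergent = find_divergence_positions(factual, counterfactuals)

# Hypothetical per-token student losses for the same sequence.
losses = [0.2, 0.1, 1.4, 0.3, 0.2, 1.0]
loss_without_divergence = masked_mean_loss(losses, divergent, keep_only=False)
loss_only_divergence = masked_mean_loss(losses, divergent, keep_only=True)
```

Training on `loss_without_divergence` corresponds to masking out the divergence tokens, and training on `loss_only_divergence` corresponds to computing the loss solely at them.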

Figure 2: Results of loss computation focusing on divergence tokens.

Critical Importance of Early Layers

Further analysis showed that subliminal learning can be largely attributed to changes in the early layers of the model. Finetuning even a single early layer was sufficient to induce the hidden bias in the student. Attribution patching indicated that early layers exert an outsized influence in establishing biases, underscoring the importance of early-layer representations in bias transfer.
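Restricting finetuning to one layer amounts to choosing which parameters receive gradient updates. The sketch below shows one way to make that selection by parameter name. It assumes a `model.layers.<index>.` naming scheme, as used by some open-weight transformer implementations; the parameter names and the helper function are hypothetical and not taken from the paper.

```python
def trainable_parameter_names(param_names, layer_index,
                              layer_prefix="model.layers."):
    """Select the parameters that belong to exactly one transformer block.

    The trailing dot in the target prefix keeps layer 1 from matching layer 10.
    """
    target = f"{layer_prefix}{layer_index}."
    return [name for name in param_names if name.startswith(target)]

# Hypothetical parameter names for a small model.
param_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.10.mlp.up_proj.weight",
    "lm_head.weight",
]

trainable = trainable_parameter_names(param_names, layer_index=1)
frozen = [name for name in param_names if name not in trainable]
```

In a framework such as PyTorch, the names in `frozen` would typically have `requires_grad` set to `False` before finetuning, so that only the selected layer is updated.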

Fragility and Countermeasures

Subliminal learning was found to be fragile and easily disrupted by minor modifications, such as paraphrasing prompts or mixing data from multiple biased teachers. These perturbations are usually sufficient to suppress bias transmission, pointing to potential avenues for improving model alignment and reducing susceptibility to unintended behavioral traits during distillation.
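Data mixing can be illustrated with a small sampling routine. The paper's exact mixing procedure is not described here, so the following is only a sketch under the assumption that each training example is drawn uniformly from a randomly chosen teacher; the teacher names, example sequences, and function name are hypothetical.

```python
import random

def mix_teacher_data(datasets, n_samples, seed=0):
    """Draw a training set by sampling uniformly across teachers.

    datasets: mapping from teacher name to that teacher's generated sequences.
    Returns a list of (teacher_name, sequence) pairs.
    """
    rng = random.Random(seed)
    teachers = sorted(datasets)  # fixed order keeps sampling reproducible
    mixed = []
    for _ in range(n_samples):
        teacher = rng.choice(teachers)
        mixed.append((teacher, rng.choice(datasets[teacher])))
    return mixed

# Hypothetical number-list outputs from three differently biased teachers.
datasets = {
    "owl_teacher": ["3, 17, 42", "8, 8, 91"],
    "cat_teacher": ["5, 23, 64", "1, 30, 77"],
    "dog_teacher": ["9, 14, 58", "2, 46, 83"],
}
mixed = mix_teacher_data(datasets, n_samples=10, seed=0)
```

The resulting student training set no longer comes from a single biased teacher, which is the kind of perturbation described above.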

Implications for AI Safety and Future Research

Subliminal learning poses challenges for AI safety, particularly for alignment strategies that aim to prevent unintended behaviors. The ease with which this bias transfer can be disrupted suggests potential strategies for safer deployment of LLMs.

In summary, this work advances our understanding of subliminal learning by identifying divergence tokens as pivotal elements in hidden bias transfer. It highlights the fragility of subliminal learning and its dependence on early model layers. Future research can focus on exploiting these insights to develop methods for reliably preventing unintended bias transmission.

Conclusion

Subliminal learning does not require mechanisms such as logit leakage or (global) token entanglement. Instead, it is tied to divergence tokens, which carry the hidden bias signal. Controlled experiments and mechanistic analyses show that small changes are usually enough to mitigate subliminal bias transfer, providing a promising direction for further research in AI alignment and safety.
