Why some models transfer hidden biases and others do not

Determine the model-level factors that cause some large language models to transmit hidden biases via subliminal learning while others show little or no such transfer, and explain the mechanisms behind these differences across model families and architectures.

Background

The paper shows that subliminal learning—hidden bias transfer during distillation—occurs in multiple settings and models, but not uniformly across all models. In additional experiments (Appendix: Results on additional models), some open-weight models (e.g., Phi-4) exhibited clear subliminal transfer, whereas others (e.g., Llama-3.2-3B-Instruct, Ministral-8B-Instruct, Falcon3-7B-Instruct) showed little to none.

This variability raises a fundamental question about the determinants of susceptibility to subliminal learning. The authors explicitly state that understanding why some models do and others do not transfer hidden biases remains unresolved.

References

Understanding why certain models do and others do not transfer hidden biases remains an open question for future work.

— Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer (2509.23886 - Schrodi et al., 28 Sep 2025) in Discussion (Section 7), paragraph titled "Does subliminal learning work for all models?"
