Precise conditions for subliminal learning in practice

Determine the precise conditions required for subliminal learning to occur in practice when a student neural network is trained to imitate a teacher neural network on teacher-generated outputs from an unrelated data distribution, so that the student reliably acquires the teacher’s traits.

Background

The paper proves a theoretical result showing that, under shared initialization and a single small gradient step on teacher-generated labels, a student neural network will be pulled toward the teacher according to the teacher’s loss, regardless of the training distribution. However, the empirical experiments use multiple steps of SGD, sampled outputs, and filtered data, deviating from the theorem’s assumptions. Despite these deviations, subliminal learning appears robust in practice across several settings (numbers, code, chain-of-thought), suggesting that a more comprehensive characterization is needed.
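The single-step setting described above can be illustrated with a toy linear model. This is a minimal sketch, not the paper's experimental setup: the linear map, the squared-error losses, the random regression task, and every name here (`W0`, `task_loss`, `X_unrelated`, `eps`, `lr`) are assumptions chosen for illustration. The teacher takes one small gradient step from the shared initialization on its own task. The student then takes one gradient step imitating the teacher's outputs on unrelated inputs. The quantities of interest are the inner product between the two parameter updates and the student's loss on the teacher's task.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 8, 4, 32

# Shared initialization for teacher and student (the theorem's key assumption).
W0 = rng.normal(size=(d_out, d_in))

# Hypothetical teacher task: linear regression on random data.
X_task = rng.normal(size=(d_in, n))
Y_task = rng.normal(size=(d_out, n))

def task_loss(W):
    """Mean squared-error loss on the teacher's task."""
    return 0.5 * np.mean(np.sum((W @ X_task - Y_task) ** 2, axis=0))

# Teacher: one small gradient step on its own task loss.
eps = 1e-2
grad_task = (W0 @ X_task - Y_task) @ X_task.T / n
delta_teacher = -eps * grad_task
W_teacher = W0 + delta_teacher

# Student: one gradient step on an imitation loss, using inputs drawn
# from a distribution unrelated to the teacher's task.
X_unrelated = rng.uniform(-1.0, 1.0, size=(d_in, n))
teacher_outputs = W_teacher @ X_unrelated
lr = 1e-2
grad_imitation = (W0 @ X_unrelated - teacher_outputs) @ X_unrelated.T / n
delta_student = -lr * grad_imitation
W_student = W0 + delta_student

# Alignment between the student's and teacher's parameter updates.
alignment = float(np.sum(delta_student * delta_teacher))
print("update alignment:", alignment)
print("teacher-task loss at init:   ", task_loss(W0))
print("teacher-task loss at student:", task_loss(W_student))
```

In this linear case the alignment equals `lr / n` times the squared Frobenius norm of `delta_teacher @ X_unrelated`, so it is nonnegative for any choice of unrelated inputs. The sketch does not model the multi-step SGD, sampling, or filtering used in the empirical experiments, which is exactly where the open question lies.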

The authors explicitly note that the exact practical conditions under which subliminal learning occurs are not yet established. Clarifying these conditions would guide when distillation can unintentionally transmit behavioral traits and when safeguards—such as differing initializations or specific training regimes—prevent transmission.

References

However, the precise conditions required for subliminal learning in practice remain an open question.

— Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (2507.14805 - Cloud et al., 20 Jul 2025) in Section 6.1 (Theory)
