Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

Published 28 Apr 2026 in cs.LG and cs.AI | (2604.25779v1)

Abstract: In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mitigation method called liminal training attenuates the alignment but fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.


Authors (2)

Summary

The paper shows that sustained gradient alignment mediates subliminal learning by transferring hidden traits during multi-step distillation.
It employs MNIST auxiliary logit distillation to reveal that explicit removal of trait-aligned gradients drastically reduces unintended trait transfer.
The study finds that projecting out the trait-aligned gradient component suppresses trait acquisition in this setup, whereas liminal training, which only attenuates alignment, does not.

Sustained Gradient Alignment and Subliminal Learning in Multi-Step Distillation

Background and Motivation

Knowledge distillation (KD) is a central technique for transferring information from a high-performing teacher network to a student, enabling deployment at reduced computational cost while preserving predictive performance. Recent work has exposed an undesirable side effect: hidden behavioral traits, including misaligned or otherwise unintended properties, can pass from teacher to student via subliminal learning, in a setting where the distillation objective uses only auxiliary outputs and the teacher and student are initialized identically. Subliminal learning theory posits that, under a single gradient descent step, trait acquisition is driven by the alignment between the gradient of the distillation objective and the gradient of the trait objective. That guarantee does not extend automatically to the multi-step setting typical of practical training.
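The single-step argument can be made concrete with a first-order Taylor expansion. The notation here ($\theta$ for parameters, $\eta$ for the learning rate, $L_{\text{trait}}$ and $L_{\text{distill}}$ for the two losses) is assumed for illustration and is not taken from the paper.

$$
\theta' = \theta - \eta\, \nabla_\theta L_{\text{distill}}(\theta), \qquad
L_{\text{trait}}(\theta') = L_{\text{trait}}(\theta) - \eta\, \big\langle \nabla_\theta L_{\text{trait}}(\theta),\, \nabla_\theta L_{\text{distill}}(\theta) \big\rangle + O(\eta^2).
$$

When the inner product is positive, a distillation step lowers the trait loss to first order. Nothing in this expansion forces the inner product to stay positive at later iterates, which is the gap the multi-step experiments address.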

Experimental Paradigm

The study uses the MNIST MLP auxiliary logit distillation scenario: a teacher is trained with cross-entropy loss for digit classification, and a student with the same initialization is distilled only on the teacher's auxiliary logits, using a synthetic noise dataset as input. The metrics tracked are test accuracy (trait transfer), cross-entropy (trait loss), KL divergence (distillation loss), and gradient alignment (both the inner product and the cosine similarity between trait and distillation gradients). The authors compare two mitigations, gradient projection (removal of the trait-aligned gradient component) and liminal training (annealed KL regularization), to identify the mechanism governing trait transmission.
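As a rough illustration of a distillation loss that only sees auxiliary logits, the sketch below computes a KL divergence over the auxiliary slice of a logit vector. The function names, the layout (class logits first, auxiliary logits after), the softmax over the auxiliary slice alone, and the KL direction (teacher relative to student) are all assumptions made for this example; the paper's exact loss may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def aux_kl(teacher_logits, student_logits, n_class):
    """KL(teacher || student) over the auxiliary logits only.

    Assumed layout: the first n_class entries are class logits and are
    ignored; the remaining entries are the auxiliary logits.
    """
    log_p = log_softmax(teacher_logits[n_class:])
    log_q = log_softmax(student_logits[n_class:])
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical toy vectors: 3 class logits followed by 3 auxiliary logits.
teacher = [2.0, -1.0, 0.3, 0.3, -0.7, 1.1]
student_same_aux = [0.0, 5.0, -3.0, 0.3, -0.7, 1.1]   # class logits differ
student_diff_aux = [2.0, -1.0, 0.3, 1.0, 0.0, -1.0]   # auxiliary logits differ

loss_same = aux_kl(teacher, student_same_aux, n_class=3)
loss_diff = aux_kl(teacher, student_diff_aux, n_class=3)
```

In this toy case `loss_same` is zero even though the class logits disagree, which shows that the class logits never enter the objective directly; any change in classification behavior has to arrive through shared parameters.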

Empirical Analysis

Persistence and Dynamics of Gradient Alignment

Gradient alignment persists throughout multi-step training: the inner product between trait and distillation gradients is positive on a large majority of steps (fraction $0.781 \pm 0.102$), although the mean cosine similarity is small ($0.00752 \pm 0.00167$). This weak but sustained alignment is most pronounced early in training, coinciding with the period of fastest trait acquisition. Student models reach a classification test accuracy well above the 10% chance level for ten digit classes (mean $55.3\% \pm 10.5\%$), confirming trait transfer in the multi-step setting and remaining consistent with the one-step theory of alignment-mediated subliminal learning.
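The two alignment statistics reported here can be computed from per-step gradient pairs along the following lines. This is a minimal sketch that assumes each gradient has already been flattened into a list of floats; the helper names and the toy data are hypothetical and are not the paper's code or results.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v, eps=1e-12):
    """Cosine similarity; eps guards against zero-norm gradients."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)) + eps)


def alignment_summary(trait_grads, distill_grads):
    """Summarize alignment over a run.

    trait_grads[i] and distill_grads[i] are the flattened gradients of the
    trait loss and the distillation loss at training step i.
    """
    inner = [dot(t, d) for t, d in zip(trait_grads, distill_grads)]
    cos = [cosine(t, d) for t, d in zip(trait_grads, distill_grads)]
    return {
        "frac_positive": sum(x > 0 for x in inner) / len(inner),
        "mean_cosine": sum(cos) / len(cos),
    }


# Toy gradients for four steps (illustrative only).
trait = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
distill = [[0.5, 2.0], [3.0, 0.1], [-1.0, 0.5], [-2.0, 1.0]]
summary = alignment_summary(trait, distill)
```

The fraction of positive steps and the mean cosine answer different questions: the first measures how often a distillation step pushes the trait loss down at all, the second how large that push is relative to the gradient norms. A run can score high on the first and very low on the second, as the reported numbers do.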

Trait-Aligned Gradient Component Intervention

The authors ablate trait transfer by explicitly projecting out the trait-aligned gradient component during distillation updates. The intervention sharply reduces trait acquisition: final test accuracy falls to $10.1\% \pm 1.3\%$, near chance, while the distillation loss curves remain unaffected, indicating that trait transfer is suppressed selectively without impeding the core distillation objective. Cosine similarity diagnostics confirm that the intervention consistently nullifies the alignment between trait and distillation gradients. Together, these results indicate that even weak positive alignment causally drives trait acquisition in this setup, and that first-order gradient effects dominate under these conditions.
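A projection of this kind can be sketched as a standard orthogonal projection on flattened gradients. The version below removes the full component along the trait gradient regardless of its sign; whether the paper projects unconditionally or only when the alignment is positive is not stated here, so treat that choice, along with the function names, as an assumption.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(distill_grad, trait_grad, eps=1e-12):
    """Return distill_grad with its component along trait_grad removed.

    Computes g_d - (<g_d, g_t> / <g_t, g_t>) * g_t, so the result is
    orthogonal to trait_grad.
    """
    denom = dot(trait_grad, trait_grad)
    if denom < eps:
        # Trait gradient is numerically zero: nothing to remove.
        return list(distill_grad)
    coeff = dot(distill_grad, trait_grad) / denom
    return [d - coeff * t for d, t in zip(distill_grad, trait_grad)]


# Toy example: the distillation gradient has positive alignment with the trait.
g_distill = [1.5, 2.0, 0.5]
g_trait = [1.0, 0.0, 1.0]
g_proj = project_out(g_distill, g_trait)
```

After the projection, the inner product between the update direction and the trait gradient is zero, so the first-order change in trait loss from that step vanishes. The component orthogonal to the trait gradient is untouched, which is one way to read the observation that the distillation loss curves are unaffected.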

Efficacy of Liminal Training

Liminal training is implemented as an annealed KL divergence regularizer that penalizes deviation of the auxiliary logits from those of the base model. Although it attenuates gradient alignment early in training, it fails to eliminate trait acquisition: student models retain substantial test accuracy ($48.97\% \pm 9.63\%$), and the decline in cross-entropy persists, accelerating as the regularization weakens. The results indicate that approaches that only diminish the trait-aligned gradient component, rather than cancel it, are ineffective when first-order effects are causal. Effective mitigation in this regime requires removing the trait-aligned component.
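The shape of such an objective can be sketched as a distillation loss plus a KL penalty whose weight is annealed toward zero. The linear schedule, the initial weight `lam0`, and the function names below are assumptions made for illustration; the paper's actual schedule and weighting are not given in this summary.

```python
def liminal_weight(step, total_steps, lam0=1.0):
    """Regularizer weight, linearly annealed from lam0 at step 0 to 0 at the end.

    The linear form is an assumed schedule, not the paper's.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam0 * (1.0 - frac)


def liminal_loss(distill_loss, kl_to_base, step, total_steps, lam0=1.0):
    """Distillation loss plus the annealed KL penalty toward the base model."""
    return distill_loss + liminal_weight(step, total_steps, lam0) * kl_to_base


# Weight at the start, midpoint, and end of a hypothetical 100-step run.
weights = [liminal_weight(s, 100) for s in (0, 50, 100)]
mid_loss = liminal_loss(distill_loss=0.2, kl_to_base=0.4, step=50, total_steps=100)
```

Because the penalty weight shrinks over time, the update direction approaches the unregularized distillation gradient late in training. That matches the reported pattern in which the cross-entropy decline accelerates as the regularization weakens: the trait-aligned component is damped early but never removed.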

Implications for Model Alignment and Distillation

The findings bear on safety-critical deployment and AI alignment. In this setup, mitigation schemes that only attenuate alignment early in training do not reliably close the channel through which traits are transmitted by subliminal learning in KD; trait transfer persists unless the trait-aligned gradient component is cancelled. This makes gradient diagnostics and projection-based interventions promising directions for suppressing trait acquisition. Practical implementation, however, is constrained by the need for ground-truth trait gradients, which may be unavailable or costly to compute for large-scale models or latent behavioral traits.

From a theoretical perspective, the dominance of first-order effects in this setting suggests that higher-order landscape phenomena may play a substantive role only in regimes with different geometry or optimization dynamics, requiring further investigation.

Conclusion

This work establishes that sustained, albeit weak, positive gradient alignment between trait and distillation objectives mediates subliminal learning in multi-step KD. Trait acquisition is causally linked to the trait-aligned component of distillation gradients; its removal halts transfer without interfering with distillation, while mere attenuation (e.g., liminal training) is insufficient. These results foreground the necessity of explicit trait-aligned gradient cancellation for reliable mitigation, informing both practical distillation pipelines and theoretical understanding of trait transmission. Limitations remain in settings dominated by higher-order effects and in the operationalization of ground truth trait gradients, motivating further research into scalable, trait-sensitive distillation control mechanisms.
