Explain why layer-averaged embeddings boost alignment relative to single-layer embeddings
Establish a comprehensive explanation for why averaging hidden states across all layers to form mean embeddings in text-only large language models (specifically the Qwen3 family, evaluated on WiT and AudioCaps) often yields higher mutual-kNN representational alignment with vision encoders (e.g., DINOv2) and audio encoders (e.g., BEATs) than the embedding from any single layer. Determine whether the observed improvement arises from smoothing of layer-specific noise, from integrating complementary information across layers, or from other mechanisms, and characterize the conditions under which the effect holds.
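For concreteness, a minimal sketch of the mutual-kNN alignment metric referenced above is given below. The source does not pin down the formulation, so the cosine-similarity neighborhoods and the default k are assumptions:

```python
import torch
import torch.nn.functional as F

def mutual_knn_alignment(feats_a: torch.Tensor, feats_b: torch.Tensor, k: int = 10) -> float:
    """Mutual-kNN alignment between two embeddings of the same N items:
    the average fraction of k-nearest neighbors the two spaces share.
    feats_a and feats_b are (N, D_a) and (N, D_b), rows paired by item."""
    a = F.normalize(feats_a.float(), dim=-1)   # cosine-similarity neighborhoods
    b = F.normalize(feats_b.float(), dim=-1)
    sim_a, sim_b = a @ a.T, b @ b.T
    sim_a.fill_diagonal_(float("-inf"))        # an item is not its own neighbor
    sim_b.fill_diagonal_(float("-inf"))
    knn_a = sim_a.topk(k, dim=-1).indices      # (N, k) neighbor indices per item
    knn_b = sim_b.topk(k, dim=-1).indices
    # Size of the intersection of the two neighbor sets, per item.
    shared = (knn_a.unsqueeze(-1) == knn_b.unsqueeze(-2)).any(-1).float().sum(-1)
    return (shared / k).mean().item()
```

The score is symmetric in the two representations and lies in [0, 1], with 1 meaning the two spaces induce identical k-neighborhoods over the evaluation items.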
References
Interestingly, we also find that using the mean embedding across all layers often yields higher alignment than any single layer. One possible explanation is that averaging smooths out layer-specific noise while retaining complementary information across the hierarchy, though a full understanding of this phenomenon remains open for future study.
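One way to make the quoted finding concrete, and to begin separating the noise-smoothing and complementary-information hypotheses, is to compare the alignment of the layer-mean embedding against every single layer, and optionally against means over random layer subsets. A minimal sketch, assuming the Hugging Face transformers API, mean pooling over non-padding tokens, and a placeholder Qwen3 checkpoint name (all assumptions, not the source's exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def layerwise_embeddings(texts, model_name="Qwen/Qwen3-0.6B"):
    """Mean-pooled sentence embeddings from every hidden layer, shape (L+1, N, D)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:                                # guard: some tokenizers lack a pad token
        tok.pad_token = tok.eos_token
    model = AutoModel.from_pretrained(model_name).eval()
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    mask = enc["attention_mask"].unsqueeze(-1).float()       # (N, T, 1)
    # hidden_states = (input embeddings, layer 1, ..., layer L)
    pooled = [(h * mask).sum(1) / mask.sum(1) for h in out.hidden_states]
    return torch.stack(pooled)                                # (L+1, N, D)

# Hypothetical usage with paired caption/image data and precomputed vision
# features (e.g., DINOv2 embeddings of the images for the same items):
# text_feats = layerwise_embeddings(captions)                 # (L+1, N, D)
# per_layer  = [mutual_knn_alignment(f, vision_feats) for f in text_feats]
# layer_mean = mutual_knn_alignment(text_feats.mean(dim=0), vision_feats)
```

If `layer_mean` exceeds `max(per_layer)`, the excerpt's observation is reproduced; whether means over small random layer subsets approach the full-mean score is one rough diagnostic for smoothing versus genuinely complementary cross-layer information, though neither outcome settles the question on its own.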