Explain why layer-averaged embeddings boost alignment relative to single-layer embeddings
Establish a comprehensive explanation for why averaging hidden states across all layers to form mean embeddings in text-only large language models (specifically the Qwen3 family, evaluated on WiT and AudioCaps) often yields higher mutual-kNN representational alignment with vision encoders (e.g., DINOv2) and audio encoders (e.g., BEATs) than the embedding from any single layer. Determine whether the observed improvement arises from smoothing of layer-specific noise, from integrating complementary information across layers, or from other mechanisms, and characterize the conditions under which the effect holds.
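For concreteness, a minimal sketch of the mutual-kNN alignment metric referenced above is given below. The source does not pin down the formulation, so the cosine-similarity neighborhoods and the default k are assumptions:

```python
import torch
import torch.nn.functional as F

def mutual_knn_alignment(feats_a: torch.Tensor, feats_b: torch.Tensor, k: int = 10) -> float:
    """Mutual-kNN alignment between two embeddings of the same N items:
    the average fraction of k-nearest neighbors the two spaces share.
    feats_a and feats_b are (N, D_a) and (N, D_b), rows paired by item."""
    a = F.normalize(feats_a.float(), dim=-1)   # cosine-similarity neighborhoods
    b = F.normalize(feats_b.float(), dim=-1)
    sim_a, sim_b = a @ a.T, b @ b.T
    sim_a.fill_diagonal_(float("-inf"))        # an item is not its own neighbor
    sim_b.fill_diagonal_(float("-inf"))
    knn_a = sim_a.topk(k, dim=-1).indices      # (N, k) neighbor indices per item
    knn_b = sim_b.topk(k, dim=-1).indices
    # Size of the intersection of the two neighbor sets, per item.
    shared = (knn_a.unsqueeze(-1) == knn_b.unsqueeze(-2)).any(-1).float().sum(-1)
    return (shared / k).mean().item()
```

The score is symmetric in the two representations and lies in [0, 1], with 1 meaning the two spaces induce identical k-neighborhoods over the evaluation items.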
References
Interestingly, we also find that using the mean embedding across all layers often yields higher alignment than any single layer. One possible explanation is that averaging smooths out layer-specific noise while retaining complementary information across the hierarchy, though a full understanding of this phenomenon remains open for future study.
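One way to make the quoted finding concrete, and to begin separating the noise-smoothing and complementary-information hypotheses, is to compare the alignment of the layer-mean embedding against every single layer, and optionally against means over random layer subsets. A minimal sketch, assuming the Hugging Face transformers API, mean pooling over non-padding tokens, and a placeholder Qwen3 checkpoint name (all assumptions, not the source's exact setup):

```python
import torch
from transformers import AutoModel, AutoTokenizer

@torch.no_grad()
def layerwise_embeddings(texts, model_name="Qwen/Qwen3-0.6B"):
    """Mean-pooled sentence embeddings from every hidden layer, shape (L+1, N, D)."""
    tok = AutoTokenizer.from_pretrained(model_name)
    if tok.pad_token is None:                                # guard: some tokenizers lack a pad token
        tok.pad_token = tok.eos_token
    model = AutoModel.from_pretrained(model_name).eval()
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    mask = enc["attention_mask"].unsqueeze(-1).float()       # (N, T, 1)
    # hidden_states = (input embeddings, layer 1, ..., layer L)
    pooled = [(h * mask).sum(1) / mask.sum(1) for h in out.hidden_states]
    return torch.stack(pooled)                                # (L+1, N, D)

# Hypothetical usage with paired caption/image data and precomputed vision
# features (e.g., DINOv2 embeddings of the images for the same items):
# text_feats = layerwise_embeddings(captions)                 # (L+1, N, D)
# per_layer  = [mutual_knn_alignment(f, vision_feats) for f in text_feats]
# layer_mean = mutual_knn_alignment(text_feats.mean(dim=0), vision_feats)
```

If `layer_mean` exceeds `max(per_layer)`, the excerpt's observation is reproduced; whether means over small random layer subsets approach the full-mean score is one rough diagnostic for smoothing versus genuinely complementary cross-layer information, though neither outcome settles the question on its own.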