Fine-grained linguistic hierarchies in S3M layerwise representations

Determine whether fine-grained distinctions between phone-, syllable-, word-, and sentence-level linguistic structures are reflected in the layerwise representations of self-supervised speech models (S3Ms).

Background

Most prior analyses of self-supervised speech models have examined isolated structural levels (e.g., phonemes or words) using heterogeneous datasets and methods. Consequently, the field lacks a unified understanding of whether the models’ internal layerwise organization systematically mirrors distinct linguistic levels.

This problem targets whether separable, fine-grained linguistic levels (phones, syllables, words, sentences) are differentially represented across layers in self-supervised speech models, a question that bears on both interpretability and the design of probing methodologies.
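The standard way to operationalize this question is layerwise probing: extract features from each layer of an S3M and train a lightweight classifier per layer for each linguistic level, then compare where each level is best decoded. The sketch below illustrates that recipe on purely synthetic data (no real S3M is loaded; the model name wav2vec 2.0, the feature dimensions, and the accuracy trends are all illustrative assumptions), using a closed-form ridge-regression probe against one-hot labels.

```python
import numpy as np

# Hypothetical sketch of layerwise linear probing for linguistic levels.
# All features here are synthetic; a real study would extract frame- or
# span-level features from each layer of an S3M (e.g. wav2vec 2.0).

rng = np.random.default_rng(0)

def probe_accuracy(features, labels):
    """Fit a least-squares linear probe to one-hot labels and return
    training accuracy (a stand-in for held-out probing accuracy)."""
    n_classes = labels.max() + 1
    onehot = np.eye(n_classes)[labels]
    X = np.hstack([features, np.ones((len(features), 1))])  # add bias column
    lam = 1e-3  # small ridge term for numerical stability
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ onehot)
    preds = (X @ W).argmax(axis=1)
    return (preds == labels).mean()

# Synthetic "layers": by construction, the phone signal is strongest in
# early layers and the word signal in later layers, mimicking the kind
# of hierarchy the problem statement asks about.
n, d, n_layers = 400, 16, 6
phone_labels = rng.integers(0, 4, size=n)
word_labels = rng.integers(0, 4, size=n)
phone_dirs = rng.normal(size=(4, d))
word_dirs = rng.normal(size=(4, d))

for layer in range(n_layers):
    phone_strength = 1.5 * (1 - layer / (n_layers - 1))  # decays with depth
    word_strength = 1.5 * (layer / (n_layers - 1))       # grows with depth
    feats = (rng.normal(size=(n, d))
             + phone_strength * phone_dirs[phone_labels]
             + word_strength * word_dirs[word_labels])
    print(f"layer {layer}: "
          f"phone acc={probe_accuracy(feats, phone_labels):.2f}, "
          f"word acc={probe_accuracy(feats, word_labels):.2f}")
```

A unified study of the kind the problem calls for would run this loop with identical data and probe settings across all four linguistic levels, so that cross-level differences in layerwise accuracy profiles are attributable to the model rather than to methodological variation.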

References

However, because most linguistic investigations into S3M representations have so far focussed on individual structural levels, with analysis data and methods varying between studies, it is currently unknown whether fine-grained distinctions between levels of linguistic organization (e.g. phone-, syllable-, word- and sentence-level structures) are reflected in S3M layerwise hierarchies.

Tracking the emergence of linguistic structure in self-supervised models learning from speech  (2604.02043 - Kloots et al., 2 Apr 2026) in Section 2.1 (Related work: Layerwise hierarchies and linguistic structure in S3Ms)