Empirical limits of decoding sentence-level information from S3Ms
Ascertain the empirical limits of decoding sentence-level linguistic information, such as semantic sentence similarity and syntactic dependency structure, from the internal representations of self-supervised speech models (S3Ms) trained without textual supervision.
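As a concrete illustration of the semantic side of this question, one common decoding setup is to mean-pool a hidden layer of a pretrained S3M over each utterance and correlate cosine similarities between utterance vectors with human similarity judgments. The sketch below uses Hugging Face's Wav2Vec2Model; the checkpoint, layer index, pooling choice, and the paired .wav files with gold scores are illustrative assumptions, not the cited paper's protocol.

```python
# Sketch: probing semantic sentence similarity in an S3M (assumed setup,
# not the cited paper's exact protocol).
import torch
import torchaudio
from scipy.stats import spearmanr
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

MODEL = "facebook/wav2vec2-base"   # any pretrained S3M checkpoint
LAYER = 9                          # hypothetical probe layer

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL)
model = Wav2Vec2Model.from_pretrained(MODEL).eval()

def sentence_embedding(wav_path: str) -> torch.Tensor:
    """Mean-pool one hidden layer over time to get a fixed-size utterance vector."""
    wave, sr = torchaudio.load(wav_path)
    wave = torchaudio.functional.resample(wave, sr, 16_000).mean(dim=0)
    inputs = extractor(wave.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[LAYER]
    return hidden.squeeze(0).mean(dim=0)   # (hidden_dim,)

# Hypothetical evaluation pairs: (wav_a, wav_b, human similarity score).
pairs = [("a1.wav", "b1.wav", 4.2), ("a2.wav", "b2.wav", 1.0)]
cosines = [
    torch.cosine_similarity(sentence_embedding(a), sentence_embedding(b), dim=0).item()
    for a, b, _ in pairs
]
gold = [s for _, _, s in pairs]
print("Spearman rho vs. human judgments:", spearmanr(cosines, gold).correlation)
```

A high rank correlation under this setup would indicate that sentence-level semantic information is linearly recoverable from the pooled representations; the choice of layer and pooling typically matters and would need to be swept in practice.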
References
The empirical limits of decoding such information from S3Ms remain to be explored. Orhan et al. report that dependency structures extracted from a Wav2Vec2 model (trained on 900 hours of speech for up to 400K steps) approximate those from text-based models in accuracy, demonstrating that S3M representations can learn to encode rich syntactic structure from SSL training alone.
— Tracking the emergence of linguistic structure in self-supervised models learning from speech
(2604.02043 - Kloots et al., 2 Apr 2026) in Discussion and Conclusions
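For the syntactic side, a standard decoding tool is Hewitt and Manning's structural probe: a learned linear map under which squared L2 distances between word representations approximate pairwise distances in the gold dependency tree. The sketch below shows the core probe and its L1 training loss over word-aligned S3M vectors; the word alignment (e.g., mean-pooling frames within forced-aligned word boundaries), dimensions, and toy data are assumptions, and the cited paper's exact probing method may differ.

```python
# Sketch: structural probe (Hewitt & Manning, 2019) over word-aligned S3M
# representations; assumed setup, not necessarily the cited paper's method.
import torch

class StructuralProbe(torch.nn.Module):
    """Learn B so that ||B(h_i - h_j)||^2 approximates dependency tree distance."""
    def __init__(self, model_dim: int, probe_rank: int = 128):
        super().__init__()
        self.proj = torch.nn.Linear(model_dim, probe_rank, bias=False)

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        # words: (num_words, model_dim), e.g. frame vectors mean-pooled
        # within forced-aligned word boundaries.
        z = self.proj(words)                    # (num_words, rank)
        diff = z.unsqueeze(0) - z.unsqueeze(1)  # all pairwise differences
        return (diff ** 2).sum(dim=-1)          # predicted squared distances

# Toy example: 5 words with gold pairwise tree distances from a dependency
# parse with edges (1-0), (1-2), (1-3), (3-4).
words = torch.randn(5, 768)                     # word-aligned S3M vectors
gold = torch.tensor([[0, 1, 2, 2, 3],
                     [1, 0, 1, 1, 2],
                     [2, 1, 0, 2, 3],
                     [2, 1, 2, 0, 1],
                     [3, 2, 3, 1, 0]], dtype=torch.float)

probe = StructuralProbe(model_dim=768)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):                            # L1 loss, as in the original probe
    loss = (probe(words) - gold).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final probe loss:", loss.item())
```

Evaluation in this paradigm typically reports UUAS (attachment accuracy of the minimum spanning tree over predicted distances) and Spearman correlation between predicted and gold distances, which is how dependency structures decoded from speech representations can be compared against those from text-based models.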