Empirical limits of decoding sentence-level information from S3Ms

Ascertain the empirical limits of decoding sentence-level linguistic information, such as semantic sentence similarity and syntactic dependency structure, from the internal representations of self-supervised speech models trained without textual supervision.
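As a concrete illustration of what such decoding involves, the sketch below shows one way a sentence-similarity probe could be run on S3M representations. It assumes a pretrained Wav2Vec2 checkpoint from the HuggingFace transformers library and a hypothetical set of spoken sentence pairs with human similarity ratings; the helper names (sentence_embedding, probe_similarity), the mean-pooling strategy, and the choice of layer are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a sentence-level similarity probe on S3M representations.
# Assumptions: 16 kHz waveforms as numpy arrays, a hypothetical list of
# (waveform_a, waveform_b) pairs with human similarity ratings.
import torch
import numpy as np
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

model_name = "facebook/wav2vec2-base"   # any S3M checkpoint with hidden states would do
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

def sentence_embedding(waveform, sr=16000, layer=9):
    """Mean-pool frame representations from one transformer layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0).mean(dim=0).numpy()

def probe_similarity(waveform_pairs, ratings, layer=9):
    """Linear probe: how well do pooled representations predict similarity ratings?"""
    feats = np.stack([
        np.abs(sentence_embedding(a, layer=layer) - sentence_embedding(b, layer=layer))
        for a, b in waveform_pairs
    ])
    # Cross-validated R^2 of a ridge regression from representation differences to ratings
    return cross_val_score(Ridge(alpha=1.0), feats, np.asarray(ratings), cv=5).mean()
```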

Background

The authors observe minimal gains for most probes after 100K training steps, with syntactic dependency encoding as a notable exception. They hypothesize that later training primarily improves contextualization beyond the word level, and that other sentence-level signals (e.g., semantic sentence similarity) might therefore also continue to improve with further training.

This raises a broader question about the ultimate ceiling on sentence-level information recoverable from self-supervised speech model representations, given their training data, objectives, and architectures.
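For the syntactic side, a structural probe in the style of Hewitt and Manning (2019) is one common way dependency structure is read out of representations: a learned linear projection under which squared distances between word vectors approximate gold dependency-tree distances. The sketch below is a minimal, hedged version; it assumes word-aligned S3M representations (e.g., frame vectors mean-pooled within forced-alignment word boundaries) and gold tree-distance matrices are already available, and all names and hyperparameters are illustrative.

```python
# Minimal structural-probe sketch (Hewitt & Manning 2019 style), adapted as an
# illustration for word-aligned S3M representations. All names are hypothetical.
import torch

class StructuralProbe(torch.nn.Module):
    """Linear map under which squared L2 distances between word representations
    approximate distances in the gold dependency tree."""
    def __init__(self, dim, rank=128):
        super().__init__()
        self.proj = torch.nn.Parameter(torch.randn(dim, rank) * 0.01)

    def forward(self, word_reps):                      # word_reps: [num_words, dim]
        transformed = word_reps @ self.proj            # [num_words, rank]
        diff = transformed.unsqueeze(1) - transformed.unsqueeze(0)
        return (diff ** 2).sum(-1)                     # predicted pairwise tree distances

def probe_loss(pred_dist, tree_dist):
    # L1 loss between predicted and gold tree distances, normalized by
    # squared sentence length, following Hewitt & Manning (2019)
    n = tree_dist.size(0)
    return torch.abs(pred_dist - tree_dist).sum() / (n ** 2)

def train_probe(batches, dim=768, epochs=10, lr=1e-3):
    """`batches` yields (word_reps, tree_dist) pairs: S3M frame vectors pooled
    within word boundaries, and the gold tree-distance matrix per sentence."""
    probe = StructuralProbe(dim)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for word_reps, tree_dist in batches:
            opt.zero_grad()
            loss = probe_loss(probe(word_reps), tree_dist)
            loss.backward()
            opt.step()
    return probe
```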

References

The empirical limits of decoding such information from S3Ms remain to be explored. Orhan et al. report that dependency structures extracted from a Wav2Vec2 model (trained on 900 hours of speech for up to 400K steps) approach the accuracy of those extracted from text-based models, demonstrating that S3M representations can encode rich syntactic structure from self-supervised training alone.

Tracking the emergence of linguistic structure in self-supervised models learning from speech  (2604.02043 - Kloots et al., 2 Apr 2026) in Discussion and Conclusions