Theoretical analysis of why pseudo-labeling improves ASR performance

Develop a theoretical framework that explains why training Conformer-1 with large-scale pseudo-labeled speech data via Noisy Student Training yields empirical improvements in Word Error Rate (WER) and robustness. Rigorously evaluate hypothesized mechanisms, such as suppression of outlier samples and expanded coverage of the training distribution, to establish a scientific basis for these effects.

Background

Conformer-1 demonstrates substantial empirical gains by augmenting 57k hours of human-labeled data with 520k hours of pseudo-labeled public speech, following a Noisy Student Training paradigm. While the empirical benefits are clear across multiple benchmarks, the work explicitly notes the absence of a theoretical explanation for why pseudo-labeling produces these improvements.
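
To make the training paradigm concrete, the sketch below outlines one round of a Noisy Student style pseudo-labeling loop. All components (the teacher interface, the confidence filtering threshold, and the dummy trainer) are illustrative assumptions, not Conformer-1's actual pipeline.

```python
# Minimal sketch of one Noisy Student Training (NST) round for ASR.
# All components here are hypothetical placeholders, not Conformer-1's pipeline.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Utterance:
    audio_id: str
    transcript: Optional[str] = None  # None for unlabeled public speech
    confidence: float = 1.0           # teacher confidence for pseudo-labels


def pseudo_label(
    teacher: Callable[[str], Tuple[str, float]],
    unlabeled: List[Utterance],
    min_confidence: float = 0.9,      # assumed filtering threshold (illustrative)
) -> List[Utterance]:
    """Run the teacher over unlabeled audio and keep confident hypotheses."""
    kept = []
    for utt in unlabeled:
        hypothesis, conf = teacher(utt.audio_id)
        if conf >= min_confidence:
            kept.append(Utterance(utt.audio_id, hypothesis, conf))
    return kept


def nst_round(
    teacher: Callable[[str], Tuple[str, float]],
    train_student: Callable[[List[Utterance]], Callable],
    labeled: List[Utterance],
    unlabeled: List[Utterance],
) -> Callable:
    """One NST round: pseudo-label, merge with human labels, train a noised student."""
    pseudo = pseudo_label(teacher, unlabeled)
    combined = labeled + pseudo       # e.g. 57k h human-labeled + 520k h pseudo-labeled
    return train_student(combined)    # student applies noise/augmentation internally


if __name__ == "__main__":
    # Dummy teacher and trainer so the sketch runs end to end.
    dummy_teacher = lambda audio_id: ("hello world", 0.95)
    dummy_trainer = lambda data: (lambda audio_id: ("hello world", 0.97))

    labeled = [Utterance("lab_001", "good morning")]
    unlabeled = [Utterance("pub_001"), Utterance("pub_002")]
    student = nst_round(dummy_teacher, dummy_trainer, labeled, unlabeled)
    print(student("pub_001"))
```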

The authors hypothesize mechanisms such as outlier suppression and broader distributional coverage but emphasize the need for a principled theoretical account. Establishing such a framework would clarify when and why pseudo-labeling is effective and guide future design choices for semi-supervised ASR systems.
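
One way to make the coverage hypothesis testable, and eventually amenable to theory, is to quantify how much the pseudo-labeled pool reduces the distance between test utterances and their nearest training examples in some embedding space. The sketch below uses random vectors as stand-ins for acoustic-encoder embeddings; the proxy metric and all data are illustrative assumptions, not an analysis from the paper.

```python
# Hypothetical probe for the "broader distributional coverage" hypothesis:
# measure how close each test utterance's embedding lies to the training set,
# with and without the pseudo-labeled pool. Embeddings are random placeholders;
# in practice they would come from an acoustic encoder.

import numpy as np
from scipy.spatial.distance import cdist


def nearest_neighbor_distances(train_emb: np.ndarray, test_emb: np.ndarray) -> np.ndarray:
    """For each test embedding, distance to its closest training embedding."""
    return cdist(test_emb, train_emb).min(axis=1)


rng = np.random.default_rng(0)
dim = 32
human_labeled = rng.normal(0.0, 1.0, size=(500, dim))    # stand-in for the human-labeled set
pseudo_labeled = rng.normal(0.0, 2.0, size=(2000, dim))  # broader stand-in for pseudo-labels
test_set = rng.normal(0.0, 1.5, size=(200, dim))         # slightly shifted test domain

d_human = nearest_neighbor_distances(human_labeled, test_set)
d_combined = nearest_neighbor_distances(
    np.vstack([human_labeled, pseudo_labeled]), test_set
)

# If pseudo-labels truly expand coverage, test utterances should sit closer to
# the combined training set than to the human-labeled set alone.
print(f"median NN distance, human-labeled only : {np.median(d_human):.3f}")
print(f"median NN distance, + pseudo-labeled   : {np.median(d_combined):.3f}")
```

A similar proxy could be designed for the outlier-suppression hypothesis, for example by comparing per-sample gradient norms or loss contributions with and without pseudo-labeled data in the mix.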

References

"However this conclusion is purely based off empirical results and further theoretical analysis of these results is an open area of exploration. We hypothesize that pseudo-labels are helping for different reasons, including suppressing the negative effects of outlier samples and covering a wider train distribution, but want to get a more scientific basis for our explanations."

Conformer-1: Robust ASR via Large-Scale Semisupervised Bootstrapping (2404.07341 - Zhang et al., 10 Apr 2024) in Section 'Limitations and Future Work'