- The paper proposes enhancing ASR noise robustness using parallel clean/noisy data and teacher-student learning with a novel logit selection method.
- The approach yields significant WER reductions of approximately 10.1%, 28.7%, and 19.6% on clean, simulated noisy, and realistic test sets, respectively.
- This method provides a practical way to improve ASR robustness in real-world conditions without relying heavily on costly manually transcribed data.
Improving Noise Robustness of Automatic Speech Recognition via Parallel Data and Teacher-Student Learning
This paper presents a method for improving the noise robustness of Automatic Speech Recognition (ASR) systems using parallel data and teacher-student (T/S) learning. The focus is on achieving better performance under multimedia noise conditions without relying heavily on manually transcribed data, which is costly and time-intensive to gather.
The proposed approach leverages a parallel corpus composed of clean and noisy speech data, where the clean data is artificially corrupted to generate noisy counterparts. This parallel data serves as the foundation for T/S learning, a technique initially developed to distill knowledge from larger models into smaller ones, but here repurposed for domain adaptation. The core objective is to reduce the Word Error Rate (WER) of ASR systems under challenging noisy conditions.
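The paper does not spell out the corruption step in code, but a minimal sketch of how a clean utterance could be mixed with multimedia noise at a chosen signal-to-noise ratio (hypothetical function and parameter names) is shown below:

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Corrupt a clean waveform with additive noise at a target SNR (dB).

    Illustrative sketch only, not the authors' pipeline.
    """
    # Tile or trim the noise segment to match the clean signal length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so the clean-to-noise power ratio matches the target SNR.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Because each noisy signal is generated directly from its clean counterpart, the teacher and student can be fed exactly parallel, frame-aligned inputs during training.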
Key to the proposed approach is the introduction of a logits selection method, which retains only the top k logit values from the teacher. This avoids misleading the student model with erroneous senone predictions and reduces the memory and bandwidth required to transfer soft targets. Empirical results demonstrate that the method yields significant WER reductions of approximately 10.1%, 28.7%, and 19.6% on clean, simulated noisy, and realistic test sets, respectively, compared to a sequence-trained teacher model.
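As an illustration of the idea (not the authors' implementation; names and shapes are assumed), selecting and storing only the top-k teacher logits per frame could look like this:

```python
import numpy as np

def select_top_k_logits(teacher_logits: np.ndarray, k: int = 20):
    """Keep only the k largest logits per frame; return their values and indices.

    Storing (values, indices) instead of the full senone vector, which often
    has thousands of entries, cuts the memory and bandwidth needed to move
    the teacher's soft targets. Illustrative sketch only.
    """
    # Indices of the k largest logits in each row (frame).
    top_idx = np.argpartition(teacher_logits, -k, axis=-1)[..., -k:]
    top_val = np.take_along_axis(teacher_logits, top_idx, axis=-1)
    return top_val, top_idx
```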
Methodology
The architecture used for both teacher and student models comprises Long Short-Term Memory (LSTM) layers followed by a fully connected output layer for senone classification, with log Mel-filterbank energies as input features. The T/S training paradigm minimizes the Kullback-Leibler divergence between the output distributions of the teacher and student models; the student relies exclusively on the soft targets generated by the teacher, eliminating the need for direct supervision through transcriptions.
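A minimal PyTorch-style sketch of this frame-level T/S objective, assuming the teacher's logits for the parallel clean utterance are available, might look as follows:

```python
import torch
import torch.nn.functional as F

def ts_frame_loss(student_logits: torch.Tensor,
                  teacher_logits: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """Kullback-Leibler divergence between teacher and student senone posteriors.

    The student consumes noisy features, the teacher the parallel clean
    features; the teacher's posteriors are the only supervision signal,
    so no transcriptions are required. Illustrative sketch only.
    """
    teacher_post = F.softmax(teacher_logits / temperature, dim=-1)
    student_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # "batchmean" averages the divergence over the frames in the minibatch.
    return F.kl_div(student_logp, teacher_post, reduction="batchmean")
```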
The paper systematically evaluates several configurations of distillation temperature and logits selection to ascertain their impact on recognition accuracy. These experiments show that a distillation temperature of T=2 and a logits selection threshold of k=20 yield the best performance. The modification to the logits computation, in which non-selected logits are assigned a high negative constant, proves effective while preserving training efficiency.
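One plausible way to realize this, continuing the earlier top-k sketch (the constant -1e4 and the names here are illustrative, not taken from the paper), is to scatter the retained logits back into a full senone vector filled with a large negative constant before applying the temperature-scaled softmax:

```python
import torch
import torch.nn.functional as F

def restore_and_soften(top_val: torch.Tensor,   # (frames, k) retained logits
                       top_idx: torch.Tensor,   # (frames, k) senone indices
                       num_senones: int,
                       temperature: float = 2.0,
                       neg_const: float = -1e4) -> torch.Tensor:
    """Rebuild full teacher logit vectors from top-k entries and soften them.

    Non-selected senones receive a large negative constant, so after the
    temperature-scaled softmax they carry negligible probability mass and
    cannot mislead the student. Illustrative sketch only.
    """
    full = torch.full((top_val.shape[0], num_senones), neg_const,
                      dtype=top_val.dtype, device=top_val.device)
    full.scatter_(-1, top_idx, top_val)
    return F.softmax(full / temperature, dim=-1)
```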
Results
Performance was evaluated using three distinct test datasets: clean, simulated noisy, and realistic conditions. The student models trained with the T/S framework exhibited robust improvements compared to both the baseline teacher and multi-condition trained models. Notably, increasing the parallel training dataset to 4,800 hours yielded the most pronounced accuracy gains, suggesting an optimal balance between data volume and model capacity.
The authors also explore the utility of sequence training by further refining both teacher and student models using the state-level Minimum Bayes Risk (sMBR) criterion. Sequence training provides additional WER reductions and represents a meaningful extension of the initial gains achieved through cross-entropy training.
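For context, sMBR training maximizes the expected state-level accuracy over a lattice of competing hypotheses; in the standard formulation (notation assumed here, not reproduced from the paper):

$$
\mathcal{F}_{\mathrm{sMBR}} = \sum_{u} \frac{\sum_{W} p(O_u \mid W)^{\kappa}\, P(W)\, A(W, W_u)}{\sum_{W'} p(O_u \mid W')^{\kappa}\, P(W')}
$$

where $O_u$ is the observation sequence of utterance $u$, $W$ ranges over hypotheses in the lattice, $\kappa$ is the acoustic scaling factor, $P(W)$ is the language model probability, and $A(W, W_u)$ counts the correctly labeled states relative to the reference $W_u$.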
Implications and Future Work
This research contributes a practical approach to enhancing noise robustness in ASR systems, with implications for real-world applications where clean, transcribed data is limited. Combining T/S learning with parallel data reduces WER without exhaustive manual transcription, although greater diversity in the noise corpus may provide additional benefits.
Future directions proposed by the authors include expanding the corpus of noise profiles and exploring the scalability of model architectures and training datasets. Another area for exploration is the refinement of soft target selection based on the certainty of the teacher, potentially enhancing student model training.
This work signifies a noteworthy advancement in ASR robustness under noise, steering efforts towards more efficient, scalable, and practical solutions for evolving acoustic environments. The foundation set by this paper holds promise for ongoing and future explorations into robust ASR technologies.