- The paper presents RemixIT, a self-supervised method that continually trains speech enhancement models by remixing bootstrapped clean and noisy signals.
- RemixIT employs a student-teacher framework that refines pseudo-targets and achieves superior performance compared to existing methods in various adaptation tasks.
- The approach broadens the practical deployment of speech enhancement systems to diverse real-world environments where clean in-domain data is scarce.
RemixIT: Continual Self-Training of Speech Enhancement Models via Bootstrapped Remixing
The paper introduces RemixIT, a novel method for training speech enhancement models in a self-supervised manner, eliminating the need for clean in-domain speech or noise waveforms. This approach addresses the limitations of previous methods that relied on clean target signals and were sensitive to domain mismatches between training and testing samples. RemixIT leverages a continual self-training scheme, where a pre-trained teacher model, trained on out-of-domain data, infers estimated pseudo-target signals for in-domain mixtures. By remixing the estimated clean and noise signals, new bootstrapped mixtures are generated, which are used to train a student network. RemixIT enables continual refinement of the teacher’s estimates and demonstrates superior performance compared to existing methods, showcasing its versatility across various separation models and adaptation tasks.
Methodology
RemixIT employs a student-teacher framework in which the teacher model, pre-trained on out-of-domain data, provides pseudo-targets for in-domain noisy mixtures. The teacher's speech estimates are remixed with permuted noise estimates (noise drawn from other examples in the batch) to create new bootstrapped mixtures. The student is then trained to reconstruct the teacher's speech and noise estimates from these bootstrapped mixtures, even though the pseudo-targets are themselves imperfect. Periodically, the teacher is refreshed from the updated student parameters, establishing a continual self-training scheme in which the pseudo-targets improve over time.
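The remixing step described above can be sketched in a few lines. The following is a minimal numpy illustration, not the paper's implementation: `teacher_separate` is a stand-in placeholder for a real pre-trained enhancement network, and the EMA-style teacher update is one of the update protocols the paper considers.

```python
import numpy as np

rng = np.random.default_rng(0)

def teacher_separate(mixtures):
    """Placeholder for a pre-trained teacher enhancement model.
    It returns (speech_estimate, noise_estimate) pairs; here we use a
    trivial 50/50 split purely so the sketch runs end to end."""
    speech_est = 0.5 * mixtures
    noise_est = mixtures - speech_est
    return speech_est, noise_est

def bootstrapped_remix(speech_est, noise_est, rng):
    """Core RemixIT step: pair each speech estimate with a noise
    estimate drawn from a *different* example in the batch, producing
    new synthetic in-domain mixtures."""
    perm = rng.permutation(speech_est.shape[0])
    return speech_est + noise_est[perm], perm

def update_teacher(teacher_w, student_w, beta=0.99):
    """EMA-style teacher refresh from the student's weights
    (one of several possible teacher-update protocols)."""
    return beta * teacher_w + (1.0 - beta) * student_w

# Toy batch of 4 in-domain noisy mixtures, 16000 samples each.
mixtures = rng.standard_normal((4, 16000)).astype(np.float32)

s_hat, n_hat = teacher_separate(mixtures)
bootstrapped, perm = bootstrapped_remix(s_hat, n_hat, rng)

# The student would now be trained to recover (s_hat, n_hat[perm])
# from `bootstrapped`; its loss is computed against these estimated
# pseudo-targets rather than clean references.
```

Note that no clean speech or isolated noise recording ever enters this loop: the only inputs are in-domain noisy mixtures, and all targets are bootstrapped from the teacher's own estimates.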
Results
Experimental results across multiple datasets and tasks highlight the effectiveness of RemixIT. The method surpasses previous approaches on speech enhancement benchmarks and adapts to both semi-supervised and unsupervised domain adaptation settings. Notably, RemixIT remains robust across a range of input noise levels: the student consistently improves over the teacher even when the pseudo-target signals are severely degraded.
Implications and Future Directions
The theoretical implications of RemixIT lie in its ability to learn from imperfect teacher estimates, providing insights into self-supervised training dynamics in speech enhancement. Practically, the method broadens the applicability of speech enhancement systems to diverse real-world environments where access to clean, labeled data is limited. Future developments could focus on improving the quality of pseudo-target signals and exploring integration with large-scale AI systems, pushing towards more generalized and efficient speech enhancement solutions.
Conclusion
RemixIT presents a significant advancement in self-supervised speech enhancement, addressing key limitations of prior methods and offering a scalable solution for diverse real-world applications. By combining continual self-training with bootstrapped remixing, it provides a robust framework adaptable to a range of scenarios while delivering substantial improvements across multiple benchmarks. Methods like RemixIT are crucial for improving the fidelity and real-world applicability of speech enhancement technologies, paving the way for future innovations in the field.