Self-supervised Reflective Learning through Self-distillation and Online Clustering for Speaker Representation Learning (2401.01473v3)
Abstract: Speaker representation learning is crucial for speaker recognition systems, and recent advances in self-supervised approaches have reduced the dependency on labeled data. Current two-stage iterative frameworks, while effective, suffer significant computational overhead from repeated rounds of clustering and training. They also struggle with noisy pseudo labels that can impair model learning. This paper introduces self-supervised reflective learning (SSRL), an improved framework that addresses these limitations by continuously refining pseudo labels during training. Through a teacher-student architecture and an online clustering mechanism, SSRL eliminates the need for iterative training rounds. To handle label noise, we incorporate noisy label modeling and pseudo label queues that maintain temporal consistency. Experiments on VoxCeleb show SSRL's superiority over current two-stage iterative approaches, surpassing the performance of a five-round method in just a single training round. Ablation studies validate the contributions of key components such as noisy label modeling and the pseudo label queues. Moreover, consistent improvements in pseudo labeling and the convergence of cluster counts demonstrate SSRL's effectiveness in deciphering unlabeled data. This work marks an important advancement in efficient and accurate self-supervised speaker representation learning through the reflective learning paradigm.
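To make the mechanism described in the abstract concrete, below is a minimal, self-contained sketch of the control flow of such a reflective training loop. This is not the authors' implementation: the linear toy encoder, the nearest-centroid "online clustering", the majority-vote pseudo label queue, the placeholder gradient step, and all names and hyper-parameters (`EMA_DECAY`, `QUEUE_LEN`, etc.) are illustrative assumptions.

```python
# Illustrative sketch (NOT the paper's code) of a reflective training loop:
# an EMA teacher produces embeddings, online centroids assign pseudo labels,
# and a per-utterance queue smooths label flips over time.
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, N_CLUSTERS, QUEUE_LEN, EMA_DECAY = 16, 8, 5, 0.99

student = rng.normal(size=(EMB_DIM, EMB_DIM))      # stand-in "network" weights
teacher = student.copy()                           # teacher starts as a copy
centroids = rng.normal(size=(N_CLUSTERS, EMB_DIM)) # online cluster centroids
queues = {}                                        # utterance id -> recent labels

def embed(weights, x):
    """Toy encoder: linear map + L2 normalization."""
    z = weights @ x
    return z / (np.linalg.norm(z) + 1e-12)

def assign(z):
    """Online clustering step: nearest centroid by cosine similarity."""
    sims = centroids @ z / (np.linalg.norm(centroids, axis=1) + 1e-12)
    return int(np.argmax(sims))

def training_step(utt_id, x, lr=0.1):
    global student, teacher
    # 1. The teacher produces the embedding used for pseudo labeling.
    z_t = embed(teacher, x)
    label = assign(z_t)

    # 2. Pseudo label queue: keep the last QUEUE_LEN assignments and train on
    #    the majority vote, damping noisy, rapidly flipping labels.
    q = queues.setdefault(utt_id, [])
    q.append(label)
    del q[:-QUEUE_LEN]
    smoothed = max(set(q), key=q.count)

    # 3. Student update (placeholder for the real classification loss:
    #    a gradient step pulling the student embedding toward its target).
    z_s = embed(student, x)
    student -= lr * np.outer(z_s - centroids[smoothed], x)

    # 4. Reflective part: the teacher tracks the student by EMA, and the
    #    assigned centroid drifts toward the teacher embedding, so pseudo
    #    labels are refined continuously rather than in separate rounds.
    teacher = EMA_DECAY * teacher + (1 - EMA_DECAY) * student
    centroids[label] = 0.9 * centroids[label] + 0.1 * z_t
    return smoothed

for step in range(100):
    training_step(utt_id=step % 10, x=rng.normal(size=EMB_DIM))
```

The point of the sketch is the control flow rather than the specific losses: pseudo labels come from the teacher and are stabilized by the queue, while the teacher and centroids drift with the student, so labeling improves online instead of in discrete clustering-and-retraining rounds.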