Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification (2312.03620v3)
Abstract: Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. These ResNet models treat the time and frequency dimensions equally, following the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and search for stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, conduct a systematic study of how temporal and frequency resolutions affect performance, and identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following this principle, a state-of-the-art ResNet baseline gains a significant performance improvement on the VoxCeleb, SITW, and CN-Celeb datasets, with average EER/minDCF reductions of 7.70%/11.76%, respectively, across different network depths (ResNet18, 34, 50, and 101), while reducing the number of parameters by 16.5% and FLOPs by 4.1%. We refer to the resulting model as Gemini ResNet. Further investigation reveals the efficacy of the proposed Golden Gemini operating points across various training conditions and architectures. Furthermore, we present a new benchmark, the Gemini DF-ResNet, built on a cutting-edge model.
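The core idea is that a 2D ResNet need not downsample time and frequency at the same rate. The minimal PyTorch-style sketch below only illustrates this contrast: it compares a standard, image-recognition-style symmetric stride with an asymmetric stride that downsamples frequency while preserving temporal resolution. The specific stride values, the `downsample_stage` helper, and the (batch, channels, frequency, time) tensor layout are illustrative assumptions; they are not the Golden Gemini operating points, which are defined in the paper itself.

```python
# Illustrative sketch only: hypothetical stride values, not the paper's
# Golden Gemini configuration.
import torch
import torch.nn as nn

def downsample_stage(in_ch, out_ch, stride):
    """One downsampling conv block; stride = (freq_stride, time_stride),
    assuming inputs shaped (batch, channels, frequency, time)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

x = torch.randn(4, 32, 80, 200)  # (batch, channels, 80 mel bins, 200 frames)

# Symmetric, ImageNet-style stride: time and frequency are treated equally.
symmetric = downsample_stage(32, 64, stride=(2, 2))
print(symmetric(x).shape)   # torch.Size([4, 64, 40, 100])

# Asymmetric stride (hypothetical values): frequency is downsampled while
# temporal resolution is preserved, reflecting the time/frequency asymmetry
# of speech discussed in the abstract.
asymmetric = downsample_stage(32, 64, stride=(2, 1))
print(asymmetric(x).shape)  # torch.Size([4, 64, 40, 200])
```

The design question the paper studies is where, and how aggressively, to apply such asymmetric downsampling across the network's stages; the sketch above only shows the mechanism, not the recommended configuration.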