NeXt-TDNN: Modernizing Multi-Scale Temporal Convolution Backbone for Speaker Verification (2312.08603v2)
Abstract: In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by borrowing design choices from Transformers, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is built from two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows inter-frame and intra-frame contexts to be captured flexibly. Additionally, we introduce global response normalization (GRN) into the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. Experimental results demonstrate that NeXt-TDNN, with its modernized backbone block, significantly improves performance in speaker verification tasks while reducing parameter size and inference time. We have released our code for future studies.
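As a rough illustration of the GRN step mentioned in the abstract, the following is a minimal sketch that assumes the ConvNeXt V2 formulation adapted to 1D features of shape (channels, frames). The function name `global_response_norm`, the list-of-lists layout, and the `eps` value are illustrative assumptions, not the released NeXt-TDNN code.

```python
import math


def global_response_norm(x, gamma, beta, eps=1e-6):
    """Sketch of GRN on 1D features (assumed ConvNeXt V2 formulation).

    x:     list of C channels, each a list of T frame values
    gamma: per-channel scale (learnable in a real model)
    beta:  per-channel shift (learnable in a real model)
    """
    # Step 1: aggregate each channel over time with an L2 norm.
    g = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # Step 2: normalize each channel's norm by the mean norm across channels.
    mean_g = sum(g) / len(g)
    n = [g_c / (mean_g + eps) for g_c in g]
    # Step 3: rescale the features and add the input back as a residual.
    return [
        [gamma[c] * v * n[c] + beta[c] + v for v in channel]
        for c, channel in enumerate(x)
    ]


# Hypothetical toy input: channel 0 is active, channel 1 is silent.
features = [[3.0, 4.0], [0.0, 0.0]]
out = global_response_norm(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
identity = global_response_norm(features, gamma=[0.0, 0.0], beta=[0.0, 0.0])
```

In this sketch, channels whose temporal energy is above the cross-channel average are amplified and the rest are damped, which is one way to read "more selective feature propagation". With `gamma` and `beta` set to zero the function returns its input unchanged.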