Neighborhood Attention Transformer with Progressive Channel Fusion for Speaker Verification (2405.12031v2)

Published 20 May 2024 in cs.SD and eess.AS

Abstract: Transformer-based architectures for speaker verification typically require more training data than ECAPA-TDNN. Therefore, recent work has generally been trained on VoxCeleb1&2. We propose a backbone network based on self-attention, which can achieve competitive results when trained on VoxCeleb2 alone. The network alternates between neighborhood attention and global attention to capture local and global features, then aggregates features of different hierarchical levels, and finally performs attentive statistics pooling. Additionally, we employ a progressive channel fusion strategy to expand the receptive field in the channel dimension as the network deepens. We trained the proposed PCF-NAT model on VoxCeleb2 and evaluated it on VoxCeleb1 and the validation sets of VoxSRC. The EER and minDCF of the shallow PCF-NAT are on average more than 20% lower than those of similarly sized ECAPA-TDNN. Deep PCF-NAT achieves an EER lower than 0.5% on VoxCeleb1-O. The code and models are publicly available at https://github.com/ChenNan1996/PCF-NAT.
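The alternation between neighborhood attention and global attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see their repository for that). It assumes a single head, a 1D sequence of frames, no learned projections, and an odd kernel size whose window is clamped at the sequence edges. The names `neighborhood_mask`, `attention`, and `kernel` are hypothetical and chosen for illustration only.

```python
import numpy as np

def neighborhood_mask(n, kernel):
    """Boolean (n, n) mask; row i marks the `kernel` positions nearest to i.

    Assumption: windows are clamped at the sequence edges, so every row
    has exactly `kernel` True entries.
    """
    assert kernel % 2 == 1 and kernel <= n
    half = kernel // 2
    # First column of each row's window, shifted inward near the edges.
    starts = np.clip(np.arange(n) - half, 0, n - kernel)
    cols = np.arange(n)[None, :]
    return (cols >= starts[:, None]) & (cols < starts[:, None] + kernel)

def attention(queries, keys, values, mask=None):
    """Scaled dot-product attention; positions where mask is False are ignored."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values

# Toy input: 8 frames with 4 channels each (hypothetical sizes).
rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 4))

# Local step: each frame attends only to its 3 nearest frames.
local_out = attention(frames, frames, frames, neighborhood_mask(8, 3))
# Global step: each frame attends to every frame.
global_out = attention(local_out, local_out, local_out)
```

In this sketch the local step is ordinary attention with a banded mask, which keeps the code short but still costs O(n²). Dedicated neighborhood attention kernels avoid computing the masked-out scores at all. When `kernel` equals the sequence length, the mask is all True and the local step reduces to global attention.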
