TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning Approaches (2404.12077v1)
Abstract: This study employs deep learning techniques to explore four speaker profiling tasks on the TIMIT dataset, namely gender classification, accent classification, age estimation, and speaker identification, highlighting the potential and challenges of multi-task learning versus single-task models. The motivation for this research is twofold: firstly, to empirically assess the advantages and drawbacks of multi-task learning over single-task models in the context of speaker profiling; secondly, to emphasize the undiminished significance of skillful feature engineering for speaker recognition tasks. The findings reveal challenges in accent classification, and multi-task learning is found advantageous for tasks of similar complexity. Non-sequential features are favored for speaker recognition, but sequential ones can serve as starting points for complex models. The study underscores the necessity of meticulous experimentation and parameter tuning for deep learning models.