Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling (2404.00856v1)
Abstract: Recently, self-supervised frameworks have been used to encode the linguistic content of speech for speech synthesis. However, predicting a representation from its surrounding representations can inadvertently entangle speaker information in the speech representation. This paper aims to remove speaker information by exploiting the structure of speech, which is composed of discrete units such as phonemes with clear boundaries. A neural network predicts these boundaries, enabling variable-length pooling that extracts event-based representations instead of fixed-rate ones. Because the boundary predictor outputs a probability between 0 and 1 for each boundary, the pooling is soft. The model is trained to minimize the difference between its pooled representations and those of the same utterance augmented by time-stretching and pitch-shifting. To confirm that the learned representation captures content information while remaining independent of speaker information, the model is evaluated on Libri-light's phonetic ABX task and SUPERB's speaker identification task.
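The central mechanism, boundary-driven variable-length soft pooling, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical implementation: the MLP boundary predictor, the triangular-kernel soft assignment, and all dimensions are assumptions for illustration and are not taken from the paper's actual code.

```python
# Minimal sketch of variable-length soft pooling driven by predicted boundary
# probabilities. Illustrative only: the boundary predictor architecture, the
# triangular kernel, and the dimensions are assumptions, not the paper's code.

import torch
import torch.nn as nn


class BoundaryPredictor(nn.Module):
    """Predicts a per-frame boundary probability in [0, 1]."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) -> boundary probabilities: (batch, time)
        return torch.sigmoid(self.proj(frames)).squeeze(-1)


def soft_pool(frames: torch.Tensor, boundary_probs: torch.Tensor, num_segments: int) -> torch.Tensor:
    """Softly pool frame representations into `num_segments` event-level vectors.

    The cumulative sum of boundary probabilities gives each frame a continuous
    segment position; a triangular kernel turns that position into soft
    assignment weights, so gradients flow back into the boundary predictor.
    """
    # Continuous segment index per frame, shape (batch, time).
    seg_pos = torch.cumsum(boundary_probs, dim=1)

    # Kernel weights between each frame and each segment centre k = 0..K-1.
    centres = torch.arange(num_segments, device=frames.device).float()  # (K,)
    dist = (seg_pos.unsqueeze(-1) - centres).abs()                      # (batch, time, K)
    weights = torch.clamp(1.0 - dist, min=0.0)                          # triangular kernel

    # Weighted average of frames per segment: (batch, K, dim).
    weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-8)
    return torch.einsum("btk,btd->bkd", weights, frames)


if __name__ == "__main__":
    torch.manual_seed(0)
    frames = torch.randn(2, 100, 256)        # e.g. 100 frames of 256-dim features
    predictor = BoundaryPredictor(256)
    probs = predictor(frames)
    pooled = soft_pool(frames, probs, num_segments=20)
    print(pooled.shape)                      # torch.Size([2, 20, 256])
```

Because the assignment weights are differentiable in the boundary probabilities, a consistency loss between the pooled outputs of an utterance and its time-stretched or pitch-shifted version (as described in the abstract) can train the boundary predictor end to end; a hard, thresholded segmentation would block that gradient path.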