Transfer the linguistic representations from TTS to accent conversion with non-parallel data (2401.03538v1)
Abstract: Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and of different acoustic features within our proposed framework. We conduct a comprehensive evaluation using both subjective and objective metrics to assess the performance of our approach. The evaluation results highlight the benefits of the pretraining strategy and the incorporation of richer semantic features, which yield significantly enhanced audio quality and intelligibility.
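The abstract's core idea is to pull a speech encoder's frame-level outputs toward accent-agnostic linguistic representations produced by a TTS text encoder, so the conversion model can be trained without parallel accented/native pairs. A minimal toy sketch of such an alignment objective is below; all names, dimensions, and the L1 distance are illustrative assumptions, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, D-dimensional representations.
T, D = 50, 8

# Stand-ins for frame-level representations:
#   h_speech: output of a speech encoder on the accented utterance
#   h_text:   linguistic representations from a TTS text encoder,
#             upsampled to T frames (e.g., via predicted durations)
h_speech = rng.normal(size=(T, D))
h_text = rng.normal(size=(T, D))

def alignment_loss(h_s: np.ndarray, h_t: np.ndarray) -> float:
    """Mean L1 distance pulling speech representations toward the
    TTS linguistic representations (one plausible training signal)."""
    return float(np.abs(h_s - h_t).mean())

loss = alignment_loss(h_speech, h_text)
print(loss)
```

Minimizing such a distance encourages the speech encoder to discard accent-specific detail and keep only the linguistic content, which the decoder then renders in the target accent.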
- “Non-native speech synthesis preserving speaker individuality based on partial correction of prosodic and phonetic characteristics,” in Interspeech 2015, Sep 2015.
- “Subband based voice conversion,” in Conference of the International Speech Communication Association, 2002.
- “Parrotron: An end-to-end speech-to-speech conversion model and its applications to hearing-impaired speech and speech separation,” in Interspeech 2019, Sep 2019.
- “Accent conversion using phonetic posteriorgrams,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2018.
- “Foreign accent conversion by synthesizing speech from phonetic posteriorgrams,” in Interspeech 2019, Sep 2019.
- “Improving accent conversion with reference encoder and end-to-end text-to-speech,” arXiv preprint, May 2020.
- “Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,” Computer Speech & Language, vol. 72, p. 101302, 2022.
- “Converting foreign accent speech without a reference,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 2367–2381, Jan 2021.
- “Zero-shot foreign accent conversion without a native reference,” in Proc. Interspeech 2022, 2022, pp. 4920–4924.
- “Accent conversion using pre-trained model and synthesized data from voice conversion.”
- “End-to-end accent conversion without using native utterances,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6289–6293.
- “Voice-preserving zero-shot multiple accent conversion,” Nov 2022.
- “TTS-guided training for accent conversion without parallel data.”
- “FastSpeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
- “L2-ARCTIC: A non-native English speech corpus,” in Interspeech, 2018, pp. 2783–2787.
- “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
- “LibriTTS-R: A restored multi-speaker text-to-speech corpus,” arXiv preprint arXiv:2305.18802, 2023.
- “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.
- “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
- “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- Xi Chen
- Jiakun Pei
- Liumeng Xue
- Mingyang Zhang