AttentionStitch: How Attention Solves the Speech Editing Problem
Abstract: Generating natural, high-quality speech from text is a challenging problem in natural language processing. Speech editing is an equally important task: it requires edited speech to be integrated seamlessly and unnoticeably into synthesized speech. We propose a novel approach to speech editing that takes a pre-trained text-to-speech (TTS) model, such as FastSpeech 2, and adds a double attention block network on top of it to automatically merge the synthesized mel-spectrogram with the mel-spectrogram of the edited text. We call this model AttentionStitch, as it uses attention to stitch audio samples together. We evaluate AttentionStitch against state-of-the-art baselines on a single-speaker dataset (LJSpeech) and a multi-speaker dataset (VCTK). We demonstrate its superior performance through an objective evaluation and a subjective evaluation test involving 15 human participants. AttentionStitch produces high-quality speech, even for words not seen during training, and operates automatically without human intervention. Moreover, it is fast during both training and inference and generates human-sounding edited speech.
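To make the "double attention block" idea concrete, the following is a minimal NumPy sketch of one gather-and-distribute attention step applied to two stacked mel-spectrograms. It is an illustration under stated assumptions, not the authors' implementation: it assumes the block follows the general A2-Nets formulation (attend over frames to gather global descriptors, then redistribute them to every frame), and the function names (`double_attention`, `stitch`), the channel-wise concatenation of the two spectrograms, and all weight shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def double_attention(x, w_feat, w_gather, w_distribute, w_out):
    """One double attention step over a (channels, frames) feature map.

    Hypothetical shapes: x is (C, T); w_feat is (m, C);
    w_gather and w_distribute are (n, C); w_out is (C_out, m).
    """
    features = w_feat @ x                            # (m, T)
    # First attention: each of the n maps is a distribution over frames.
    gather = softmax(w_gather @ x, axis=1)           # (n, T)
    # Second attention: each frame holds a distribution over the n descriptors.
    distribute = softmax(w_distribute @ x, axis=0)   # (n, T)
    descriptors = features @ gather.T                # (m, n) global descriptors
    redistributed = descriptors @ distribute         # (m, T)
    return w_out @ redistributed                     # (C_out, T)


def stitch(synth_mel, context_mel, weights):
    """Merge two (n_mels, T) spectrograms; stacking them is an assumption."""
    stacked = np.concatenate([synth_mel, context_mel], axis=0)  # (2 * n_mels, T)
    return double_attention(stacked, *weights)


# Random stand-ins for real spectrograms and trained weights.
rng = np.random.default_rng(0)
n_mels, frames, m, n = 80, 50, 16, 8
weights = (
    rng.normal(size=(m, 2 * n_mels)),   # w_feat
    rng.normal(size=(n, 2 * n_mels)),   # w_gather
    rng.normal(size=(n, 2 * n_mels)),   # w_distribute
    rng.normal(size=(n_mels, m)),       # w_out
)
merged = stitch(rng.normal(size=(n_mels, frames)),
                rng.normal(size=(n_mels, frames)), weights)
print(merged.shape)
```

In a trained system the weights would be learned and the inputs would come from the TTS model and the recording being edited; here both are random, so the sketch only shows how tensor shapes flow through the two attention steps.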