DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation (2405.13274v2)
Abstract: Non-autoregressive Transformers (NATs) have recently been applied to direct speech-to-speech translation systems, which convert speech across languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results because of the complex data distribution (e.g., acoustic and linguistic variations in speech). In this work, we introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models. After training with a self-supervised noise estimation objective, DiffNorm constructs normalized target data by denoising synthetically corrupted speech features. Additionally, we propose to regularize NATs with classifier-free guidance, improving model robustness and translation quality by randomly dropping out source information during training. Our strategies yield improvements of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) translation on the CVSS benchmark, while attaining over 14x speedup for En-Es and 5x speedup for En-Fr compared to autoregressive baselines.
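The two training ideas named in the abstract can be illustrated with a minimal NumPy sketch. It assumes a standard DDPM-style forward process, which the abstract does not specify. All function names, the `alpha_bar_t` parameter, and the all-zeros "null" source are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def corrupt(x0, alpha_bar_t, rng):
    """Synthetically corrupt clean features x0 with Gaussian noise.

    Assumes the DDPM-style mixing x_t = sqrt(a) * x0 + sqrt(1 - a) * eps,
    where a = alpha_bar_t lies in (0, 1].
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def noise_estimation_loss(eps_pred, eps):
    """Self-supervised objective: mean squared error on the injected noise."""
    return float(np.mean((eps_pred - eps) ** 2))

def denoise_estimate(x_t, eps_pred, alpha_bar_t):
    """Invert the mixing above to estimate x0 from x_t and predicted noise."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def drop_source(src, p_drop, rng):
    """Classifier-free-guidance-style regularization (hypothetical form):
    with probability p_drop, replace the source features by an all-zeros
    'null' condition so the model also learns an unconditional mode."""
    if rng.random() < p_drop:
        return np.zeros_like(src)
    return src

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal((4, 8))          # stand-in for speech features
    x_t, eps = corrupt(x0, alpha_bar_t=0.5, rng=rng)
    # A trained model would predict eps from x_t; the true eps is used here
    # only to show that a perfect prediction recovers x0.
    x0_hat = denoise_estimate(x_t, eps, alpha_bar_t=0.5)
    print(np.allclose(x0_hat, x0), noise_estimation_loss(eps, eps))
```

In this sketch the noise target is generated from the data itself, which is what makes the objective self-supervised. The dropout probability `p_drop` is a free hyperparameter here and is not taken from the abstract.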