DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation (2405.13274v2)

Published 22 May 2024 in cs.CL

Abstract: Non-autoregressive Transformers (NATs) have recently been applied in direct speech-to-speech translation systems, which convert speech across different languages without intermediate text data. Although NATs generate high-quality outputs and offer faster inference than autoregressive models, they tend to produce incoherent and repetitive results due to complex data distributions (e.g., acoustic and linguistic variations in speech). In this work, we introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models. After training with a self-supervised noise estimation objective, DiffNorm constructs normalized target data by denoising synthetically corrupted speech features. Additionally, we propose to regularize NATs with classifier-free guidance, improving model robustness and translation quality by randomly dropping out source information during training. Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) translations on the CVSS benchmark, while attaining over 14x speedup for En-Es and 5x speedup for En-Fr translations compared to autoregressive baselines.


Summary

  • The paper introduces DiffNorm, a self-supervised diffusion normalization strategy that simplifies speech feature distributions in non-autoregressive translation systems.
  • It employs synthetic noise injection and a denoising process to tackle the multi-modality problem, resulting in coherent, high-quality outputs.
  • The approach, enhanced by classifier-free guidance, achieves significant ASR-BLEU improvements and inference speed gains over autoregressive models.

Simplifying Non-Autoregressive Speech Translation with DiffNorm

Introduction

In recent years, Non-Autoregressive Transformers (NATs) have shown promise for direct speech-to-speech translation (S2ST), yielding faster inference and maintaining competitive translation quality compared to their autoregressive counterparts. However, NATs struggle with the "multi-modality problem," which results in incoherent and repetitive outputs due to the complexity of speech data distributions.

To address this, a new strategy known as DiffNorm has been introduced. DiffNorm relies on diffusion-based normalization to simplify these data distributions, thus enhancing the performance of NATs. This article will break down the core concepts behind DiffNorm, its implementation, and the resulting benefits.

DiffNorm: A New Approach to Speech Normalization

The Multi-Modality Problem in NATs

NATs can generate high-quality outputs and offer significant speed advantages over autoregressive models. However, they often produce outputs that are incoherent or repetitive. This issue stems from the assumption of conditional independence during parallel decoding, which struggles to capture the complex variations in speech data, such as acoustic and linguistic differences.
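As a toy illustration (ours, not the paper's) of why conditional independence causes this: when a source has two equally valid target sequences, a decoder that picks each position independently from per-position marginals can mix the two modes into an output that was never a valid target.

```python
import numpy as np

# Two equally likely translations of the same source, as token sequences:
# mode 1 is "A B", mode 2 is "B A".
A, B = 0, 1
targets = [np.array([A, B]), np.array([B, A])]

# Per-position token marginals that a conditionally independent decoder learns.
marginals = np.zeros((2, 2))  # (position, token)
for seq in targets:
    for pos, tok in enumerate(seq):
        marginals[pos, tok] += 0.5

# The marginal at every position is uniform, so decoding each position
# independently can yield "A A" or "B B" -- neither was a valid target.
decoded = marginals.argmax(axis=1)  # ties resolved arbitrarily (first index here)
```

Here `decoded` is `[A, A]`, a mixture of the two modes. DiffNorm attacks this by simplifying the target distribution itself rather than changing the decoder's factorization.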

Introducing DiffNorm

DiffNorm is a self-supervised strategy based on Denoising Diffusion Probabilistic Models (DDPM). It works by injecting synthetic noise into speech features and then recovering the original features through a denoising process. The denoising objective helps create a simpler and more consistent data distribution, which is crucial for training NAT models effectively.

Here’s how DiffNorm works in a nutshell:

  1. Synthetic Noise Injection: Speech features are injected with noise, which creates a corrupted version of the original data.
  2. Denoising Process: Using a diffusion model, the system gradually removes the noise to recover the speech features.

By training the system to denoise synthetically corrupted features, DiffNorm normalizes the data. This eliminates the need for transcription data or manually crafted perturbation functions.
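The two steps above follow standard DDPM mechanics. Below is a minimal NumPy sketch under common DDPM assumptions (linear beta schedule, closed-form forward corruption); it is illustrative only, as the paper's actual system uses a learned noise-estimation network over latent speech features, not the oracle noise used here.

```python
import numpy as np

def make_schedule(T=100, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule and cumulative alpha products, as in DDPM."""
    betas = np.linspace(beta_start, beta_end, T)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def corrupt(x0, t, alpha_bars, rng):
    """Step 1 -- synthetic noise injection via the forward process:
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

def estimate_x0(xt, t, pred_eps, alpha_bars):
    """Step 2 -- denoising: closed-form x_0 estimate from predicted noise.
    In DiffNorm, pred_eps would come from a network trained with the
    self-supervised noise-estimation objective; the denoised features
    then serve as normalized NAT training targets."""
    return (xt - np.sqrt(1.0 - alpha_bars[t]) * pred_eps) / np.sqrt(alpha_bars[t])

rng = np.random.default_rng(0)
features = rng.standard_normal((50, 16))   # stand-in for latent speech features
betas, alpha_bars = make_schedule()
noisy, true_eps = corrupt(features, 40, alpha_bars, rng)
normalized = estimate_x0(noisy, 40, true_eps, alpha_bars)
```

With the oracle noise estimate the reconstruction is exact; with a learned estimator, the denoised output lands on a simpler, smoother distribution than the raw features, which is the normalization effect DiffNorm exploits.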

Enhancing NATs with Classifier-Free Guidance

In addition to DiffNorm, the researchers regularize NATs with classifier-free guidance, a technique borrowed from diffusion-model sampling. During training, source information is randomly dropped out, compelling the model to generate coherent outputs even without full context; at inference, predictions are steered toward the source-conditioned output. This makes the model more robust and yields higher-quality translations.
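A minimal sketch of the two pieces, with hypothetical names and shapes (the paper applies this to NAT decoders; the dropout probability and guidance strength are hyperparameters):

```python
import numpy as np

def maybe_drop_source(src, p_drop, rng):
    """Training-time regularization: with probability p_drop, replace the
    source conditioning with a null embedding, so the model also learns an
    unconditional distribution over target speech units."""
    if rng.random() < p_drop:
        return np.zeros_like(src)  # null source: model must rely on target context
    return src

def guided_logits(cond, uncond, w):
    """Inference-time classifier-free guidance: extrapolate away from the
    unconditional prediction toward the source-conditioned one."""
    return (1.0 + w) * cond - w * uncond

rng = np.random.default_rng(0)
src = rng.standard_normal((10, 32))            # source encoder states
src_in = maybe_drop_source(src, 0.15, rng)     # what the decoder sees in training
cond = rng.standard_normal((10, 100))          # logits with source conditioning
uncond = rng.standard_normal((10, 100))        # logits with source dropped
out = guided_logits(cond, uncond, 0.5)
```

Setting the guidance weight `w` to zero recovers plain conditional decoding; larger values push harder toward source-faithful outputs at the cost of diversity.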

Strong Numerical Results

The benefits of DiffNorm and classifier-free guidance are highlighted through strong numerical results:

  • English-Spanish (En-Es) Translation: Around a +7 ASR-BLEU improvement.
  • English-French (En-Fr) Translation: Around a +2 ASR-BLEU improvement.
  • Inference Speed: Achieving over 14× speedup for En-Es and 5× speedup for En-Fr compared to autoregressive baselines.

Implications and Future Directions

The findings have both practical and theoretical implications. On a practical level, the improvements in speed and accuracy make direct speech-to-speech translation systems more viable for real-world applications. Theoretically, the use of diffusion models and classifier-free guidance for NATs opens up new avenues for future research in AI and machine learning.

In the future, we can anticipate further developments that refine these techniques, making them even more efficient and accurate. This research lays the groundwork for more sophisticated speech translation systems that could become ubiquitous in various communication and accessibility applications.

Conclusion

DiffNorm and classifier-free guidance offer a promising solution to the multi-modality problem in NATs, resulting in significant performance improvements for direct speech-to-speech translation. These advancements not only enhance the efficiency and accuracy of these systems but also pave the way for future innovations in the field. If you’re intrigued and want to delve deeper, you can check out the full research and implementation details on GitHub.
