End-to-End Speech-to-Text Translation: A Survey (2312.01053v2)
Abstract: Speech-to-text translation (ST) is the task of converting speech signals in one language into text in another language. It has applications in various domains, such as hands-free communication, dictation, and video lecture transcription and translation. Traditional ST systems cascade an Automatic Speech Recognition (ASR) model with a Machine Translation (MT) model: ASR transcribes the spoken words into source-language text, and MT translates that transcript into the target language. Such cascaded models suffer from error propagation between stages and from high resource and training costs. As a result, researchers have been exploring end-to-end (E2E) models that map speech directly to target-language text. However, to our knowledge, there is no comprehensive review of existing work on E2E ST. The present survey therefore reviews work in this direction, covering the models, evaluation metrics, and datasets used for ST, and discussing open challenges and future research directions. We believe this review will be helpful to researchers working on various applications of ST models.
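To make the cascaded-versus-direct distinction concrete, below is a minimal PyTorch sketch (not taken from the survey; all class, function, and parameter names are illustrative assumptions) contrasting a cascaded pipeline, where an ASR model's transcript is fed into an MT model, with a direct E2E encoder-decoder that maps speech features straight to target-language token logits, a common design in the E2E ST literature.

```python
# Illustrative sketch only: stand-in models, no trained weights or real ASR/MT systems.
import torch
import torch.nn as nn


def cascaded_st(audio, asr_model, mt_model):
    """Cascade: transcribe source speech, then translate the transcript.
    Any recognition error made by `asr_model` propagates into `mt_model`."""
    source_text = asr_model(audio)        # speech -> source-language text
    target_text = mt_model(source_text)   # source-language text -> target-language text
    return target_text


class DirectST(nn.Module):
    """End-to-end ST: one encoder-decoder maps speech features directly to
    target-language token logits, with no intermediate transcript."""

    def __init__(self, feat_dim=80, d_model=256, vocab_size=8000):
        super().__init__()
        self.project = nn.Linear(feat_dim, d_model)  # acoustic feature projection
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, target_tokens):
        memory = self.encoder(self.project(speech_feats))       # encode speech frames
        dec = self.decoder(self.embed(target_tokens), memory)   # attend over speech encoding
        return self.out(dec)                                     # target-language token logits


# Cascade with stand-in components (real systems would plug in trained ASR and MT models).
dummy_asr = lambda audio: "hello world"        # pretend transcript
dummy_mt = lambda text: "bonjour le monde"     # pretend translation
print(cascaded_st(None, dummy_asr, dummy_mt))

# Direct model on a toy utterance: 200 frames of 80-dim filterbank features.
model = DirectST()
speech = torch.randn(1, 200, 80)
prev_tokens = torch.randint(0, 8000, (1, 10))
logits = model(speech, prev_tokens)            # shape: (1, 10, 8000)
```

The key design difference is visible in the interfaces: the cascade exposes an intermediate source-language transcript, while the direct model conditions the decoder on the speech encoding alone, which is why E2E approaches avoid cascaded error propagation but also lose the explicit transcript as a training signal.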
Authors: Nivedita Sethiya, Chandresh Kumar Maurya