Simultaneous Interpretation Corpus Construction by Large Language Models in Distant Language Pair (2404.12299v1)

Published 18 Apr 2024 in cs.CL, cs.AI, cs.LG, cs.SD, and eess.AS

Abstract: In Simultaneous Machine Translation (SiMT) systems, training with a simultaneous interpretation (SI) corpus is an effective way to build high-quality, low-latency systems. However, curating such a corpus is very challenging due to limitations in the abilities of annotators, so existing SI corpora are scarce. We therefore propose a method that uses LLMs to convert existing speech translation corpora into interpretation-style data that maintains the original word order while preserving the entire source content, yielding the LLM-SI-Corpus. We demonstrate that fine-tuning SiMT models in text-to-text and speech-to-text settings with the LLM-SI-Corpus reduces latency while maintaining the same level of quality as models trained on offline datasets. The LLM-SI-Corpus is available at https://github.com/yusuke1997/LLM-SI-Corpus.
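
To make the conversion step concrete, below is a minimal sketch of how one might prompt an LLM to rewrite an offline translation pair into interpretation-style target text. This is an illustration rather than the authors' pipeline: the prompt wording, the `to_si_style` helper, and the model choice are assumptions; the actual prompts and data are in the repository linked above. The sketch assumes the OpenAI Python SDK with an API key in the OPENAI_API_KEY environment variable.

```python
# Minimal sketch (not the authors' pipeline): rewrite an offline
# translation pair into SI-style target text with an LLM, keeping the
# source word order and full source content.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a simultaneous interpreter. Rewrite the target translation so "
    "that it follows the source word order as closely as possible while "
    "preserving all of the source content.\n"
    "Source ({src_lang}): {src}\n"
    "Offline translation ({tgt_lang}): {tgt}\n"
    "Interpretation-style translation ({tgt_lang}):"
)

def to_si_style(src: str, tgt: str, src_lang: str = "English",
                tgt_lang: str = "Japanese", model: str = "gpt-4") -> str:
    """Ask the LLM for an interpretation-style rewrite of one sentence pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # keep rewrites stable for corpus construction
        messages=[{
            "role": "user",
            "content": PROMPT.format(src=src, tgt=tgt,
                                     src_lang=src_lang, tgt_lang=tgt_lang),
        }],
    )
    return response.choices[0].message.content.strip()

# Usage: map to_si_style over an existing speech translation corpus
# (e.g. MuST-C transcript/translation pairs) to obtain SI-style
# references, then fine-tune a SiMT model on the converted data.
```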

Authors (4)
  1. Yusuke Sakai (36 papers)
  2. Mana Makinae (4 papers)
  3. Hidetaka Kamigaito (62 papers)
  4. Taro Watanabe (76 papers)