
Understanding Translationese in Cross-Lingual Summarization (2212.07220v2)

Published 14 Dec 2022 in cs.CL and cs.AI

Abstract: Given a document in a source language, cross-lingual summarization (CLS) aims to generate a concise summary in a different target language. Unlike in monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS data, existing datasets typically involve translation in their creation. However, translated text differs from text originally written in that language, a phenomenon known as translationese. In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese. We then systematically investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. Specifically, we find that (1) translationese in the documents or summaries of test sets might lead to a discrepancy between human judgment and automatic evaluation; (2) translationese in training sets would harm model performance in real-world applications; (3) although machine-translated documents involve translationese, they are very useful for building CLS systems for low-resource languages under specific training strategies. Lastly, we give suggestions for future CLS research, including dataset and model development. We hope that our work draws researchers' attention to the phenomenon of translationese in CLS so that it is taken into account in the future.

Authors (7)
  1. Jiaan Wang (35 papers)
  2. Fandong Meng (174 papers)
  3. Yunlong Liang (33 papers)
  4. Tingyi Zhang (4 papers)
  5. Jiarong Xu (24 papers)
  6. Zhixu Li (43 papers)
  7. Jie Zhou (687 papers)