
Since the Scientific Literature Is Multilingual, Our Models Should Be Too

Published 27 Mar 2024 in cs.CL (arXiv:2403.18251v1)

Abstract: English has long been assumed to be the lingua franca of scientific research, and this assumption is reflected in NLP research on scientific document representation. In this position piece, we quantitatively show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity. We provide evidence that text-based models fail to create meaningful representations for non-English papers, and we highlight the negative user-facing impacts of applying English-only models indiscriminately across a multilingual domain. We end with suggestions for the NLP community on how to improve performance on non-English documents.
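The abstract's claim that text-based models struggle with non-English papers can be illustrated with a minimal, stdlib-only sketch (not from the paper; the example sentences and bag-of-words representation are illustrative assumptions). A representation keyed to surface tokens sees an English abstract and its Spanish translation as sharing essentially no vocabulary, so it assigns them unrelated vectors even though they describe the same work:

```python
from collections import Counter
import math

def bow_vector(text):
    """Lowercase bag-of-words counts, a toy stand-in for a lexical text representation."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical example texts: an English sentence, an English paraphrase,
# and a Spanish translation of the same sentence.
en = "we study multilingual representations of scientific documents"
en_paraphrase = "we study representations of multilingual scientific documents"
es = "estudiamos representaciones multilingües de documentos científicos"

print(cosine(bow_vector(en), bow_vector(en_paraphrase)))  # high: same tokens
print(cosine(bow_vector(en), bow_vector(es)))             # zero: no token overlap
```

Real models use subword embeddings rather than raw token counts, but an English-trained vocabulary fragments non-English text in an analogous way; this is one intuition for why the paper argues that citation-informed or explicitly multilingual representations are needed.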

