OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement (2309.10539v1)
Abstract: In this work, we develop and evaluate models for multilingual scientific document similarity measurement. Such models can be used to find related work across languages, helping multilingual researchers find and explore papers more efficiently. We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which contains 74M papers in 103 languages and 778M citation pairs. With OpenMSD, we pretrain science-specialized LLMs and explore different strategies to derive "related" paper pairs for fine-tuning, including mixtures of citation, co-citation, and bibliographic-coupling pairs. To further improve performance on non-English papers, we explore the use of generative LLMs to enrich non-English papers with English summaries, which lets us leverage the models' English capabilities to build better representations for non-English papers. Our best model significantly outperforms strong baselines by 7-16% in mean average precision.
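The abstract names three ways of deriving "related" paper pairs from a citation graph: direct citation, co-citation (two papers cited together by the same paper), and bibliographic coupling (two papers that cite a common reference). As a minimal illustrative sketch, not the paper's actual pipeline, the snippet below shows how such pairs could be derived from a list of citation edges; the toy edge list and function name are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical citation edges: (citing_paper_id, cited_paper_id).
# In OpenMSD these would come from its 778M citation pairs; the tiny
# toy graph here is purely illustrative.
citation_edges = [("A", "C"), ("B", "C"), ("C", "D"), ("C", "E")]

def derive_pairs(edges):
    """Derive citation, co-citation, and bibliographic-coupling pairs."""
    cites = defaultdict(set)      # paper -> set of papers it cites
    cited_by = defaultdict(set)   # paper -> set of papers citing it
    for citing, cited in edges:
        cites[citing].add(cited)
        cited_by[cited].add(citing)

    # Citation pairs: the edges themselves, as unordered pairs.
    citation = {tuple(sorted(e)) for e in edges}

    # Co-citation: two papers cited together by the same citing paper.
    co_citation = set()
    for refs in cites.values():
        co_citation.update(tuple(sorted(p)) for p in combinations(refs, 2))

    # Bibliographic coupling: two papers that cite a common reference.
    coupling = set()
    for citers in cited_by.values():
        coupling.update(tuple(sorted(p)) for p in combinations(citers, 2))

    return citation, co_citation, coupling

citation, co_citation, coupling = derive_pairs(citation_edges)
print(co_citation)  # {('D', 'E')}: both cited by C
print(coupling)     # {('A', 'B')}: both cite C
```

Pairs like these can then serve as positives in a contrastive fine-tuning objective, with unrelated papers acting as negatives.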