Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? (2304.14796v1)
Abstract: Dense vector representations of textual data are crucial in modern NLP. Word and sentence embeddings estimated from raw text are key to achieving state-of-the-art results in tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging, both because of computational requirements and because of a lack of appropriate data. Instead, most approaches fall back on computing document embeddings from sentence representations. Although architectures and models exist that encode documents as a whole, they are generally limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods for producing document-level representations from sentences, based on the pre-trained multilingual models LASER, LaBSE, and Sentence-BERT. We compare input-token truncation, sentence averaging, simple windowing, and, in some cases, new augmented and learnable approaches, on three multi- and cross-lingual tasks in eight languages belonging to three language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average provides a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
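The sentence-combination strategies the abstract compares (plain averaging vs. windowed averaging) can be illustrated with a minimal NumPy sketch. The function names, the window size, and the toy random vectors below are illustrative assumptions, not the authors' implementation, which starts from real LASER/LaBSE/Sentence-BERT sentence vectors:

```python
import numpy as np

def average_pool(sentence_embs: np.ndarray) -> np.ndarray:
    """Document embedding as the plain mean of its sentence embeddings."""
    return sentence_embs.mean(axis=0)

def window_pool(sentence_embs: np.ndarray, window: int = 3) -> np.ndarray:
    """Average overlapping windows of consecutive sentences, then average
    the window vectors -- one simple way to keep some local context.
    (Hypothetical scheme for illustration; the paper's combinations may differ.)"""
    n = len(sentence_embs)
    if n <= window:
        return average_pool(sentence_embs)
    windows = np.stack([sentence_embs[i:i + window].mean(axis=0)
                        for i in range(n - window + 1)])
    return windows.mean(axis=0)

# Toy stand-ins for per-sentence vectors from a multilingual encoder:
rng = np.random.default_rng(0)
sents = rng.normal(size=(5, 4))   # 5 sentences, 4-dim embeddings

doc_avg = average_pool(sents)     # strong baseline for classification tasks
doc_win = window_pool(sents, 3)   # windowed combination, e.g. for semantic tasks
```

Both functions map a `(num_sentences, dim)` matrix to a single `dim`-sized document vector, so the document representation stays compatible with whatever sentence encoder produced the inputs.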
- Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.
- Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. CoRR, abs/2004.05150.
- Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
- Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
- Christian Buck and Philipp Koehn. 2016a. Findings of the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 554–563, Berlin, Germany. Association for Computational Linguistics.
- Christian Buck and Philipp Koehn. 2016b. Quick and reliable document alignment via TF/IDF-weighted cosine distance. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 672–678, Berlin, Germany. Association for Computational Linguistics.
- Daniel Cer, Yinfei Yang, Sheng-yi Kong, et al. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174, Brussels, Belgium. Association for Computational Linguistics.
- Kathryn Annette Chapman and Günter Neumann. 2020. Automatic ICD code classification with label description attention mechanism. In IberLEF@SEPLN, volume 2664 of CEUR Workshop Proceedings, pages 477–488. CEUR-WS.org.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark. Association for Computational Linguistics.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.
- Kevin Heffernan, Onur Çelebi, and Holger Schwenk. 2022. Bitext mining using distilled sentence representations for low-resource languages. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2101–2112, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. Advances in Neural Information Processing Systems, 28.
- Jey Han Lau and Timothy Baldwin. 2016. An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 78–86, Berlin, Germany. Association for Computational Linguistics.
- Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1188–1196, Beijing, China. PMLR.
- Weight attention layer-based document classification incorporating information gain. Expert Systems, 39(1):e12833.
- Wei Li and Brian Kan-Wing Mak. 2020. Transformer based multilingual document embedding model. CoRR, abs/2008.08567.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26:3111–3119.
- Antonio Miranda-Escalada, Eulàlia Farré, and Martin Krallinger. 2020. Named entity recognition, concept normalization and clinical coding: Overview of the CANTEMIST track for cancer text mining in Spanish, corpus, guidelines, methods and results. In IberLEF@SEPLN, volume 2664 of CEUR Workshop Proceedings, pages 303–323. CEUR-WS.org.
- Non-technical Summaries (NTS) of Animal Experiments Indexed with ICD-10 Codes (Version 1.0).
- Hyunji Hayley Park, Yogarshi Vyas, and Kashif Shah. 2022. Efficient classification of long documents using transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 702–709, Dublin, Ireland. Association for Computational Linguistics.
- Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525, Online. Association for Computational Linguistics.
- Manuel Romero. 2022. Spanish LongFormer. https://huggingface.co/mrm8488/longformer-base-4096-spanish.
- Markus Sagen. 2021. Large-context question answering with cross-lingual transfer. Master’s thesis, Uppsala University, Department of Information Technology.
- Holger Schwenk and Matthijs Douze. 2017. Learning joint multilingual sentence representations with neural machine translation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 157–167, Vancouver, Canada. Association for Computational Linguistics.
- Holger Schwenk and Xian Li. 2018. A corpus for multilingual document classification in eight languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Liu Shen. 2021. Chinese LongFormer. https://huggingface.co/schen/longformer-chinese-base-4096.
- Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.
- Brian Thompson and Philipp Koehn. 2020. Exploiting sentence order in document alignment. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5997–6007, Online. Association for Computational Linguistics.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.
- David Vose. 2008. Risk analysis: a quantitative guide. John Wiley & Sons.
- Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. CoRR, abs/2006.04768.
- Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big Bird: Transformers for longer sequences. In Advances in Neural Information Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc.
- Sonal Sannigrahi
- Josef van Genabith
- Cristina España-Bonet