The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations (2402.14616v1)
Published 22 Feb 2024 in cs.CL
Abstract: When deriving contextualized word representations from language models, a decision needs to be made on how to obtain a single representation for out-of-vocabulary (OOV) words that are segmented into subwords. What is the best way to represent these words with a single vector, and are such representations of worse quality than those of in-vocabulary words? We carry out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words. Our analysis reveals, among other findings, that the quality of representations of split words is often, but not always, worse than that of the embeddings of known words. Their similarity values, however, must be interpreted with caution.
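The abstract leaves open how a split word is mapped to a single vector. As a rough illustration of the general setup (not the authors' exact pipeline), the sketch below uses the Hugging Face transformers library to mean-pool the hidden states of a word's subword tokens; the model name (bert-base-uncased), the layer choice, and mean pooling itself are assumptions made only for this example.

```python
# Minimal sketch (an assumption, not the paper's exact method): build one
# contextualized vector for a word that the tokenizer splits into subwords,
# by mean-pooling the hidden states of its subword tokens.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumed model; the paper compares several
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def word_vector(sentence: str, target: str, layer: int = -1) -> torch.Tensor:
    """Return one contextualized vector for `target` in `sentence`,
    averaging the hidden states of its subword tokens."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]             # (seq_len, 2) character spans
    start = sentence.lower().index(target.lower())     # naive target lookup
    end = start + len(target)
    with torch.no_grad():
        hidden = model(**enc).hidden_states[layer][0]  # (seq_len, hidden_dim)
    # Keep every subword token whose character span overlaps the target word;
    # special tokens have empty (0, 0) spans and are skipped by `e > s`.
    idx = [i for i, (s, e) in enumerate(offsets.tolist())
           if s < end and e > start and e > s]
    return hidden[idx].mean(dim=0)                     # average over subword pieces


vec = word_vector("The anteater foraged at night.", "anteater")
print(vec.shape)  # torch.Size([768]) for bert-base-uncased
```

Comparing such pooled vectors (e.g., with cosine similarity) against vectors of unsplit, in-vocabulary words is the kind of intrinsic, similarity-based evaluation the abstract describes; other aggregation strategies, such as using only the first subword or a different layer, are natural alternatives.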
Authors: Aina Garí Soler, Matthieu Labeau, Chloé Clavel