Revisiting subword tokenization: A case study on affixal negation in large language models (2404.02421v2)
Published 3 Apr 2024 in cs.CL
Abstract: In this work, we measure the impact of affixal negation on modern English LLMs. In affixal negation, the negated meaning is expressed through a negative morpheme (e.g., *un-*, *im-*, *-less*), which is potentially challenging for LLMs because their tokenizers are often not morphologically plausible. We conduct extensive experiments using LLMs with different subword tokenization methods, yielding several insights into the interaction between tokenization performance and negation sensitivity. Despite some interesting mismatches between tokenization accuracy and negation detection performance, we show that models can, on the whole, reliably recognize the meaning of affixal negation.
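The core premise can be illustrated with a toy greedy longest-match subword tokenizer (a simplified WordPiece-style scheme, not the paper's actual models or vocabularies; the vocabulary below is hypothetical). Depending on which substrings happen to be in the vocabulary, a negative prefix may or may not survive as its own token:

```python
def tokenize(word, vocab):
    """Greedy longest-match segmentation of `word` over `vocab`."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible substring starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # fall back to a single character
            i += 1
    return tokens

# A vocabulary containing "unh" but not "un" splits the word without
# isolating the negative prefix; one containing "im" keeps it intact.
vocab = {"unh", "appy", "im", "poss", "ible"}
print(tokenize("unhappy", vocab))     # ['unh', 'appy']  (prefix lost)
print(tokenize("impossible", vocab))  # ['im', 'poss', 'ible']  (prefix isolated)
```

The first segmentation is morphologically implausible: no token corresponds to the negation morpheme, so any model operating on these tokens must recover the negated meaning indirectly, which is exactly the sensitivity the paper probes.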
Authors: Thinh Hung Truong, Yulia Otmakhova, Karin Verspoor, Trevor Cohn, Timothy Baldwin