
From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency (2404.12145v1)

Published 18 Apr 2024 in cs.CL and cs.AI

Abstract: The staggering pace with which the capabilities of LLMs are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what "understanding" means for an LLM and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes - inspired by Fregean senses - of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model's multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this aspect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.

Probing Semantic Understanding in LLMs through Multisense Consistency

Introduction

Advancements in LLMs have significantly enhanced their performance on various natural language understanding (NLU) benchmarks. However, strong benchmark scores do not by themselves establish whether LLMs such as GPT-3.5 truly understand the content they process or merely reproduce patterns found in their training data. Inspired by the philosophical theories of Frege and Wittgenstein concerning sense, reference, and meaning, our paper probes the depth of semantic understanding in LLMs by evaluating whether their responses remain consistent across multiple linguistic presentations (translations and paraphrases) of the same content.

Methodology

Our research employs a novel assessment criterion named "multisense consistency": a model's ability to give the same answer when faced with different linguistic presentations of the same semantic content. We explore this by:

  1. Generating Alternative Senses: Using the model itself to create paraphrases and translations of queries, ensuring that differences in responses are attributable to the model’s understanding rather than to the quality of an external paraphraser or translator.
  2. Testing across Multiple Datasets: Applying this methodology to a set of purpose-built 'Simple facts' datasets and to established NLU benchmarks, covering both translation and paraphrase variants.
  3. Determining Consistency: Computing consistency as the fraction of semantically equivalent input pairs, presented in different linguistic forms, for which the model produces the same answer (a minimal sketch follows this list).
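
The paper does not include a reference implementation here, so the following is a minimal sketch of the evaluation loop described above. The `query_model` helper, the prompts, and the target language are illustrative assumptions rather than the authors' code; in practice `query_model` would wrap an API call to the model under test (GPT-3.5 in the paper).

```python
from typing import Callable, List, Tuple


def make_alternative_sense(query_model: Callable[[str], str],
                           prompt: str, target_language: str) -> str:
    """Use the model itself to produce the alternative sense (here a
    translation), so that any later inconsistency reflects the model's own
    understanding rather than an external translator."""
    instruction = (
        f"Translate the following question into {target_language}, "
        f"preserving its exact meaning:\n{prompt}"
    )
    return query_model(instruction)


def multisense_consistency(query_model: Callable[[str], str],
                           item_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of items for which the model gives the same (normalized)
    answer to two presentations of the same underlying question."""
    agreements = 0
    for original, alternative in item_pairs:
        answer_a = query_model(original).strip().lower()
        answer_b = query_model(alternative).strip().lower()
        agreements += int(answer_a == answer_b)
    return agreements / len(item_pairs)


if __name__ == "__main__":
    # Toy stand-in for an API call to the model under evaluation.
    def query_model(prompt: str) -> str:
        return "156"  # placeholder answer

    question = "What is 12 times 13? Answer with a number only."
    pairs = [(question, make_alternative_sense(query_model, question, "German"))]
    print(f"Multisense consistency: {multisense_consistency(query_model, pairs):.2f}")
```

Note that consistency, unlike accuracy, requires no gold answer: a model can be consistent yet wrong, so the measure complements rather than replaces standard benchmark performance.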

Results

Across tests involving factual data (Simple facts about chemistry, arithmetic, geography, and historical events) as well as more complex NLU tasks (including paraphrase identification and logical inference), we observed notable inconsistencies in GPT-3.5's responses. Although the model often achieved high performance in individual languages or formulations, its answers changed when the same question was posed in a different form, indicating that its task understanding is form-dependent. Follow-up analyses support this interpretation, demonstrating that:

  • Paraphrases and Translations: Even when the translations and paraphrases generated by the model were of high quality, inconsistencies persisted, pointing to a problem of sense-making rather than of surface-level language generation (a quality check of this kind is sketched after this list).
  • Task-Dependent Inconsistencies: Further analysis showed that the inconsistencies partly stem from the model understanding and executing the task differently across languages.
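
To rule out translation quality as the cause of inconsistency, one natural check is to score the model-generated senses against human reference translations with an automatic MT metric; the paper's reference list includes BLEU and COMET, so a metric of this kind is plausible here. The snippet below is a hypothetical sketch using sacrebleu's BLEU with made-up example strings; it is not taken from the paper.

```python
# Hypothetical quality check on model-generated translations, assuming human
# reference translations are available for a sample of items. BLEU is used
# purely for illustration; stronger learned metrics such as COMET exist.
from sacrebleu import corpus_bleu

model_translations = [
    "Was ist 12 mal 13? Antworte nur mit einer Zahl.",
]
reference_translations = [
    "Was ist 12 mal 13? Bitte antworte nur mit einer Zahl.",
]

bleu = corpus_bleu(model_translations, [reference_translations])
print(f"BLEU of model-generated senses: {bleu.score:.1f}")
```

If the generated senses score highly yet the model's answers still diverge across senses, the inconsistency cannot be attributed to poor translation, which is the pattern the paper reports.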

Discussion

The observed lack of multisense consistency highlights the limitations of current LLMs in achieving a true, human-like grasp of semantics. Despite superficially proficient language generation, these models may not fully disentangle meaning from linguistic form, which calls into question their use in applications that require deep semantic understanding or precise factual recall. The implications extend both to academic debates about LLMs as models of human cognition and to practical deployments in multilingual settings where semantic integrity is crucial.

Concluding Remarks

This paper highlights the semantic shortcomings of current state-of-the-art LLMs and underscores the need for methodologies and training approaches that better capture human-like language understanding. Future work should focus on making LLMs more robust to variation in linguistic presentation and on refining the paradigms used to test for genuine semantic competence.

References (120)
  1. Can language models encode perceptual structure without grounding? a case study in color. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 109–132, Association for Computational Linguistics, Online.
  2. Blackbox meets blackbox: Representational similarity & stability analysis of neural language models and brains. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 191–203, Association for Computational Linguistics, Florence, Italy.
  3. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6168–6173, Association for Computational Linguistics, Florence, Italy.
  4. Syntactic surprisal from neural models predicts, but underestimates, human processing difficulty from syntactic ambiguities. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 301–313, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid).
  5. Asai, Akari and Hannaneh Hajishirzi. 2020. Logic-guided data augmentation and regularization for consistent question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5642–5650, Association for Computational Linguistics, Online.
  6. Au, Terry K. and Mariana Glusman. 1990. The principle of mutual exclusivity in word learning: To honor or not to honor? Child Development, 61(5):1474–1490.
  7. Badre, David and Derek Evan Nee. 2018. Frontal cortex and the hierarchical control of behavior. Trends in Cognitive Sciences, 22(2):170–188.
  8. The belebele benchmark: a parallel reading comprehension dataset in 122 language variants. ArXiv Preprint, arXiv:2308.16884.
  9. Baroni, Marco. 2023. On the proper role of linguistically oriented deep net analysis in linguistic theorising. In Algebraic Structures in Natural Language, pages 1–16, CRC Press, Boca Raton, FL.
  10. Barsalou, Lawrence W. 2005. Abstraction as dynamic interpretation in perceptual symbol systems. In Building object categories in developmental time, pages 389–431, Lawrence Erlbaum Associates.
  11. Worldsense: A synthetic benchmark for grounded reasoning in large language models. ArXiv Preprint, arXiv:2311.15930.
  12. Bender, Emily M. and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Association for Computational Linguistics, Online.
  13. The reversal curse: LLMs trained on ”A is B” fail to learn ”B is A”. ArXiv Preprint, arXiv:2309.12288.
  14. Biever, Celeste. 2023. Chatgpt broke the turing test — the race is on for new ways to assess ai. Nature (News Feature), 619:686–689.
  15. Experience grounds language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8718–8735, Association for Computational Linguistics, Online.
  16. Zero-shot approach to overcome perturbation sensitivity of prompts. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5698–5711, Association for Computational Linguistics, Toronto, Canada.
  17. Chang, Tyler A. and Benjamin K. Bergen. 2023. Language model behavior: A comprehensive survey. ArXiv Preprint, arXiv:2303.11504.
  18. Can NLI models verify QA systems’ predictions? In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3841–3854, Association for Computational Linguistics, Punta Cana, Dominican Republic.
  19. Christiansen, Morten H and Nick Chater. 1999. Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23(2):157–205.
  20. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Association for Computational Linguistics, Brussels, Belgium.
  21. Large language models demonstrate the potential of statistical learning in language. Cognitive Science, 47(3):e13256.
  22. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Association for Computational Linguistics, Dublin, Ireland.
  23. Generalising to German plural noun classes, from the perspective of a recurrent neural network. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 94–108, Association for Computational Linguistics, Online.
  24. Shortcut learning of large language models in natural language understanding. ArXiv Preprint, arXiv:2208.11857.
  25. Dupoux, Emmanuel. 2018. Cognitive science in the era of artificial intelligence: A roadmap for reverse-engineering the infant language-learner. Cognition, 173:43–59.
  26. Measuring and Improving Consistency in Pretrained Language Models. Transactions of the Association for Computational Linguistics, 9:1012–1031.
  27. Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science, 14(2):179–211.
  28. Francis, Wendy S. 2009. Bilingual semantic and conceptual representation. In Handbook of Bilingualism, pages 251–267, Oxford University Press.
  29. Frank, Stefan and Rens Bod. 2011. Insensitivity of the human sentence-processing system to hierarchical structure. Psychological Science, 22(6):829–834.
  30. Frege, Gottlob. 1892. Über Sinn und Bedeutung [”On sense and reference”]. Zeitschrift für Philosophie und philosophische Kritik, 100(1):25–50.
  31. Futrell, Richard and Roger Levy. 2017. Noisy-context surprisal as a human sentence processing cost model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 688–698, Association for Computational Linguistics, Valencia, Spain.
  32. Gentner, Dedre and Christian Hoyos. 2017. Analogy and abstraction. Topics in Cognitive Science, 9(3):672–693.
  33. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1161–1166, Association for Computational Linguistics, Hong Kong, China.
  34. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic.
  35. Under the hood: Using diagnostic classifiers to investigate and improve how language models track agreement information. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 240–248, Association for Computational Linguistics, Brussels, Belgium.
  36. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1195–1205, Association for Computational Linguistics, New Orleans, Louisiana.
  37. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, Association for Computational Linguistics, New Orleans, Louisiana.
  38. The effect of scaling, retrieval augmentation and form on the factual consistency of language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5457–5476, Association for Computational Linguistics, Singapore.
  39. Heineman, David. 2023. Rethinking reasoning evaluation with theories of intelligence. https://davidheineman.com/reasoning-evaluation.pdf.
  40. Measuring massive multitask language understanding. In International Conference on Learning Representations (ICLR).
  41. The emergence of competing modules in bilingualism. Trends in Cognitive Sciences, 9(5):220–225.
  42. Understanding by understanding not: Modeling negation in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1301–1312, Association for Computational Linguistics, Online.
  43. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4411–4421, PMLR.
  44. Surprisal does not explain syntactic disambiguation difficulty: evidence from a large-scale benchmark. PsyArXiv Preprint, z38u6.
  45. Hupkes, Dieuwke. 2020. Hierarchy and interpretability in neural models of language processing. Ph.D. thesis, University of Amsterdam.
  46. State-of-the-art generalisation research in NLP: A taxonomy and review. ArXiv Preprint, arXiv:2210.03050.
  47. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43.
  48. BECEL: Benchmark for consistency evaluation of language models. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3680–3696, International Committee on Computational Linguistics, Gyeongju, Republic of Korea.
  49. Jang, Myeongjun and Thomas Lukasiewicz. 2023. Consistency analysis of ChatGPT. ArXiv Preprint, arXiv:2303.06273.
  50. Johnson-Laird, Philip N. and Marco Ragni. 2023. What should replace the Turing Test? Intelligent Computing, 2:1–2.
  51. Language models use monotonicity to assess NPI licensing. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4958–4969, Association for Computational Linguistics, Online.
  52. Kassner, Nora and Hinrich Schütze. 2020. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7811–7818, Association for Computational Linguistics, Online.
  53. BeliefBank: Adding memory to a pre-trained language model for a systematic notion of belief. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8849–8861, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic.
  54. The defeat of the Winograd Schema Challenge. ArXiv Preprint, arXiv:2201.02387.
  55. Kroll, Judith F and Annette MB De Groot. 1997. Lexical and conceptual memory in the bilingual: Mapping form to meaning in two languages. In Tutorials in bilingualism. Lawrence Erlbaum, Mahwah NJ, pages 169–199.
  56. Can transformers process recursive nested constructions, like humans? In Proceedings of the 29th International Conference on Computational Linguistics, pages 3226–3232, International Committee on Computational Linguistics, Gyeongju, Republic of Korea.
  57. Mechanisms for handling nested dependencies in neural-network language models and humans. Cognition, 213:104699. Special Issue in Honour of Jacques Mehler, Cognition’s founding editor.
  58. The emergence of number and syntax units in LSTM language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 11–20, Association for Computational Linguistics, Minneapolis, Minnesota.
  59. A logic-driven framework for consistency of neural models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3924–3935, Association for Computational Linguistics, Hong Kong, China.
  60. Holistic evaluation of language models. Transactions on Machine Learning Research. Featured Certification, Expert Certification.
  61. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6008–6018, Association for Computational Linguistics, Online.
  62. Lin, Chin-Yew. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Association for Computational Linguistics, Barcelona, Spain.
  63. Linzen, Tal and Marco Baroni. 2021. Syntactic structure from deep learning. Annual Review of Linguistics, 7:195–212.
  64. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
  65. Human replay spontaneously reorganizes experience. Cell, 178(3):640–652.e14.
  66. Dissociating language and thought in large language models. ArXiv Preprint, arXiv:2301.06627.
  67. Malouf, Robert. 2017. Abstractive morphological learning with a recurrent neural network. Morphology, 27:431–458.
  68. Mandelkern, Matthew and Tal Linzen. 2023. Do language models refer? ArXiv Preprint, arXiv:2308.05576.
  69. Embers of autoregression: Understanding large language models through the problem they are trained to solve. ArXiv Preprint, arXiv:2309.13638.
  70. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Association for Computational Linguistics, Florence, Italy.
  71. Sources of hallucination by large language models on inference tasks. ArXiv Preprint, arXiv:2305.14552.
  72. Hippocampal representation of related and opposing memories develop within distinct, hierarchically organized neural schemas. Neuron, 83(1):202–215.
  73. Minervini, Pasquale and Sebastian Riedel. 2018. Adversarially regularising neural NLI models to integrate logical background knowledge. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 65–74, Association for Computational Linguistics, Brussels, Belgium.
  74. Enhancing self-consistency and performance of pre-trained language models through natural language inference. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1754–1768, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates.
  75. Mitchell, Melanie and David C. Krakauer. 2023. The debate over understanding in AI’s large language models. Proceedings of the National Academy of Sciences, 120(13):e2215907120.
  76. State of what art? a call for multi-prompt llm evaluation. ArXiv Preprint, arXiv:2401.00595.
  77. Mollo, Dimitri Coelho and Raphaël Millière. 2023. The vector grounding problem. ArXiv Preprint, arXiv:2304.01481.
  78. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Association for Computational Linguistics, Online.
  79. Niven, Timothy and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Association for Computational Linguistics, Florence, Italy.
  80. Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses. In Proceedings of the 3rd Workshop on Natural Language Generation, Evaluation and Metrics (GEM at EMNLP), Association for Computational Linguistics, Singapore.
  81. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, volume 35, pages 27730–27744, Curran Associates, Inc.
  82. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA.
  83. Patel, Roma and Ellie Pavlick. 2022. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations.
  84. Pavlick, Ellie. 2023. Symbols and grounding in large language models. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 381(2251):20220041.
  85. Piantadosi, Steven. 2023. Modern language models refute Chomsky’s approach to language. Lingbuzz Preprint, 7180.
  86. Piantadosi, Steven and Felix Hill. 2022. Meaning without reference in large language models. In NeurIPS 2022 Workshop on Neuro Causal and Symbolic AI (nCSI).
  87. How can the [mask] know? the sources and limitations of knowledge in bert. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
  88. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Association for Computational Linguistics, Online.
  89. Cross-lingual consistency of factual knowledge in multilingual language models. ArXiv Preprint, arXiv:2310.10478.
  90. AI and the everything in the whole wide world benchmark. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual.
  91. Machine reading, fast and slow: When do models “understand” language? In Proceedings of the 29th International Conference on Computational Linguistics, pages 78–93, International Committee on Computational Linguistics, Gyeongju, Republic of Korea.
  92. COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT), pages 578–585, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid).
  93. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In Proceedings of the 2011 AAAI Spring Symposium Series.
  94. XTREME-R: Towards more challenging and nuanced multilingual evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10215–10245, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic.
  95. Ryu, Soo Hyun and Richard Lewis. 2021. Accounting for agreement phenomena in sentence comprehension with transformer language models: Effects of similarity-based interference on surprisal and attention. In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 61–71, Association for Computational Linguistics, Online.
  96. PECO: Examining single sentence label leakage in natural language inference datasets through progressive evaluation of cluster outliers. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 3061–3074, Association for Computational Linguistics, Dubrovnik, Croatia.
  97. Sen, Priyanka and Amir Saffari. 2020. What do models learn from question answering datasets? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2429–2438, Association for Computational Linguistics, Online.
  98. The validity of evaluation results: Assessing concurrence across compositionality benchmarks. In Proceedings of CoNLL 2023.
  99. Tafjord, Oyvind and Peter Clark. 2021. General-purpose question-answering with Macaw. ArXiv Preprint, arXiv:2109.02593.
  100. How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022):1279–1285.
  101. Timkey, William and Tal Linzen. 2023. A language model with limited memory capacity captures interference in human sentence processing. ArXiv Preprint, arXiv:2310.16142.
  102. Llama: Open and efficient foundation language models. ArXiv Preprint, arXiv:2302.13971.
  103. Assessing incrementality in sequence-to-sequence models. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 209–217, Association for Computational Linguistics, Florence, Italy.
  104. Neural representation of abstract task structure during generalization. eLife, 10:e63226.
  105. Van Schijndel, Marten and Tal Linzen. 2018. Modeling garden path effects without explicit hierarchical syntax. In CogSci.
  106. Van Schijndel, Marten and Tal Linzen. 2021. Single-stage prediction models do not explain the magnitude of syntactic disambiguation difficulty. Cognitive Science, 45(6):e12988.
  107. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS), Curran Associates Inc., Red Hook, NY, USA.
  108. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Association for Computational Linguistics, Brussels, Belgium.
  109. Adversarial GLUE: A multi-task benchmark for robustness evaluation of language models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  110. What if we simply swap the two text fragments? A straightforward yet effective way to test the robustness of methods to confounding signals in nature language inference tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7136–7143.
  111. Are large language models really robust to word-level perturbations? ArXiv Preprint, arXiv:2309.11166.
  112. Finding skill neurons in pre-trained transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11132–11152, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates.
  113. Warstadt, Alex and Samuel R. Bowman. 2022. What artificial neural networks can tell us about human language acquisition. ArXiv Preprint, arXiv:2208.07998.
  114. Call for papers – The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. ArXiv Preprint, arXiv:2301.11796.
  115. Mind the instructions: a holistic evaluation of consistency and interactions in prompt-based learning. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 294–313, Association for Computational Linguistics, Singapore.
  116. On the predictive power of neural language models for human real-time comprehension behavior. ArXiv Preprint, arXiv:2006.01912.
  117. Wittgenstein, Ludwig. 1953. Philosophical investigations. Philosophische Untersuchungen. Macmillan.
  118. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Association for Computational Linguistics, Hong Kong, China.
  119. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Association for Computational Linguistics, Online.
  120. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1298–1308, Association for Computational Linguistics, Minneapolis, Minnesota.
Authors (3)
  1. Xenia Ohmer
  2. Elia Bruni
  3. Dieuwke Hupkes