Probing Semantic Understanding in LLMs through Multisense Consistency
Introduction
Advances in large language models (LLMs) have led to strong performance on a wide range of natural language understanding (NLU) benchmarks. However, these scores do not settle whether LLMs such as GPT-3.5 truly understand the content they process or merely reproduce patterns found in their training data. Inspired by the philosophical theories of Frege and Wittgenstein concerning sense, reference, and meaning, this paper probes the depth of semantic understanding in LLMs by evaluating their consistency across multiple linguistic presentations (translations and paraphrases) of the same factual content.
Methodology
Our research employs a novel assessment criterion named "multisense consistency," which refers to a model's ability to maintain consistency in its responses when faced with different linguistic presentations of the same semantic content. We explore this by:
- Generating Alternative Senses: Using the model itself to create paraphrases and translations of queries, so that any difference in responses is attributable to the model's own understanding rather than to discrepancies introduced by an external paraphraser or translator.
- Testing across Multiple Datasets: Applying this methodology to a set of purpose-built 'Simple facts' datasets and to existing NLU benchmarks, covering both translated and paraphrased presentations of each task.
- Determining Consistency: Computing consistency as the fraction of semantically equivalent inputs, presented in different linguistic forms, to which the model gives the same response (a minimal sketch of this procedure follows the list).
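The evaluation loop can be summarized as follows. This is a minimal sketch rather than the paper's exact implementation: it assumes a hypothetical `ask(prompt) -> str` wrapper around the evaluated model (e.g., GPT-3.5 accessed through an API client), a translation into a single target language as the alternative sense, and exact string matching after light normalization as the agreement criterion.

```python
from typing import Callable, Iterable


def generate_alternative_sense(ask: Callable[[str], str], query: str,
                               target_language: str) -> str:
    # Use the model itself to produce the alternative sense (here: a translation),
    # so that any inconsistency is attributable to the model rather than to an
    # external translator.
    return ask(
        f"Translate the following question into {target_language}, "
        f"preserving its exact meaning:\n\n{query}"
    )


def multisense_consistency(ask: Callable[[str], str],
                           queries: Iterable[str],
                           target_language: str = "German") -> float:
    # Fraction of queries for which the model gives the same (normalized) answer
    # to the original query and to its model-generated alternative sense:
    #   C = (1/N) * sum_i 1[answer(q_i) == answer(q_i')]
    matches, total = 0, 0
    for query in queries:
        alternative = generate_alternative_sense(ask, query, target_language)
        answer_original = ask(query).strip().lower()
        answer_alternative = ask(alternative).strip().lower()
        matches += int(answer_original == answer_alternative)
        total += 1
    return matches / total if total else 0.0
```

For free-form answers, exact string matching would in practice be replaced by a more tolerant comparison (for instance, having the model judge semantic equivalence), but the structure of the measurement stays the same.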
Results
Across tests involving factual data (such as Simple facts about chemistry, arithmetic, geography, and historical events) as well as more complex NLU tasks (including paraphrase identification and logical inference), we found notable inconsistencies in GPT-3.5's responses. Although the model often performed well in any single language or form, its answers changed when the same question was posed in a different form, indicating that its apparent understanding is strongly form-dependent. Further analysis supported these findings, demonstrating that:
- Paraphrases and Translations: Even though the model generated high-quality translations and paraphrases, inconsistencies persisted, suggesting a deeper problem with sense-making rather than with surface-level language generation.
- Task-Dependent Inconsistencies: A follow-up disentangling analysis revealed that the inconsistencies partly stem from the model understanding and executing the same task differently across languages (a hedged sketch of one such analysis follows).
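One way to make this disentangling concrete is to report accuracy under each linguistic form alongside cross-form agreement: an accuracy gap combined with low agreement points to form-dependent task execution rather than random answer noise. The sketch below assumes gold labels, paired (original, alternative-sense) queries, and the same hypothetical `ask` wrapper as above; it illustrates the idea rather than reproducing the paper's exact analysis.

```python
from typing import Callable, Sequence, Tuple


def accuracy_and_agreement(ask: Callable[[str], str],
                           query_pairs: Sequence[Tuple[str, str]],
                           gold_answers: Sequence[str]) -> Tuple[float, float, float]:
    # For paired (original, alternative-sense) queries with gold answers, report
    # accuracy under each form plus cross-form agreement. A large accuracy gap
    # together with low agreement suggests the task itself is understood and
    # executed differently across forms.
    correct_original = correct_alternative = agree = 0
    for (query_original, query_alternative), gold in zip(query_pairs, gold_answers):
        answer_original = ask(query_original).strip().lower()
        answer_alternative = ask(query_alternative).strip().lower()
        gold = gold.strip().lower()
        correct_original += int(answer_original == gold)
        correct_alternative += int(answer_alternative == gold)
        agree += int(answer_original == answer_alternative)
    n = len(query_pairs)
    return correct_original / n, correct_alternative / n, agree / n
```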
Discussion
The observed lack of multisense consistency highlights the limitations of current LLMs in achieving a genuinely human-like grasp of semantics. Despite superficially proficient language generation, these models may not fully disentangle meaning from linguistic form, which calls into question their use in applications that require deep semantic understanding or precise factual recall. The implications of our findings extend both to academic perspectives on LLMs as models of cognition and to practical considerations in deploying them for multilingual tasks where semantic integrity is crucial.
Concluding Remarks
This paper illuminates the semantic shortcomings of current state-of-the-art LLMs, highlighting the importance of developing new methodologies and training approaches that better encapsulate the essence of human-like language understanding. Future work should focus on enhancing the robustness of LLMs to variable linguistic presentations and further refining the paradigms used to test for genuine semantic competence.