Elements of World Knowledge (EWOK): A cognition-inspired framework for evaluating basic world knowledge in language models (2405.09605v1)

Published 15 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The ability to build and leverage world models is essential for a general-purpose AI agent. Testing such capabilities is hard, in part because the building blocks of world models are ill-defined. We present Elements of World Knowledge (EWOK), a framework for evaluating world modeling in LLMs by testing their ability to use knowledge of a concept to match a target text with a plausible/implausible context. EWOK targets specific concepts from multiple knowledge domains known to be vital for world modeling in humans. Domains range from social interactions (help/hinder) to spatial relations (left/right). Both contexts and targets are minimal pairs. Objects, agents, and locations in the items can be flexibly filled in, enabling easy generation of multiple controlled datasets. We then introduce EWOK-CORE-1.0, a dataset of 4,374 items covering 11 world knowledge domains. We evaluate 20 open-weights LLMs (1.3B--70B parameters) across a battery of evaluation paradigms, along with a human norming study comprising 12,480 measurements. The overall performance of all tested models is worse than human performance, with results varying drastically across domains. These data highlight simple cases where even large models fail and present rich avenues for targeted research on LLM world modeling capabilities.
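
The evaluation paradigm the abstract describes lends itself to a simple probabilistic reading: for each minimal-pair item, a model should assign higher probability to the target sentence under its plausible context than under its implausible one. Below is a minimal sketch of that comparison using a HuggingFace causal LM; the model choice, example item, and scoring details are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of an EWOK-style minimal-pair evaluation: score a target
# sentence under two contexts and check that the plausible context wins.
# The model, example item, and whitespace handling are assumptions for
# illustration, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any open-weights causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def target_logprob(context: str, target: str) -> float:
    """Sum of token log-probabilities of `target` given `context`.

    Assumes the context tokens form an exact prefix of the tokenized
    context+target string (true for GPT-2-style BPE with a leading space).
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + " " + target, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    n_ctx = ctx_ids.shape[1]
    # The token at position i is predicted by the logits at position i - 1,
    # so the target tokens full_ids[n_ctx:] are scored by positions
    # n_ctx - 1 through L - 2.
    tgt_ids = full_ids[0, n_ctx:]
    preds = log_probs[0, n_ctx - 1 : -1]
    return preds.gather(1, tgt_ids.unsqueeze(1)).sum().item()

# A hypothetical item in the spatial-relations (left/right) domain.
context_plausible = "The cup is to the left of the plate."
context_implausible = "The cup is to the right of the plate."
target = "So the plate is to the right of the cup."

score_p = target_logprob(context_plausible, target)
score_i = target_logprob(context_implausible, target)
print("prefers plausible context:", score_p > score_i)
```

Averaging this binary outcome over items within a domain yields per-domain accuracies of the kind the abstract describes as varying drastically across models and domains.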
