modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models (2406.17038v1)
Abstract: We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles that tests few-shot reasoning in AI systems. Solving these puzzles requires inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for LLMs, as they demand compositional generalization and few-shot inductive reasoning. Because modeLing consists solely of new puzzles written specifically for this work, it cannot have appeared in the training data of existing AI systems; this ameliorates the risk of data leakage, a potential confounder in many prior evaluations of reasoning. Evaluating several large open-source LLMs and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability that cannot be attributed merely to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
- Nathan A. Chi
- Teodor Malchev
- Riley Kong
- Ryan A. Chi
- Lucas Huang
- Ethan A. Chi
- R. Thomas McCoy
- Dragomir Radev