Evaluating Abstract Reasoning Capabilities of LLMs Through the NYT Connections Word Game
The paper "Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game" by Samadarshi et al. provides a comprehensive evaluation of LLMs by leveraging the New York Times' Connections word game as a benchmark. The paper aims to measure the abstract reasoning abilities of state-of-the-art LLMs and compare their performance against human players of both novice and expert levels.
Overview and Results
The authors collected 200 distinct Connections games, each presenting a 4x4 grid of 16 words that players must group into four categories of four words based on shared characteristics. They evaluated four prominent LLMs: Google's Gemini 1.5 Pro, Anthropic's Claude 3 Opus, OpenAI's GPT-4-Omni (referred to as GPT-4o), and Meta's Llama 3 70B, and compared their performance to that of 12 novice and 5 expert human players.
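To make the setup concrete, here is a minimal sketch of how a single puzzle might be represented and turned into a prompt; the example words, the `ConnectionsPuzzle` class, and the prompt wording are illustrative assumptions, not the authors' actual data format or evaluation harness.

```python
# Hypothetical puzzle representation; the words, categories, and prompt wording
# below are illustrative assumptions, not taken from the paper's dataset.
from dataclasses import dataclass

@dataclass
class ConnectionsPuzzle:
    words: list[str]               # the 16 words shown to the player, shuffled
    solution: dict[str, set[str]]  # category name -> its 4 member words

puzzle = ConnectionsPuzzle(
    words=["BASS", "FLOUNDER", "SOLE", "PIKE",
           "JACK", "KING", "QUEEN", "ACE",
           "RING", "BELT", "CROWN", "TITLE",
           "STRUGGLE", "FALTER", "WAVER", "STUMBLE"],
    solution={
        "FISH": {"BASS", "FLOUNDER", "SOLE", "PIKE"},
        "PLAYING CARDS": {"JACK", "KING", "QUEEN", "ACE"},
        "THINGS A CHAMPION HOLDS": {"RING", "BELT", "CROWN", "TITLE"},
        "HAVE DIFFICULTY": {"STRUGGLE", "FALTER", "WAVER", "STUMBLE"},
    },  # note how FLOUNDER doubles as a "red herring" for HAVE DIFFICULTY
)

def build_prompt(p: ConnectionsPuzzle) -> str:
    """Render a puzzle as a plain-text prompt asking for four groups of four."""
    return ("Group these 16 words into 4 categories of 4 words each and name "
            "each category:\n" + ", ".join(p.words))

print(build_prompt(puzzle))
```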
The abstract reasoning task proved to be difficult for all LLMs. GPT-4o, the best-performing model, fully solved only 8% of the games. In comparison, novice and expert human players demonstrated superior performance, with experts significantly outperforming even the best LLM.
Taxonomy of Knowledge Types
To understand the underlying challenges of the Connections game, the authors developed a taxonomy of the knowledge types required to solve the puzzles. The taxonomy includes:
- Semantic Knowledge: This encompasses lexical-semantic relations such as synonymy, hypernymy, and polysemy.
- Associative Knowledge: This involves connotative meanings and shared properties that connect words.
- Encyclopedic Knowledge: Knowledge of real-world entities and concepts that goes beyond lexical semantics.
- Multiword Expressions: Idiomatic or compound expressions that must be recognized as units rather than word by word.
- Linguistic Knowledge: Understanding morphology, phonology, or orthography to categorize words.
- Combined Knowledge: Categories that require multiple types of knowledge simultaneously.
Scoring and Evaluation
The paper employs multiple scoring schemas to evaluate performance (a minimal scoring sketch follows the list):
- Unweighted Clustering Score: Counts the number of correctly solved groups, treating all categories equally.
- Weighted Clustering Score: Weights each solved group by the difficulty level the game assigns to its category, from yellow (easiest) to purple (hardest).
- Categorical Reasoning Score: Checks not only that a group is correct but also whether the reasoning the LLM provides matches the intended category.
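The first two scores can be made concrete with a short sketch. It assumes a model's prediction is a set of four-word groups and that difficulty weights follow the game's yellow-to-purple ordering; the specific weight values and function names are assumptions, not the paper's exact definitions.

```python
from collections.abc import Iterable

# Assumed difficulty weights, ordered yellow (easiest) to purple (hardest);
# the actual weighting used in the paper may differ.
DIFFICULTY_WEIGHTS = {"yellow": 1, "green": 2, "blue": 3, "purple": 4}

def unweighted_clustering_score(predicted: Iterable[frozenset],
                                gold: dict[str, frozenset]) -> int:
    """Count how many predicted four-word groups exactly match a gold category."""
    gold_groups = set(gold.values())
    return sum(1 for group in predicted if frozenset(group) in gold_groups)

def weighted_clustering_score(predicted: Iterable[frozenset],
                              gold: dict[str, frozenset],
                              difficulty: dict[str, str]) -> int:
    """Like the unweighted score, but each solved category earns its difficulty weight."""
    predicted_groups = {frozenset(g) for g in predicted}
    return sum(DIFFICULTY_WEIGHTS[difficulty[name]]
               for name, members in gold.items()
               if members in predicted_groups)
```

Under this reading, an unweighted score of 4 corresponds to a fully solved game.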
Comparative Analysis
The results show that novice players achieved an average clustering score of 1.38, slightly better than GPT-4o's 1.17 on the same set of 100 games. Expert players, however, had an average score of 3, significantly outperforming GPT-4o's score of 1.22 on the corresponding 50 games.
LLMs generally performed best on categories requiring Semantic Knowledge and struggled most with Multiword Expressions and Combined Knowledge. Distractors, or "red herrings", added substantial difficulty, causing LLMs to misgroup words because of overlapping associations.
Implications and Future Developments
The paper highlights significant gaps in the abstract reasoning capabilities of even the most advanced LLMs when faced with complex, multi-faceted puzzles like the Connections game. These findings suggest that while LLMs have made substantial progress in various NLP tasks, their ability to reason abstractly, particularly in the presence of distractors and multi-dimensional categories, remains limited.
Future work could integrate retrieval-augmented models that draw on extensive external knowledge bases such as WordNet or specialized dictionaries. Training LLMs on data tailored to the Connections game might also improve performance, as might strategies such as explicit step-by-step reasoning and feedback mechanisms that mirror human gameplay (see the sketch below).
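As an illustration of that last point, here is a rough sketch of step-by-step, feedback-driven play in which the model proposes one group at a time and learns whether each guess was right. The `ask_llm` and `check_guess` callables, the prompt wording, and the mistake limit are all assumptions, not the paper's method.

```python
from collections.abc import Callable

def play_with_feedback(
    words: list[str],
    ask_llm: Callable[[str], list[str]],      # wraps a chat model and parses four words from its reply
    check_guess: Callable[[set[str]], bool],  # oracle: is the guess exactly one gold category?
    max_mistakes: int = 4,
) -> list[set[str]]:
    """Guess one category at a time, feeding correct/incorrect feedback back into the prompt."""
    remaining = list(words)
    solved: list[set[str]] = []
    history: list[str] = []
    mistakes = 0
    while len(remaining) > 4 and mistakes < max_mistakes:
        prompt = (
            "Think step by step. From these words, propose ONE group of four that "
            "share a category, and explain the connection:\n" + ", ".join(remaining)
            + ("\nPrevious guesses and feedback:\n" + "\n".join(history) if history else "")
        )
        guess = set(ask_llm(prompt))
        if check_guess(guess):
            solved.append(guess)
            remaining = [w for w in remaining if w not in guess]
            history.append(f"Correct: {sorted(guess)}")
        else:
            mistakes += 1
            history.append(f"Incorrect: {sorted(guess)}")
    if len(remaining) == 4 and mistakes < max_mistakes:
        solved.append(set(remaining))  # the last four words form the final group
    return solved
```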
Conclusion
The paper establishes the New York Times Connections game as a robust benchmark for evaluating abstract reasoning in LLMs. It reveals that current models, although proficient in some areas of reasoning, are still no match for expert human players and highlights specific areas where these models can be improved. The findings have significant theoretical and practical implications for the development of more adept AI systems capable of nuanced and abstract problem-solving.