Evaluating Abstract Reasoning Capabilities of LLMs Through the NYT Connections Word Game
The paper "Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game" by Samadarshi et al. provides a comprehensive evaluation of LLMs by leveraging the New York Times' Connections word game as a benchmark. The paper aims to measure the abstract reasoning abilities of state-of-the-art LLMs and compare their performance against human players of both novice and expert levels.
Overview and Results
The authors collected 200 distinct Connections games, each presenting a 4x4 grid of 16 words that players must group into four categories of four words based on shared characteristics. They evaluated four prominent LLMs: Google's Gemini 1.5 Pro, Anthropic's Claude 3 Opus, OpenAI's GPT-4-Omni (referred to as GPT-4o), and Meta's Llama 3 70B, and compared their performance to that of 12 novice and 5 expert human players.
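To make the setup concrete, here is a minimal sketch of how a single puzzle might be represented and turned into a prompt; the example words, the `ConnectionsPuzzle` class, and the prompt wording are illustrative assumptions, not the authors' actual data format or evaluation harness.

```python
# Hypothetical puzzle representation; the words, categories, and prompt wording
# below are illustrative assumptions, not taken from the paper's dataset.
from dataclasses import dataclass

@dataclass
class ConnectionsPuzzle:
    words: list[str]               # the 16 words shown to the player, shuffled
    solution: dict[str, set[str]]  # category name -> its 4 member words

puzzle = ConnectionsPuzzle(
    words=["BASS", "FLOUNDER", "SOLE", "PIKE",
           "JACK", "KING", "QUEEN", "ACE",
           "RING", "BELT", "CROWN", "TITLE",
           "STRUGGLE", "FALTER", "WAVER", "STUMBLE"],
    solution={
        "FISH": {"BASS", "FLOUNDER", "SOLE", "PIKE"},
        "PLAYING CARDS": {"JACK", "KING", "QUEEN", "ACE"},
        "THINGS A CHAMPION HOLDS": {"RING", "BELT", "CROWN", "TITLE"},
        "HAVE DIFFICULTY": {"STRUGGLE", "FALTER", "WAVER", "STUMBLE"},
    },  # note how FLOUNDER doubles as a "red herring" for HAVE DIFFICULTY
)

def build_prompt(p: ConnectionsPuzzle) -> str:
    """Render a puzzle as a plain-text prompt asking for four groups of four."""
    return ("Group these 16 words into 4 categories of 4 words each and name "
            "each category:\n" + ", ".join(p.words))

print(build_prompt(puzzle))
```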
The abstract reasoning task proved to be difficult for all LLMs. GPT-4o, the best-performing model, fully solved only 8% of the games. In comparison, novice and expert human players demonstrated superior performance, with experts significantly outperforming even the best LLM.
Taxonomy of Knowledge Types
To understand the underlying challenges of the Connections game, the authors developed a taxonomy of the knowledge types required to solve the puzzles. The taxonomy includes:
- Semantic Knowledge: This encompasses lexical-semantic relations such as synonymy, hypernymy, and polysemy.
- Associative Knowledge: This involves connotative meanings and shared properties that connect words.
- Encyclopedic Knowledge: Knowledge of real-world entities and concepts that goes beyond lexical semantics.
- Multiword Expressions: Idiomatic or compound expressions that must be recognized as units rather than word by word.
- Linguistic Knowledge: Understanding morphology, phonology, or orthography to categorize words.
- Combined Knowledge: Categories that require multiple types of knowledge simultaneously.
Scoring and Evaluation
The paper employs multiple scoring schemas to evaluate performance (a minimal scoring sketch follows the list):
- Unweighted Clustering Score: Counts the number of correctly solved groups, treating all categories equally.
- Weighted Clustering Score: Weights each solved group by the difficulty level the game assigns to its category, from yellow (easiest) to purple (hardest).
- Categorical Reasoning Score: Checks not only that a group is correct but also whether the reasoning the LLM provides matches the intended category.
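The first two scores can be made concrete with a short sketch. It assumes a model's prediction is a set of four-word groups and that difficulty weights follow the game's yellow-to-purple ordering; the specific weight values and function names are assumptions, not the paper's exact definitions.

```python
from collections.abc import Iterable

# Assumed difficulty weights, ordered yellow (easiest) to purple (hardest);
# the actual weighting used in the paper may differ.
DIFFICULTY_WEIGHTS = {"yellow": 1, "green": 2, "blue": 3, "purple": 4}

def unweighted_clustering_score(predicted: Iterable[frozenset],
                                gold: dict[str, frozenset]) -> int:
    """Count how many predicted four-word groups exactly match a gold category."""
    gold_groups = set(gold.values())
    return sum(1 for group in predicted if frozenset(group) in gold_groups)

def weighted_clustering_score(predicted: Iterable[frozenset],
                              gold: dict[str, frozenset],
                              difficulty: dict[str, str]) -> int:
    """Like the unweighted score, but each solved category earns its difficulty weight."""
    predicted_groups = {frozenset(g) for g in predicted}
    return sum(DIFFICULTY_WEIGHTS[difficulty[name]]
               for name, members in gold.items()
               if members in predicted_groups)
```

Under this reading, an unweighted score of 4 corresponds to a fully solved game.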
Comparative Analysis
The results show that novice players achieved an average clustering score of 1.38, slightly better than GPT-4o's 1.17 on the same set of 100 games. Expert players, however, had an average score of 3, significantly outperforming GPT-4o's score of 1.22 on the corresponding 50 games.
LLMs generally performed best on categories requiring Semantic Knowledge and struggled most with Multiword Expressions and Combined Knowledge. Distractors, or "red herrings", added substantial difficulty, causing LLMs to misgroup words because of overlapping associations.
Implications and Future Developments
The paper highlights significant gaps in the abstract reasoning capabilities of even the most advanced LLMs when faced with complex, multi-faceted puzzles like the Connections game. These findings suggest that while LLMs have made substantial progress in various NLP tasks, their ability to reason abstractly, particularly in the presence of distractors and multi-dimensional categories, remains limited.
Future work could integrate retrieval-augmented models that draw on extensive external knowledge bases such as WordNet or specialized dictionaries. Training LLMs on data tailored to the Connections game might also improve performance, as might strategies such as explicit step-by-step reasoning and feedback mechanisms that mirror human gameplay (see the sketch below).
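As an illustration of that last point, here is a rough sketch of step-by-step, feedback-driven play in which the model proposes one group at a time and learns whether each guess was right. The `ask_llm` and `check_guess` callables, the prompt wording, and the mistake limit are all assumptions, not the paper's method.

```python
from collections.abc import Callable

def play_with_feedback(
    words: list[str],
    ask_llm: Callable[[str], list[str]],      # wraps a chat model and parses four words from its reply
    check_guess: Callable[[set[str]], bool],  # oracle: is the guess exactly one gold category?
    max_mistakes: int = 4,
) -> list[set[str]]:
    """Guess one category at a time, feeding correct/incorrect feedback back into the prompt."""
    remaining = list(words)
    solved: list[set[str]] = []
    history: list[str] = []
    mistakes = 0
    while len(remaining) > 4 and mistakes < max_mistakes:
        prompt = (
            "Think step by step. From these words, propose ONE group of four that "
            "share a category, and explain the connection:\n" + ", ".join(remaining)
            + ("\nPrevious guesses and feedback:\n" + "\n".join(history) if history else "")
        )
        guess = set(ask_llm(prompt))
        if check_guess(guess):
            solved.append(guess)
            remaining = [w for w in remaining if w not in guess]
            history.append(f"Correct: {sorted(guess)}")
        else:
            mistakes += 1
            history.append(f"Incorrect: {sorted(guess)}")
    if len(remaining) == 4 and mistakes < max_mistakes:
        solved.append(set(remaining))  # the last four words form the final group
    return solved
```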
Conclusion
The paper establishes the New York Times Connections game as a robust benchmark for evaluating abstract reasoning in LLMs. It reveals that current models, although proficient in some areas of reasoning, are still no match for expert human players and highlights specific areas where these models can be improved. The findings have significant theoretical and practical implications for the development of more adept AI systems capable of nuanced and abstract problem-solving.