Missed Connections: Lateral Thinking Puzzles for Large Language Models (2404.11730v2)

Published 17 Apr 2024 in cs.CL and cs.AI

Abstract: The Connections puzzle published each day by the New York Times tasks players with dividing a bank of sixteen words into four groups of four words that each relate to a common theme. Solving the puzzle requires both common linguistic knowledge (i.e. definitions and typical usage) as well as, in many cases, lateral or abstract thinking. This is because the four categories ascend in complexity, with the most challenging category often requiring thinking about words in uncommon ways or as parts of larger phrases. We investigate the capacity for automated AI systems to play Connections and explore the game's potential as an automated benchmark for abstract reasoning and a way to measure the semantic information encoded by data-driven linguistic systems. In particular, we study both a sentence-embedding baseline and modern LLMs. We report their accuracy on the task, measure the impacts of chain-of-thought prompting, and discuss their failure modes. Overall, we find that the Connections task is challenging yet feasible, and a strong test-bed for future work.

Lateral Thinking Challenges for LLMs: An Examination of "Missed Connections"

Introduction

"Missed Connections: Lateral Thinking Puzzles for LLMs" investigates automated AI systems' capability to solve the Connections puzzle. This puzzle, published daily by the New York Times, demands not only semantic understanding but abstract reasoning, making it a robust benchmark for studying LLMs and other NLP systems. The puzzle's increasing complexity from simple to tricky categories requires identifying thematic links among words, showcasing the nuanced understanding and flexible reasoning required to tackle it.

Methodology

Game Setup and Variants: The authors describe the standard Connections puzzle, a grid of sixteen words that must be sorted into four groups of four related words. A Python interface replicates the game, including its feedback mechanisms, along with a more challenging variant for rigorous evaluation.
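
For concreteness, a minimal Python mock-up of such an interface might look like the sketch below. The class, method, and feedback names are illustrative assumptions about the described behavior, not the authors' actual implementation:

```python
import random


class ConnectionsGame:
    """Minimal mock-up of a Connections game interface.

    Names and feedback strings are illustrative assumptions,
    not the paper's actual code.
    """

    def __init__(self, solution: dict[str, frozenset[str]], max_mistakes: int = 4):
        self.solution = solution          # theme -> its four words
        self.max_mistakes = max_mistakes
        self.mistakes = 0
        self.solved: set[str] = set()
        # Present the sixteen words in shuffled order, as the real game does.
        self.board = [w for group in solution.values() for w in group]
        random.shuffle(self.board)

    def guess(self, words: frozenset[str]) -> str:
        """Return feedback for a guessed group of exactly four words."""
        assert len(words) == 4, "a guess must contain exactly four words"
        for theme, group in self.solution.items():
            if words == group:
                self.solved.add(theme)
                return f"correct: {theme}"
            if len(words & group) == 3:
                self.mistakes += 1
                return "one away"         # the real game flags near-misses
        self.mistakes += 1
        return "incorrect"

    @property
    def over(self) -> bool:
        return len(self.solved) == 4 or self.mistakes >= self.max_mistakes
```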

Data Collection: The paper discusses the assembly of 250 puzzles sourced from an online archive covering a nearly eight-month period. These puzzles serve as the evaluation dataset, with no division into training or validation sets, preserving all puzzles for testing the models' innate capabilities.

Approaches

The paper evaluates two primary approaches:

  1. Sentence Embeddings: Uses high-dimensional vectors to capture semantic information, drawing on models such as BERT, RoBERTa, MPNet, and MiniLM. The method computes cosine similarities between word embeddings to predict thematic groups (a minimal sketch follows this list).
  2. LLMs: Specifically, models from the GPT family. These models receive a detailed prompt containing the game instructions and current game state and are tasked with predicting correct word groups. The authors also explore adjusting the prompt to elicit chain-of-thought reasoning, encouraging the model to justify its groupings step by step.
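
To make the sentence-embedding baseline concrete, the sketch below greedily assembles a single guess from pairwise cosine similarities using the sentence-transformers library. The checkpoint name and the greedy selection procedure are assumptions for illustration; the paper's exact method may differ:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Any sentence-embedding model can stand in here; the exact
# checkpoint is an assumption, not taken from the paper.
model = SentenceTransformer("all-mpnet-base-v2")


def guess_group(words: list[str]) -> list[str]:
    """Greedily pick the four mutually most-similar words as one guess."""
    emb = model.encode(words)             # (n, d) embedding matrix
    sims = cos_sim(emb, emb).numpy()      # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)           # ignore self-similarity
    # Seed with the most similar pair, then grow the group greedily.
    i, j = np.unravel_index(sims.argmax(), sims.shape)
    group = [int(i), int(j)]
    while len(group) < 4:
        remaining = [k for k in range(len(words)) if k not in group]
        best = max(remaining, key=lambda k: sims[k, group].sum())
        group.append(best)
    return [words[k] for k in group]
```

A full solver would call this routine repeatedly against the game interface, removing correctly guessed words from the board after each round of feedback.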

Experiments and Results

Baseline Evaluations: The baseline sentence-embedding models, particularly MPNet, show that semantic vectors can weakly represent the connections, albeit far less efficiently than human solvers. MPNet eventually solved every one of the 250 puzzles, though the hardest required up to 417 guesses.

LLM Performance: The GPT models performed better than the sentence-embedding baselines in the standard setting. GPT-4 significantly outperformed GPT-3.5, and chain-of-thought prompting further improved its accuracy.

Challenge Variant: Testing on a more challenging variant of the puzzle, in which all four groups must be submitted simultaneously with no intermediate feedback, showed that this setup significantly increases difficulty: success rates dropped notably, particularly when chain-of-thought prompting was used.
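
Because the solver receives no intermediate feedback under this protocol, evaluation reduces to checking a complete partition in a single step; a hypothetical scoring helper (not the paper's evaluation code) makes this concrete:

```python
def score_full_solution(proposed: list[frozenset[str]],
                        solution: list[frozenset[str]]) -> int:
    """Count how many of the four proposed groups exactly match a true group."""
    return sum(group in solution for group in proposed)
```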

Discussion

Semantic Understanding and Abstract Reasoning: The research highlights areas where LLMs struggle, such as non-semantic word properties and context-dependent usages. Despite the challenges, models like GPT-4 show promising capabilities but still fall short of human-level flexibility and insight, particularly in lateral thinking and abstract reasoning.

Chain-of-Thought Prompting: This technique considerably enhances model performance by structuring the model's reasoning process, loosely analogous to the deliberate, step-by-step strategies humans use when working through such puzzles.
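
As an illustration, a chain-of-thought variant of the game prompt might differ from the direct variant only in its final instruction. The wording below is a guess at the style of prompt the paper describes, not the authors' exact text:

```python
BASE_PROMPT = """You are playing NYT Connections. Group the sixteen words
below into four groups of four, where each group shares a common theme.

Words: {words}
"""

# Direct prompting: ask for the answer immediately.
DIRECT_SUFFIX = "Respond with your four groups and nothing else."

# Chain-of-thought prompting: ask the model to reason before answering.
COT_SUFFIX = (
    "First, think step by step: for each word, list the themes it could "
    "belong to, and note words that fit more than one theme. Only after "
    "resolving those conflicts, state your four final groups."
)
```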

Future Developments in AI

Looking forward, there are several pathways for further research:

  • Improving Solver Performance: Utilizing dedicated training data or iterative refinement of prompts could improve accuracy.
  • Integrating Explicit Knowledge Bases: Combining LLMs with structured lexical databases such as WordNet could enrich the models' semantic grounding (see the sketch after this list).
  • Puzzle Generation: Exploring LLMs' potential to not only solve but create engaging and complex puzzles could extend their application to creative domains.
  • Human vs. LLM Puzzle-Solving Strategies: Comparative studies could uncover fundamental differences in problem-solving approaches and cognitive processes between humans and models.
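
As one illustration of the knowledge-base direction above, the hypothetical helper below uses NLTK's WordNet interface to check whether a candidate group shares a common hypernym. This is a sketch of the general idea, not a method implemented in the paper:

```python
# Assumes WordNet data is installed: import nltk; nltk.download("wordnet")
from nltk.corpus import wordnet as wn


def shared_hypernyms(words: list[str]) -> set[str]:
    """Return hypernym lemmas common to every word in a candidate group."""
    common = None
    for word in words:
        # Collect every ancestor lemma across all senses of the word.
        ancestors = {
            lemma.name()
            for synset in wn.synsets(word)
            for path in synset.hypernym_paths()
            for node in path
            for lemma in node.lemmas()
        }
        common = ancestors if common is None else common & ancestors
    return common or set()


# Example: shared_hypernyms(["salmon", "trout", "bass", "flounder"])
# includes generic ancestors such as "fish", supporting a fish theme.
```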

The paper establishes the Connections puzzle as a meaningful benchmark for advancing and evaluating the reasoning capabilities of automated systems, laying a foundation for future explorations into the cognitive-like processes of AI.

Authors
  1. Graham Todd
  2. Tim Merino
  3. Sam Earle
  4. Julian Togelius