Solving and Generating NPR Sunday Puzzles with Large Language Models (2306.12255v1)
Abstract: We explore the ability of large language models (LLMs) to solve and generate puzzles from the NPR Sunday Puzzle game show using PUZZLEQA, a dataset comprising 15 years of on-air puzzles. We evaluate four LLMs on PUZZLEQA in both multiple-choice and free-response formats, and explore two prompt engineering techniques to improve free-response performance: chain-of-thought reasoning and prompt summarization. We find that state-of-the-art LLMs can solve many PUZZLEQA puzzles: the best model, GPT-3.5, achieves 50.2% loose accuracy. In our few-shot puzzle generation experiment, however, we find no evidence that models can generate puzzles: GPT-3.5 produces puzzles whose answers do not conform to its own generated rules. Puzzle generation remains a challenging task for future work.
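The abstract reports results under a "loose accuracy" metric for free-response answers. The paper's exact scoring code is not shown here; the sketch below is a minimal, assumed implementation in which a response is credited whenever the normalized gold answer appears anywhere in the normalized model output. The normalization rules (lowercasing, punctuation stripping) are illustrative assumptions, not the authors' published procedure.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def loose_match(model_answer: str, gold_answer: str) -> bool:
    """Credit a free-response answer if the gold answer appears anywhere in it."""
    return normalize(gold_answer) in normalize(model_answer)


def loose_accuracy(predictions: list[str], golds: list[str]) -> float:
    """Fraction of predictions that loosely match their gold answers."""
    hits = sum(loose_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Toy usage: the first response contains the gold answer, the second does not.
preds = ["The answer is PINEAPPLE.", "Maybe 'ocean'?"]
golds = ["pineapple", "sea"]
print(loose_accuracy(preds, golds))  # 0.5
```

A substring-based criterion like this is forgiving of verbose model outputs (e.g., "The answer is PINEAPPLE."), which is presumably why a loose metric is reported alongside the stricter multiple-choice evaluation.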
- Jingmiao Zhao
- Carolyn Jane Anderson