Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game (2407.11240v1)

Published 15 Jul 2024 in cs.AI and cs.CL

Abstract: The Connections puzzle is a word association game published daily by The New York Times (NYT). In this game, players are asked to find groups of four words that are connected by a common theme. While solving a given Connections puzzle requires both semantic knowledge and abstract reasoning, generating novel puzzles additionally requires a form of metacognition: generators must be able to accurately model the downstream reasoning of potential solvers. In this paper, we investigate the ability of the GPT family of LLMs to generate challenging and creative word games for human players. We start with an analysis of the word game Connections and the unique challenges it poses as a Procedural Content Generation (PCG) domain. We then propose a method for generating Connections puzzles using LLMs by adapting a Tree of Thoughts (ToT) prompting approach. We evaluate this method by conducting a user study, asking human players to compare AI-generated puzzles against published Connections puzzles. Our findings show that LLMs are capable puzzle creators, and can generate diverse sets of enjoyable, challenging, and creative Connections puzzles as judged by human users.


LLMs as Generators for The New York Times' Connections Puzzles

The paper "Making New Connections: LLMs as Puzzle Generators for The New York Times' Connections Word Game" explores how well LLMs, specifically the GPT family, can generate novel puzzles for the New York Times' Connections game. The game asks players to identify groups of four words connected by a common theme, so generating a puzzle demands semantic knowledge and abstract reasoning that most procedural content generation (PCG) domains do not.

Overview

The authors begin by situating Connections within the broader context of word games and highlight the unique challenges it presents. Unlike traditional games in embodied grid-worlds, Connections puzzles operate directly at the level of language and semantics, making traditional PCG techniques less effective. The paper proposes an iterative approach for using LLMs to generate Connections puzzles, using a Tree of Thoughts (ToT) strategy adapted for this novel domain.

Methodology

Identifying Puzzle Constraints

The methodology starts by identifying fundamental constraints that define valid Connections puzzles. These constraints ensure that each puzzle consists of 16 unique words grouped into four distinct categories, each word used exactly once with no overlap in category themes:

Varied Categories: Categories are thematically distinct.
Unique Names: Category names do not include the words they categorize.
Spelling Matters: Words must fit into their categories with the given spelling.
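The structural part of these constraints can be checked mechanically. The following is a minimal sketch, not the authors' code: the function name `validate_puzzle` and the dictionary representation (category name mapped to its four words) are assumptions made for illustration. Thematic distinctness and spelling fit require semantic judgment and are not covered here.

```python
from typing import Dict, List


def validate_puzzle(puzzle: Dict[str, List[str]]) -> List[str]:
    """Return a list of constraint violations; an empty list means the puzzle passes.

    Hypothetical helper: `puzzle` maps each category name to its four words.
    """
    problems = []
    if len(puzzle) != 4:
        problems.append(f"expected 4 categories, found {len(puzzle)}")
    seen = set()
    for name, words in puzzle.items():
        if len(words) != 4:
            problems.append(f"category {name!r} has {len(words)} words, expected 4")
        for word in words:
            key = word.lower()
            # Each of the 16 words may be used exactly once.
            if key in seen:
                problems.append(f"word {word!r} appears more than once")
            seen.add(key)
            # "Unique Names": the category name should not contain a member word.
            if key in name.lower().split():
                problems.append(f"category {name!r} contains its own word {word!r}")
    return problems
```

A puzzle that reuses a word across two categories, for example, would come back with a "appears more than once" message rather than an empty list.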

Generative Pipeline

The generative pipeline has three stages: a puzzle creator, a puzzle editor, and human evaluation. The puzzle creator is an LLM (GPT-4) that generates word pools and categories through an iterative approach. The puzzle editor, another instance of GPT-4, corrects any inaccuracies in the category names. Finally, human evaluators assess the overall quality and difficulty of the generated puzzles in a user study.
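The creator and editor stages can be sketched as a small orchestration loop. This is an illustrative outline under stated assumptions, not the paper's implementation: `Creator`, `Editor`, and `generate_puzzle` are hypothetical names, and each callable stands in for a GPT-4 prompt whose wording and output format are not reproduced here. The human evaluation stage is outside the code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces; in practice each would wrap an LLM prompt.
# Creator: given the groups accepted so far, propose (category name, words).
Creator = Callable[[Dict[str, List[str]]], Tuple[str, List[str]]]
# Editor: given a category name and its words, return a corrected name.
Editor = Callable[[str, List[str]], str]


def generate_puzzle(
    creator: Creator, editor: Editor, max_attempts: int = 20
) -> List[Tuple[str, List[str]]]:
    """Build a puzzle one group at a time, then let the editor revise each name."""
    groups: Dict[str, List[str]] = {}
    used = set()
    for _ in range(max_attempts):
        if len(groups) == 4:
            break
        name, words = creator(groups)
        # Accept a proposal only if it has four distinct, unused words and a new name.
        fresh = len(words) == 4 and len(set(words)) == 4 and used.isdisjoint(words)
        if fresh and name not in groups:
            groups[name] = list(words)
            used.update(words)
    if len(groups) != 4:
        raise RuntimeError("creator did not produce four valid groups")
    # Editing pass: only the category names change, never the words.
    return [(editor(name, words), words) for name, words in groups.items()]
```

Passing the already accepted groups back to the creator is one simple way to make generation iterative, since each new proposal can be conditioned on what the puzzle already contains.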

Two main types of difficulty-inducing strategies are employed:

Intentional Overlap: Where words from one category are deliberately included in another under a different semantic meaning.
False Groups: Where a seemingly valid but incorrect group of related words (the "false group") is included to mislead players.
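Both strategies can be expressed as simple checks on a finished puzzle. The sketch below is an assumption-laden illustration rather than the authors' method: `overlapping_words`, `is_false_group`, and the `pools` argument (a broader set of words that would fit each category's theme) are hypothetical, and in a real pipeline those pools would come from the LLM.

```python
from typing import Dict, List, Set


def overlapping_words(
    groups: Dict[str, List[str]], pools: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each puzzle word to the other categories whose theme it would also fit."""
    overlaps = {}
    for name, words in groups.items():
        for word in words:
            others = [
                other for other, pool in pools.items()
                if other != name and word in pool
            ]
            if others:
                overlaps[word] = others
    return overlaps


def is_false_group(groups: Dict[str, List[str]], candidate: List[str]) -> bool:
    """A red herring uses only puzzle words but spans more than one true category."""
    owner = {word: name for name, words in groups.items() for word in words}
    if any(word not in owner for word in candidate):
        return False
    return len({owner[word] for word in candidate}) > 1
```

For example, if BASS is placed in a fish category while the puzzle also contains a musical instruments category, `overlapping_words` would flag BASS as a word that plausibly belongs to both.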

Evaluating Difficulty

To objectively evaluate the difficulty of word groups, the authors use a cosine similarity metric based on word embeddings from the MPNet model. This metric helps categorize word groups into different difficulty levels (yellow, green, blue, purple) by measuring intra-group semantic similarity.
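A rough sketch of such a metric follows. Several details are assumptions, not taken from the paper: the group score is the mean pairwise cosine similarity, more cohesive groups are treated as easier (yellow) and less cohesive ones as harder (purple), and `embed` is a plain dictionary of vectors standing in for real MPNet embeddings. The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def group_similarity(words: List[str], embed: Dict[str, Sequence[float]]) -> float:
    """Mean cosine similarity over all word pairs in a group (assumed aggregation)."""
    pairs = list(combinations(words, 2))
    return sum(cosine(embed[a], embed[b]) for a, b in pairs) / len(pairs)


def assign_colors(
    groups: Dict[str, List[str]], embed: Dict[str, Sequence[float]]
) -> Dict[str, str]:
    """Rank groups from most to least cohesive and label them yellow to purple.

    Assumption: higher intra-group similarity means an easier group.
    """
    colors = ["yellow", "green", "blue", "purple"]
    ranked = sorted(
        groups,
        key=lambda name: group_similarity(groups[name], embed),
        reverse=True,
    )
    return dict(zip(ranked, colors))
```

With real embeddings, a group such as four fish names would be expected to score higher than a wordplay group whose members share only a hidden pattern, which is the intuition behind ranking by cohesion.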

User Study

To evaluate the generated puzzles, the authors conducted a user study. Participants played both AI-generated puzzles and puzzles published by the New York Times, and rated them on creativity, difficulty, and overall enjoyment.

Results

The user study revealed that AI-generated puzzles, especially those using the iterative approach with intentional overlap or LLM-generated false groups, were often rated as comparable to NYT puzzles in terms of difficulty. Puzzles generated in a single step were frequently seen as less creative and enjoyable, but the iterative methods closed this gap significantly. Notably:

Intentional Overlap Puzzles were deemed more difficult in 60% of comparisons to NYT puzzles.
LLM-generated False Group Puzzles were well-received, with users finding them equally or more enjoyable in a significant number of cases.

Discussion and Future Work

Difficulty Discrepancy

The paper identified a discrepancy in difficulty between puzzles built on LLM-generated false groups and puzzles seeded with false groups from NYT puzzles. The key factor was the semantic similarity within the false groups, which affects how effectively a false group misdirects players.

Game Generation in PCG Domain

Connections puzzles present unique challenges that require both semantic understanding and creative category generation. Although the generated puzzles were competitive in terms of difficulty and sometimes user preference, achieving the ideal balance of challenge and creativity remains a complex task.

Future Directions

Future work should explore:

Enhanced prompts to better capture subtlety and creativity.
Integration of human design inputs to hybridize human and AI creativity.
Broadened testing to include longitudinal studies on repeated exposure to AI-generated puzzles.

Conclusion

The paper demonstrates that LLMs, particularly GPT-4, can generate viable, and often competitive, Connections puzzles by leveraging iterative prompting and embedding-based difficulty metrics. This presents a promising direction for both standalone PCG applications and as tools that augment human designers in creating complex, enjoyable word puzzles. Further refinement and integration could pave the way for more sophisticated semantic puzzle generation in various gaming and educational contexts.