
TextGames: Learning to Self-Play Text-Based Puzzle Games via Language Model Reasoning (2502.18431v1)

Published 25 Feb 2025 in cs.CL and cs.AI

Abstract: Reasoning is a fundamental capability of LLMs, enabling them to comprehend, analyze, and solve complex problems. In this paper, we introduce TextGames, an innovative benchmark specifically crafted to assess LLMs through demanding text-based games that require advanced skills in pattern recognition, spatial awareness, arithmetic, and logical reasoning. Our analysis probes LLMs' performance in both single-turn and multi-turn reasoning, and their abilities in leveraging feedback to correct subsequent answers through self-reflection. Our findings reveal that, although LLMs exhibit proficiency in addressing most easy and medium-level problems, they face significant challenges with more difficult tasks. In contrast, humans are capable of solving all tasks when given sufficient time. Moreover, we observe that LLMs show improved performance in multi-turn predictions through self-reflection, yet they still struggle with sequencing, counting, and following complex rules consistently. Additionally, models optimized for reasoning outperform pre-trained LLMs that prioritize instruction following, highlighting the crucial role of reasoning skills in addressing highly complex problems.

Analysis of "TextGames: Learning to Self-Play Text-Based Puzzle Games via LLM Reasoning"

The paper introduces a novel benchmark named "TextGames," which serves to evaluate the reasoning capabilities of LLMs in solving text-based puzzle games. This benchmark comprises a set of eight puzzle games specifically designed to test various reasoning skills, including pattern recognition, spatial awareness, arithmetic, and logical reasoning. The need for such a benchmark stems from the growing importance of evaluating LLMs beyond traditional language comprehension tasks, as reasoning remains a critical challenge for these models.

Objectives and Methodology

The primary goal of the paper is to assess LLMs' abilities to solve text-based games that require advanced reasoning. The benchmark includes games such as Anagram Scribble, Password Game, Bracket Game, String Search, Crossword Arranger, Text Sudoku, Islands, and Ordering Text, each with three levels of difficulty: Easy, Medium, and Hard. These games are chosen for their ability to represent diverse reasoning challenges that LLMs might encounter in practical applications.
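To make the benchmark's structure concrete, here is a minimal sketch of how its task inventory might be organized. The game names and three difficulty tiers come from the paper; the data structure, class names, and registry layout are assumptions for illustration only.

```python
from enum import Enum

class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

# The eight games named in the paper.
GAMES = [
    "Anagram Scribble",
    "Password Game",
    "Bracket Game",
    "String Search",
    "Crossword Arranger",
    "Text Sudoku",
    "Islands",
    "Ordering Text",
]

# Each (game, difficulty) pair defines one evaluation task: 8 games x 3 tiers = 24 task types.
TASKS = [(game, level) for game in GAMES for level in Difficulty]
```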

To evaluate LLMs, the authors developed an evaluation framework that uses a combination of model-generated solutions and graders to verify their correctness. The framework also implements a multi-turn prompting strategy, where models iteratively refine their responses based on feedback provided after each turn. This approach mimics a feedback loop, enabling self-reflection and potentially improved problem-solving by the models across multiple attempts.
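A minimal sketch of the feedback loop this describes is given below, assuming a chat-style `query_model` callable and a rule-based `grade` verifier. These names, the message format, and the turn budget are assumptions for illustration, not the paper's actual implementation.

```python
MAX_TURNS = 3  # illustrative turn budget; the paper's setting may differ

def solve_with_feedback(puzzle_prompt, query_model, grade, max_turns=MAX_TURNS):
    """Iteratively query the model, grade its answer, and feed the grader's
    message back so the model can self-correct on later turns."""
    history = [{"role": "user", "content": puzzle_prompt}]
    answer = None
    for turn in range(max_turns):
        answer = query_model(history)       # model proposes a solution
        ok, feedback = grade(answer)        # rule-based correctness check
        if ok:
            return {"solved": True, "turns": turn + 1, "answer": answer}
        # Record the failed attempt and the grader's feedback for the next turn.
        history.append({"role": "assistant", "content": answer})
        history.append({"role": "user", "content": f"Incorrect: {feedback}. Try again."})
    return {"solved": False, "turns": max_turns, "answer": answer}
```

Single-turn evaluation corresponds to running this loop with a budget of one; the multi-turn setting simply extends the budget while carrying the grader's feedback forward in the conversation history.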

Results and Observations

The results are instructive. While LLMs handle easy- and medium-difficulty tasks with a reasonable success rate, their performance drops sharply on hard-level problems. The contrast with human participants is striking: given adequate time, humans solved all tasks regardless of difficulty. The paper reports that larger models, such as those with 70B or more parameters, outperform their smaller counterparts on simpler tasks, yet they still struggle with harder problems that demand multi-faceted reasoning.

Another key finding concerns training focus: models optimized specifically for reasoning outperform those tuned primarily for instruction following. Multi-turn feedback also improved results, as models benefited from self-reflection and iterative error correction, although the size of this improvement varied across tasks.

Implications and Future Directions

The introduction of the TextGames benchmark provides a robust platform to rigorously test and understand the reasoning capabilities of LLMs, highlighting their current limitations as well as progress. Practically, this benchmark can guide further optimization and development of LLMs tailored for real-world applications requiring complex problem-solving and reasoning.

From a theoretical standpoint, these findings stress the importance of specialized training for LLMs that aligns closely with reasoning tasks rather than general instruction following. They also underline the potential value of internal feedback mechanisms within LLMs, akin to human cognitive processes that refine understanding based on past errors.

Future developments in artificial intelligence, particularly in enhancing the reasoning faculties of LLMs, would benefit from incorporating insights gained from the TextGames benchmark. Such benchmarks can continue to evolve and expand, potentially including more diverse forms of reasoning tasks and contexts, to further push the boundaries of LLM capabilities and applications in sophisticated problem-solving tasks.

In conclusion, this research provides a meaningful and rigorous evaluation of LLMs in reasoning-intensive applications, highlighting significant areas for improvement and innovation. As we move forward, such benchmarks will be invaluable in shaping the direction and effectiveness of next-generation AI systems.

Authors (4)
  1. Frederikus Hudi (6 papers)
  2. Genta Indra Winata (94 papers)
  3. Ruochen Zhang (21 papers)
  4. Alham Fikri Aji (94 papers)