- The paper presents a novel benchmark, GameTraversalBenchmark (GTB), that rigorously tests large language models’ planning abilities using 2D grid maps.
- The methodology employs character-based map representations and evaluates performance with metrics like GTB-Score, Mean Generation Errors, and accuracy rates.
- Results show GPT-4-Turbo achieving a GTB-Score of 44.97%, versus 67.84% for the large reasoning model o1, highlighting significant challenges and room for improvement in LLM planning.
Evaluating LLMs with GameTraversalBenchmark
The paper "GameTraversalBenchmark: Evaluating Planning Abilities Of LLMs Through Traversing 2D Game Maps" offers a rigorous investigation into the planning capabilities of LLMs by introducing a novel benchmark named GameTraversalBenchmark (GTB). This research is significant in exploring the potential of LLMs beyond their traditional domain of natural language processing, exploring their ability to navigate and plan within complex environments.
Benchmark Design and Implementation
GTB is built from a diverse dataset of 2D grid-based game maps. These maps challenge an LLM's planning faculties by requiring it to reach a sequence of objectives with as few steps and as few generation errors as possible. The approach leverages character-based map representations, a format largely absent from the training data of current LLMs. This setup ensures that models cannot rely on simple lookups and must engage in genuine planning to determine efficient paths; a concrete sketch of such a map and plan check is given below.
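To make the setup concrete, the following minimal sketch shows how a character-based grid map and a plan check might look. The map symbols ('#' wall, '.' floor, 'A'/'B' objectives), the move format, and the error-counting rule are illustrative assumptions, not the benchmark's exact encoding.

```python
# Minimal sketch: a character-based grid map and a checker for an LLM-proposed plan.
# Symbols and move format are illustrative assumptions, not GTB's exact encoding.

GRID = [
    "#######",
    "#A...B#",
    "#.###.#",
    "#.....#",
    "#######",
]

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def find(symbol: str) -> tuple[int, int]:
    """Return the (row, col) of a symbol in the grid."""
    for r, row in enumerate(GRID):
        c = row.find(symbol)
        if c != -1:
            return r, c
    raise ValueError(f"{symbol!r} not found")


def follow_path(start: tuple[int, int], plan: str) -> tuple[tuple[int, int], int]:
    """Apply a move string; count moves that would enter a wall as generation errors."""
    r, c = start
    errors = 0
    for step in plan:
        dr, dc = MOVES[step]
        nr, nc = r + dr, c + dc
        if GRID[nr][nc] == "#":
            errors += 1          # invalid move: stay in place, record an error
        else:
            r, c = nr, nc
    return (r, c), errors


if __name__ == "__main__":
    start, goal = find("A"), find("B")
    plan = "RRRR"                # a candidate plan an LLM might emit
    pos, errors = follow_path(start, plan)
    print("reached goal:", pos == goal, "| errors:", errors, "| path length:", len(plan))
```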
Evaluation within GTB combines several metrics: the overall GTB-Score, Mean Generation Errors (MGE), Mean Path Length (MPL), and accuracy at several tolerances (Top-0, Top-1, and Top-5 Accuracy). Together, these metrics provide a comprehensive picture of an LLM's efficiency and accuracy in planning tasks.
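The sketch below illustrates how such per-map results might be aggregated. The field names and the Top-k definition used here (a solved map whose path is at most k steps longer than optimal) are assumptions for illustration; the paper's actual GTB-Score formula is not reproduced.

```python
# Hedged sketch of aggregating per-map results into MGE, MPL, and Top-k accuracy.
# Field names and the Top-k definition are illustrative assumptions, not GTB's
# exact specification; the paper's GTB-Score formula is not reproduced here.
from dataclasses import dataclass


@dataclass
class MapResult:
    generation_errors: int   # invalid moves produced for this map
    path_length: int         # length of the path the model produced
    optimal_length: int      # length of the shortest valid path
    solved: bool             # whether all objectives were reached


def mean_generation_errors(results: list[MapResult]) -> float:
    return sum(r.generation_errors for r in results) / len(results)


def mean_path_length(results: list[MapResult]) -> float:
    return sum(r.path_length for r in results) / len(results)


def top_k_accuracy(results: list[MapResult], k: int) -> float:
    """Fraction of maps solved with a path at most k steps longer than optimal."""
    hits = sum(1 for r in results if r.solved and r.path_length <= r.optimal_length + k)
    return hits / len(results)


if __name__ == "__main__":
    runs = [
        MapResult(generation_errors=0, path_length=12, optimal_length=12, solved=True),
        MapResult(generation_errors=2, path_length=18, optimal_length=14, solved=True),
        MapResult(generation_errors=5, path_length=30, optimal_length=16, solved=False),
    ]
    print("MGE:", mean_generation_errors(runs))
    print("MPL:", mean_path_length(runs))
    for k in (0, 1, 5):
        print(f"Top-{k} accuracy:", round(top_k_accuracy(runs, k), 2))
```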
Baseline and State-of-the-Art Evaluations
The research evaluates a variety of LLMs on GTB, with GPT-4-Turbo emerging as the strongest traditional LLM yet achieving a GTB-Score of only 44.97%. It earned its higher score by generating fewer errors and producing shorter paths than its counterparts. Still, its performance, like that of the other models, remains below 50%, underscoring the substantial challenge posed by the benchmark.
The paper also offers preliminary insights into the capabilities of large reasoning models (LRMs) such as o1, which achieved a GTB-Score of 67.84%. Although o1 surpasses traditional LLMs, the results suggest that current models, even at their best, leave significant room for improvement on the planning tasks defined by GTB.
Implications and Future Directions
GTB introduces a promising framework for evaluating LLMs' planning abilities, but it also opens new avenues for future research. There is potential for enhancing LLM performance on GTB through fine-tuning and investigating the generalization capabilities of fine-tuned models on unseen benchmarks. Future iterations of GTB could integrate dynamic elements, such as moving obstacles or enemies, to further challenge the planning faculties of LLMs. Moreover, exploring variable prompts tailored to LLM capacities could yield insights into optimizing their planning efficiency.
The research, novel in scope and challenging in its tasks, encourages a larger conversation about the broader applicability of LLM technologies. By rigorously assessing LLMs in non-traditional domains like game traversal, GTB pushes the boundaries of what is currently achievable and sets a high bar for the development of more versatile and capable AI systems.
Conclusion
The GameTraversalBenchmark represents a significant step in evaluating and understanding the planning capabilities of LLMs. By providing a rigorous and diverse testing ground, this benchmark highlights current limitations and facilitates the development of more advanced LLMs, ultimately contributing to the evolution of general AI models. As research continues, exploring these frontiers will undoubtedly reveal new insights and drive innovation in AI planning capabilities.