Measuring General Intelligence with Generated Games: An Overview
The paper presents gg-bench, a novel benchmarking framework for evaluating general intelligence in LLMs through generated game environments. It departs from static benchmarks by introducing a dynamic process that continuously produces new evaluation instances, using LLMs not only as players but also as generators of games that can exceed their own playing ability. The benchmark is designed to assess how well LLMs adapt and reason in environments they have never encountered before.
Approach and Methodology
gg-bench is created via three primary steps:
- LLMs generate natural language descriptions of unique two-player games.
- The games are implemented as Gym environments using LLM-generated code (a minimal environment skeleton of this kind is sketched after this list).
- Reinforcement learning (RL) agents are trained to play these games through self-play.
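As a concrete illustration of the second and third steps, the sketch below shows the kind of self-contained, two-player Gym interface an LLM-generated game might expose. It is a minimal example, not taken from the paper: it uses the gymnasium package and a made-up token-removal game, and the actual gg-bench games, observation encodings, and reward conventions are produced by LLMs and will differ.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class TakeawayEnv(gym.Env):
    """Hypothetical two-player game: players alternate removing 1-3 tokens
    from a shared pile; whoever takes the last token wins. Illustrates the
    kind of self-contained Gym interface an LLM-generated game might expose."""

    def __init__(self, pile_size=21):
        super().__init__()
        self.pile_size = pile_size
        # Actions 0, 1, 2 correspond to removing 1, 2, or 3 tokens.
        self.action_space = spaces.Discrete(3)
        # Observation: [tokens remaining, current player (0 or 1)].
        self.observation_space = spaces.Box(
            low=0, high=pile_size, shape=(2,), dtype=np.int64
        )

    def _obs(self):
        return np.array([self.tokens, self.current_player], dtype=np.int64)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.tokens = self.pile_size
        self.current_player = 0
        return self._obs(), {}

    def valid_moves(self):
        # Moves that do not remove more tokens than remain in the pile.
        return [a for a in range(3) if a + 1 <= self.tokens]

    def step(self, action):
        take = action + 1
        if take > self.tokens:
            # Illegal move: the mover forfeits immediately.
            return self._obs(), -1.0, True, False, {"illegal": True}
        self.tokens -= take
        if self.tokens == 0:
            # The player who took the last token wins.
            return self._obs(), 1.0, True, False, {"winner": self.current_player}
        self.current_player = 1 - self.current_player
        return self._obs(), 0.0, False, False, {}
```

A self-play RL agent can then be trained against such an interface by letting the same policy act for both players and crediting the reward to whichever player made the terminal move.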
To evaluate an LLM, the paper uses a straightforward measure: its win rate against the trained RL agents when it is prompted with the game description, the current game state, and the list of valid moves.
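The sketch below illustrates this evaluation protocol under the same assumptions as the environment above (it reuses the hypothetical current_player and valid_moves members). The llm_move_fn and rl_move_fn callables are placeholders for the paper's actual LLM prompting and trained self-play agents; the real prompt format and move-parsing logic are not reproduced here.

```python
def play_match(env, game_description, llm_move_fn, rl_move_fn, llm_player=0):
    """Play one game; return True if the LLM-controlled player wins."""
    obs, _ = env.reset()
    while True:
        mover = env.current_player
        valid = env.valid_moves()
        if mover == llm_player:
            # The LLM sees the rules, the current state, and the legal moves.
            action = llm_move_fn(game_description, obs, valid)
        else:
            action = rl_move_fn(obs, valid)
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            # In this sketch a positive terminal reward means the mover won,
            # and a negative one (illegal move) means the mover lost.
            return (reward > 0) == (mover == llm_player)


def estimate_win_rate(env, game_description, llm_move_fn, rl_move_fn, n_games=100):
    """Fraction of games the LLM wins, alternating which side moves first."""
    wins = sum(
        play_match(env, game_description, llm_move_fn, rl_move_fn,
                   llm_player=i % 2)
        for i in range(n_games)
    )
    return wins / n_games
```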
Key Findings and Numerical Results
gg-bench presents a formidable challenge to contemporary LLMs. The paper reports that state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of only 7-9%, whereas reasoning models such as DeepSeek-R1 and OpenAI's o-series reach substantially higher win rates of 31-36%. These findings underscore the importance of structured decision-making and long-horizon planning, capabilities that reasoning models emphasize.
Implications and Future Directions
The dynamic nature of gg-bench offers several advantages. It scales: as models improve, they can synthesize harder games, which helps prevent benchmark saturation. Because the benchmark is synthetic, the evaluation process can be controlled and tailored to probe specific model capabilities. And because the games are newly generated rather than drawn from existing corpora, gg-bench reduces the risk that models succeed through memorization of training data, testing genuine adaptability instead.
The paper demonstrates the potential of self-generated tasks to avoid the pitfalls of static benchmarks, which may not reflect genuinely domain-independent capabilities. While this approach cannot cover every aspect of general intelligence (social reasoning, for example), it provides a scalable proxy for evaluating model performance in novel and diverse settings.
As AI research progresses, the methodology put forth in this paper points toward benchmarks that scale and adapt alongside the models they evaluate, offering a robust framework for assessing the evolving landscape of artificial general intelligence.