KORGym: Evaluation Platform for Model Reasoning
- KORGym is a dynamic, game-based evaluation platform that isolates reasoning from memorized knowledge through tailored, multi-turn games.
- It integrates a standardized Gym-style API and modular architecture to support both text-based and visual reasoning assessments.
- The platform enables reinforcement learning experiments with rigorous reward schemes for quantitative, cross-model performance comparisons.
KORGym (Knowledge Orthogonal Reasoning Gymnasium) is a dynamic, game-based evaluation platform specifically designed to probe the intrinsic reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). By decoupling reasoning from memorized knowledge (termed “knowledge orthogonality”), KORGym employs a suite of both text-based and visual games encapsulated within a standardized Gym-style API. This mitigates the deficiencies of prior benchmarks, including narrow domain coverage, single-turn interaction, and contamination from training data. The platform facilitates rigorous, multi-turn, and reinforcement learning-based assessments of reasoning.
1. Motivation and Theoretical Framework
Traditional benchmarks for LLM and VLM evaluation are often domain-specific (e.g., AIME for mathematics, PHYBench for physics); they therefore capture only narrow subsets of cognitive skills and remain susceptible to data-memorization effects. Even broader benchmarks such as SuperGPQA and HLE conflate recall with reasoning and thus fail to isolate core deductive, abductive, and planning capacities (Shi et al., 20 May 2025).
KORGym draws methodological inspiration from two sources:
- KOR-Bench: Introduces “knowledge orthogonality,” framing tasks so that solution correctness depends strictly on an explicit, self-contained rule set $R$ rather than background world knowledge $K$. A task is knowledge-orthogonal if its rules are stated fully within the prompt, are independent of the model’s pretrained knowledge, and suffice on their own to determine the answer (formalized in the block after this list).
- Gymnasium API: Extends the step(), reset(), render(), and reward() abstractions for seamless integration with RL workflows. This standardizes task representation for both single- and multi-turn reasoning scenarios.
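One compact way to write the knowledge-orthogonality requirement is sketched below; the symbols $Q$ (question), $R$ (rule set), $K$ (background knowledge), and $A$ (answer) are illustrative and not taken verbatim from KOR-Bench’s notation.

```latex
% Illustrative formalization of knowledge orthogonality (notation assumed):
% the answer is a function of the question and the explicit rules alone,
% and carries no additional dependence on background knowledge K.
\[
  A = f(Q, R)
  \qquad\text{and}\qquad
  I\!\left(A;\, K \,\middle|\, Q, R\right) \approx 0 .
\]
```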
The use of structured games addresses several desiderata: they encompass plentiful out-of-distribution instances, necessitate sequential decision-making and long-term planning, and feature well-defined reward signals suitable for quantitative assessment.
2. System Components and Architectural Design
KORGym’s software architecture consists of four modular subsystems:
- Inference Module: Orchestrates model invocations; handles batching, asynchronous execution, and checkpointing of intermediate reasoning traces.
- Game Interaction Module: Encodes game logic/state and provides three principal APIs: generate(seed, difficulty), print_board(state) for human- and model-facing prompts, and verify(state, action) for transition/reward logic.
- Evaluation & Communication Module: Parses user parameters, manages inter-module communications, aggregates results, and logs final metrics.
- Scoring Module: Implements binary ($r = 1$ iff the goal is reached, else $r = 0$), proportional, or cumulative reward schemes. To harmonize heterogeneous game scores, it applies the “Capability-Dimension Aggregated Mean” normalization $\tilde{s}_g = (s_g - s_g^{\min}) / (s_g^{\max} - s_g^{\min})$, where $s_g^{\min}$ and $s_g^{\max}$ are the per-game minimum/maximum and $\tilde{s}_g \in [0, 1]$. For a reasoning dimension $d$, averaging the normalized scores of its constituent games, $\bar{S}_d = \tfrac{1}{|G_d|} \sum_{g \in G_d} \tilde{s}_g$, ensures cross-task comparability (see the sketch following this list).
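A minimal sketch of this normalization and dimension-wise aggregation, assuming raw per-game scores are already collected (the function name and dictionary layout here are illustrative, not part of the KORGym API):

```python
from collections import defaultdict

def capability_dimension_mean(raw_scores, game_dimensions, score_ranges):
    """Aggregate per-game scores into per-dimension means via min-max normalization.

    raw_scores:      {game: raw score achieved by the evaluated model}
    game_dimensions: {game: reasoning dimension the game belongs to}
    score_ranges:    {game: (min_score, max_score) for that game}
    """
    per_dimension = defaultdict(list)
    for game, score in raw_scores.items():
        lo, hi = score_ranges[game]
        # Normalize each game's score into [0, 1] so heterogeneous games are comparable.
        normalized = (score - lo) / (hi - lo) if hi > lo else 0.0
        per_dimension[game_dimensions[game]].append(normalized)
    # Average the normalized scores within each reasoning dimension.
    return {dim: sum(vals) / len(vals) for dim, vals in per_dimension.items()}
```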
This modular design enables multi-turn and reinforcement-learning evaluation paradigms across both textual and multimodal settings, with reproducible, comparative metrics (Shi et al., 20 May 2025).
3. Game Suite and Reasoning Dimensions
KORGym includes a portfolio of 51 games spanning six cognitive skill dimensions, detailed as follows:
| Category | Examples | Reasoning Modes |
|---|---|---|
| Mathematical | Date Calculation, Sudoku | Deductive, Algorithmic |
| Puzzle/Logic | 8-Puzzle, Maze Solving, Eulerian-path (One-Stroke) | Abductive, Spatial/Geometric |
| Language | Wordle, Crypto Word, Letter Connection | Deductive, Natural-Language |
| Control | Tower of Hanoi, Numeric Bricks | Sequential Control/Planning |
| Strategic | 2048, N-Point, Evolution of Trust | Strategic, Multi-step Planning |
| Visual | Jigsaw Puzzle, Find the Pattern, Visual Sokoban | Multimodal, Visual Grounding |
Textual games (42) are complemented by visual/multimodal games (9) that require parsing images into an internal representation before strategic reasoning. The tasks target deductive, abductive, spatial, planning, and multimodal reasoning. This functional diversity enables profiling of differential model strengths and weaknesses along distinct cognitive axes.
4. Interactive Protocols and Reinforcement-Learning Integration
Each KORGym game is modeled as a Markov Decision Process (MDP) defined by a state space $\mathcal{S}$, an action space $\mathcal{A}$, a transition function $P(s_{t+1} \mid s_t, a_t)$, a reward function $R(s_t, a_t)$, and a discount factor $\gamma$. Game interaction proceeds in episodes with up to 100 steps and at least 20 independent seeds per model:
- At timestep $t$: the model observes state $s_t$, receives the prompt produced by print_board($s_t$), emits action $a_t$, and verify($s_t$, $a_t$) applies the transition/reward logic to yield $s_{t+1}$ and $r_t$.
- Reinforcement learning support includes RL fine-tuning pipelines: Doubao-1.5-thinking-pro leverages DAPO/VAPO algorithms on selected games to improve policy robustness and reduce cross-seed variance.
- The API is compatible with standardized RL toolchains, supporting batch evaluation and scoring under diverse reward schemes (a multi-seed evaluation sketch follows this list).
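As an illustration of the protocol above, the following sketch evaluates one game across multiple seeds, assuming the korgym interface shown in Section 6 and a hypothetical `model` wrapper exposing generate(prompt); the 100-step cap and 20-seed minimum mirror the episode protocol:

```python
import korgym  # interface shown in Section 6

MAX_STEPS = 100   # per-episode step cap from the protocol above
NUM_SEEDS = 20    # minimum number of independent seeds per model

def evaluate_game(model, game_name="Maze-Text"):
    """Run one episode per seed and return the per-seed cumulative rewards."""
    per_seed_rewards = []
    for seed in range(NUM_SEEDS):
        env = korgym.make(game_name, seed=seed)
        state, done, total_reward = env.reset(), False, 0.0
        for _ in range(MAX_STEPS):
            prompt = env.print_board()        # model-facing description of the state
            action = model.generate(prompt)   # hypothetical LLM/VLM wrapper
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        per_seed_rewards.append(total_reward)
    return per_seed_rewards
```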
This alignment with RL and sequential interactive settings distinguishes KORGym from primarily single-turn, static benchmarks.
5. Experimental Findings: Model Capabilities and Benchmark Insights
Large-scale experiments encompassed 19 LLMs and 8 VLMs, providing several empirical findings:
- Leaderboard: OpenAI O3-mini attains 82% on the Capability-Dimension Aggregated Mean; Gemini-2.5-pro, 79%; Doubao-1.5-thinking-pro, 72%; DeepSeek-R1, 71%. Open-source non-thinking LLMs lag at 8–16% (Shi et al., 20 May 2025).
- Modality Effects: Consistently higher scores for text-only tasks; however, VLMs such as Gemini-2.5-pro occasionally surpass LLMs in visual subtasks (e.g., Jigsaw, Visual Wordle), indicating superior vision–language integration.
- Within-Family Consistency: O1/O3-mini models excel at spatial/geometric games; Gemini series at puzzle/mathematical tasks.
- Model Scale and Tuning: Larger models outperform smaller ones; “thinking” variants (e.g., Claude-thinking, Doubao-thinking) achieve gains over instruct-tuned baselines.
- Reasoning Paradigms: Annotation reveals use of code, mathematical, algorithm-specific, and natural-language reasoning. Disabling mathematical reasoning leads to the steepest performance drop; code ablation can occasionally improve outcomes by forcing non-code reasoning. High-performing models maintain robustness across ablations.
- Reinforcement Learning Impact: RL fine-tuning empirically improves performance (Doubao-1.5-thinking-pro achieves 72% and low cross-seed variance).
- Response Length: Longer generated answers correlate with higher scores, with minimal incremental gain beyond approximately 200 tokens.
These results reinforce the need for knowledge-orthogonal, multi-turn, and modality-diverse evaluation strategies to faithfully characterize model reasoning.
6. Usage, Extensibility, and Best Practices
KORGym is distributed both as a Python package (pip install korgym) and as source code at https://github.com/multimodal-art-projection/KORGym. The interface exposes game environments through the traditional Gym workflow:
```python
import korgym

env = korgym.make("Maze-Text", seed=42)
state, done = env.reset(), False
while not done:
    prompt = env.print_board()
    action = model.generate(prompt)  # `model` is the LLM/VLM under evaluation
    state, reward, done = env.step(action)
```
Custom games are defined by subclassing korgym.core.BaseGame and providing implementations for generate(seed, difficulty), print_board(state), and verify(state, action). Games are registered via korgym.register("MyGame-v0", entry_point=...). Reward schemes (binary, proportional, cumulative, custom) are assigned via the BaseGame interface.
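A minimal custom-game sketch following these conventions is shown below; korgym.core.BaseGame and korgym.register are named in the text above, but the exact method signatures, state layout, and return values used here are illustrative assumptions:

```python
import random

import korgym
from korgym.core import BaseGame  # base class named above; exact interface assumed

class GuessNumberGame(BaseGame):
    """Toy game: guess a hidden integer; binary reward on an exact match."""

    def generate(self, seed, difficulty):
        # Build the initial state deterministically from the seed and difficulty.
        rng = random.Random(seed)
        upper = 10 ** max(1, difficulty)
        return {"target": rng.randint(1, upper), "upper": upper}

    def print_board(self, state):
        # Model-facing prompt describing the rules and current state.
        return (f"Guess the hidden integer between 1 and {state['upper']}. "
                "Reply with a single number.")

    def verify(self, state, action):
        # Transition/reward logic: binary reward iff the guess is correct; episode ends.
        correct = action.strip() == str(state["target"])
        return dict(state), float(correct), True  # (next_state, reward, done) assumed

korgym.register("GuessNumber-v0", entry_point=GuessNumberGame)
```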
Best practice recommendations include:
- Isolate reasoning by employing zero-shot prompts to minimize the influence of chain-of-thought priors.
- Standardize model sampling parameters (temperature, top-p) for reproducibility.
- Evaluate with multiple random seeds (≥20) to counteract stochasticity.
- Apply dimension-wise normalization for fair aggregation across heterogeneous games.
7. Limitations and Future Research Directions
KORGym’s current scope provides robust, multi-modal, and RL-capable evaluation of LLM reasoning disentangled from background knowledge. Identified limitations and proposed avenues include:
- Extending to games with rich social interaction or explicit opponent modeling, while controlling for extraneous variability.
- Incorporating dynamic difficulty adjustment mechanics to extend the diagnostic power of failure threshold probing.
- Advancing toward human-in-the-loop assessment for negotiation, collaboration, and open-ended interactive tasks.
KORGym thus forms a reproducible, extensible benchmark that addresses the fundamental deficits of prior LLM and VLM assessment methodologies and offers a comprehensive empirical yardstick for evolving reasoning research and model development (Shi et al., 20 May 2025).