
KORGym: Evaluation Platform for Model Reasoning

Updated 28 November 2025
  • KORGym is a dynamic, game-based evaluation platform that isolates reasoning from memorized knowledge through tailored, multi-turn games.
  • It integrates a standardized Gym-style API and modular architecture to support both text-based and visual reasoning assessments.
  • The platform enables reinforcement learning experiments with rigorous reward schemes for quantitative, cross-model performance comparisons.

KORGym (Knowledge Orthogonal Reasoning Gymnasium) is a dynamic, game-based evaluation platform specifically designed to probe the intrinsic reasoning capabilities of LLMs and vision–LLMs (VLMs). By decoupling reasoning from memorized knowledge—termed “knowledge orthogonality”—KORGym employs a suite of both text-based and visual games encapsulated within a standardized Gym-style API. This mitigates the deficiencies of prior benchmarks, including narrow domain coverage, single-turn interaction, and contamination from training data. The platform facilitates rigorous, multi-turn, and reinforcement learning-based assessments of reasoning.

1. Motivation and Theoretical Framework

Traditional benchmarks used in LLM and VLM evaluation are often domain-specific (e.g., AIME for arithmetic, PHYBench for physics), thereby capturing only narrow subsets of cognitive skills and being susceptible to data memorization effects. Even broader benchmarks such as SuperGPQA and HLE conflate recall with reasoning and therefore fail to isolate core deductive, abductive, and planning capacities (Shi et al., 20 May 2025).

KORGym draws methodological inspiration from two sources:

  • KOR-Bench: Introduces “knowledge orthogonality,” framing tasks so that solution correctness depends strictly on explicit, self-contained rule-sets $R$ rather than background world knowledge $K$. A task $T$ is knowledge-orthogonal if $R \perp K$, $P(Q \to A \mid R, K) \approx P(Q \to A \mid R) \gg P(Q \to A \mid K)$, and $\beta = [P(Q \to A \mid R, K) - P(Q \to A \mid R)] / P(Q \to A \mid R) \approx 0$.
  • Gymnasium API: Extends the step(), reset(), render(), and reward() abstractions for seamless integration with RL workflows. This standardizes task representation for both single- and multi-turn reasoning scenarios.

The use of structured games addresses several desiderata: they encompass plentiful out-of-distribution instances, necessitate sequential decision-making and long-term planning, and feature well-defined reward signals suitable for quantitative assessment.

2. System Components and Architectural Design

KORGym’s software architecture consists of four modular subsystems:

  • Inference Module: Orchestrates model invocations and handles batching, asynchronous execution, and checkpointing of intermediate reasoning traces.
  • Game Interaction Module: Encodes game logic/state and provides three principal APIs: generate(seed, difficulty), print_board(state) for human- and model-facing prompts, and verify(state, action) for transition/reward logic.
  • Evaluation & Communication Module: Parses user parameters, manages inter-module communications, aggregates results, and logs final metrics.
  • Scoring Module: Implements binary ($r = 1$ iff the goal is reached), proportional, or cumulative reward schemes. To harmonize heterogeneous game scores, it applies the “Capability-Dimension Aggregated Mean” normalization:

$$S'_{g,m} = \begin{cases} \ln(1 + S_{g,m}) & \text{if } \max_m S_{g,m} > 1 \\ S_{g,m} & \text{otherwise} \end{cases}$$

Here $a_g$ and $b_g$ are the per-game minimum and maximum of $S'_{g,m}$ across models, and $\widetilde S_{g,m} = (S'_{g,m} - a_g) / (b_g - a_g)$. For a reasoning dimension $d$ with game set $G_d$, the mean $\overline S_{d,m} = \frac{1}{|G_d|} \sum_{g \in G_d} \widetilde S_{g,m}$ ensures cross-task comparability.
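
To make the aggregation concrete, the following sketch implements the normalization as defined above; the function and variable names are illustrative and are not taken from the KORGym codebase.

# Sketch of the Capability-Dimension Aggregated Mean normalization defined
# above; names are illustrative, not KORGym internals.
import math

def normalize_scores(scores_by_game):
    """scores_by_game: {game: {model: raw score}} -> per-game min-max scores in [0, 1]."""
    normalized = {}
    for game, by_model in scores_by_game.items():
        # Log-compress games whose raw scores exceed 1 (e.g. cumulative rewards).
        if max(by_model.values()) > 1:
            primed = {m: math.log1p(s) for m, s in by_model.items()}
        else:
            primed = dict(by_model)
        a_g, b_g = min(primed.values()), max(primed.values())
        span = (b_g - a_g) or 1.0  # guard against all models tying on a game
        normalized[game] = {m: (s - a_g) / span for m, s in primed.items()}
    return normalized

def dimension_mean(normalized, games_in_dimension, model):
    """Average a model's normalized scores over the games of one reasoning dimension."""
    return sum(normalized[g][model] for g in games_in_dimension) / len(games_in_dimension)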

This modular design enables multi-turn, reinforcement-learning, and both textual and multimodal evaluation paradigms with reproducible, comparative metrics (Shi et al., 20 May 2025).

3. Game Suite and Reasoning Dimensions

KORGym includes a portfolio of 51 games spanning six cognitive skill dimensions, detailed as follows:

| Category | Examples | Reasoning Modes |
|---|---|---|
| Mathematical | Date Calculation, Sudoku | Deductive, Algorithmic |
| Puzzle/Logic | 8-Puzzle, Maze Solving, Eulerian-path (One-Stroke) | Abductive, Spatial/Geometric |
| Language | Wordle, Crypto Word, Letter Connection | Deductive, Natural-Language |
| Control | Tower of Hanoi, Numeric Bricks | Sequential Control/Planning |
| Strategic | 2048, N-Point, Evolution of Trust | Strategic, Multi-step Planning |
| Visual | Jigsaw Puzzle, Find the Pattern, Visual Sokoban | Multimodal, Visual Grounding |

Approximately 42 textual games are complemented by 9 visual/multimodal games that require parsing images into internal representations before strategic reasoning. Together, the tasks target deductive, abductive, spatial, planning, and multimodal reasoning, and this functional diversity enables profiling of differential model strengths and weaknesses along distinct cognitive axes.

4. Interactive Protocols and Reinforcement-Learning Integration

Each KORGym game is modeled as a Markov Decision Process (MDP) defined by state space $S$, action space $A$, transition function $T(s, a)$, reward function $R(s, a)$, and discount factor $\gamma = 1$. Game interaction proceeds in episodes of up to 100 steps, with at least 20 independent seeds per model:

  • At timestep $t$: the environment exposes state $s_t$; the prompt is obtained via print_board($s_t$); the model emits action $a_t$; and verify($s_t$, $a_t$) returns $(s_{t+1}, r_t, \text{done})$ (see the loop sketch after this list).
  • Reinforcement learning support includes RL fine-tuning pipelines: Doubao-1.5-thinking-pro leverages DAPO/VAPO algorithms on selected games to improve policy robustness and reduce cross-seed variance.
  • The API is compatible with standardized RL toolchains, supporting batch evaluation and scoring under diverse reward schemes.
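
A minimal loop implementing this protocol might look as follows. It is a sketch that assumes a game object exposing the generate/print_board/verify interface from Section 2 and uses model.generate as a stand-in for an arbitrary LLM or VLM call; neither name is prescribed by the platform.

# Illustrative multi-seed episode loop following the MDP protocol above.
def run_episodes(game, model, num_seeds=20, max_steps=100, difficulty=1):
    """Play one episode per seed and return the per-seed cumulative rewards."""
    returns = []
    for seed in range(num_seeds):
        state = game.generate(seed=seed, difficulty=difficulty)
        total_reward, done = 0.0, False
        for _ in range(max_steps):
            prompt = game.print_board(state)                   # render s_t for the model
            action = model.generate(prompt)                    # model emits a_t
            state, reward, done = game.verify(state, action)   # (s_{t+1}, r_t, done)
            total_reward += reward
            if done:
                break
        returns.append(total_reward)
    return returns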

This alignment with RL and sequential interactive settings distinguishes KORGym from primarily single-turn, static benchmarks.

5. Experimental Findings: Model Capabilities and Benchmark Insights

Large-scale experiments encompassed 19 LLMs and 8 VLMs, providing several empirical findings:

  • Leaderboard: OpenAI O3-mini attains 82% on the Capability Dimension Aggregated Mean; Gemini-2.5-pro, 79%; Doubao-1.5-thinking-pro, 72%; DeepSeek-R1, 71%. Open-source non-thinking LLMs lag at 8–16% (Shi et al., 20 May 2025).
  • Modality Effects: Consistently higher scores for text-only tasks; however, VLMs such as Gemini-2.5-pro occasionally surpass LLMs in visual subtasks (e.g., Jigsaw, Visual Wordle), indicating superior vision–language integration.
  • Within-Family Consistency: O1/O3-mini models excel at spatial/geometric games; Gemini series at puzzle/mathematical tasks.
  • Model Scale and Tuning: Larger models outperform smaller; “thinking” variants (e.g., Claude-thinking, Doubao-thinking) achieve gains over instruct-tuned baselines.
  • Reasoning Paradigms: Annotation reveals use of code, mathematical, algorithm-specific, and natural-language reasoning. Disabling mathematical reasoning leads to the steepest performance drop; code ablation can occasionally improve outcomes by forcing non-code reasoning. High-performing models maintain robustness across ablations.
  • Reinforcement Learning Impact: RL fine-tuning empirically improves performance (Doubao-1.5-thinking-pro achieves 72% and low cross-seed variance).
  • Response Length: Longer generated answers correlate with higher scores, with minimal incremental gain beyond approximately 200 tokens.

These results reinforce the need for knowledge-orthogonal, multi-turn, and modality-diverse evaluation strategies to faithfully characterize model reasoning.

6. Usage, Extensibility, and Best Practices

KORGym is distributed both as a Python package (pip install korgym) and as source at https://github.com/multimodal-art-projection/KORGym. The interface abstracts game environments behind a traditional Gym-style workflow:

import korgym

env = korgym.make("Maze-Text", seed=42)
state = env.reset()
done = False
while not done:
    prompt = env.print_board()        # render the current state as a model-facing prompt
    action = model.generate(prompt)   # `model` is any LLM/VLM client exposing a generate() call
    state, reward, done = env.step(action)

Custom games are defined by subclassing korgym.core.BaseGame and providing implementations for generate(seed, difficulty), print_board(state), and verify(state, action). Games are registered via korgym.register("MyGame-v0", entry_point=...). Reward schemes (binary, proportional, cumulative, custom) are assigned via the BaseGame interface.
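
The sketch below illustrates such a custom game. The BaseGame subclass and the register() call follow the interface described above, but the counting game itself, its state layout, and the exact method signatures are illustrative assumptions rather than verbatim package code.

# Minimal custom-game sketch following the BaseGame interface described above;
# the game logic is purely illustrative and signatures may differ in the package.
import korgym
from korgym.core import BaseGame

class CountdownGame(BaseGame):
    """Reach exactly zero by repeatedly subtracting 1, 2, or 3 from a seeded value."""

    def generate(self, seed, difficulty):
        # Initial state: the number the model must drive to exactly zero.
        return {"value": 10 + (seed % 10) + 5 * difficulty, "steps": 0}

    def print_board(self, state):
        return f"Current value: {state['value']}. Reply with 1, 2, or 3 to subtract."

    def verify(self, state, action):
        try:
            move = int(str(action).strip())
        except ValueError:
            return state, 0.0, True                       # malformed action ends the episode
        if move not in (1, 2, 3):
            return state, 0.0, True
        next_state = {"value": state["value"] - move, "steps": state["steps"] + 1}
        done = next_state["value"] <= 0
        reward = 1.0 if next_state["value"] == 0 else 0.0  # binary reward scheme
        return next_state, reward, done

korgym.register("Countdown-v0", entry_point=CountdownGame)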

Best-practice recommendations, illustrated by the configuration sketch after this list, include:

  • Isolate reasoning by employing zero-shot prompts to minimize chain-of-thought prior influence.
  • Standardize model sampling parameters (temperature, top-p) for reproducibility.
  • Evaluate with multiple random seeds (≥20) to counteract stochasticity.
  • Apply dimension-wise normalization for fair aggregation across heterogeneous games.
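
As referenced above, an illustrative configuration capturing these recommendations might look as follows; the parameter names follow common LLM-API conventions and the specific values are assumptions, not settings prescribed by KORGym.

# Illustrative evaluation settings reflecting the recommendations above.
# Values are assumptions chosen for reproducibility, not prescribed defaults.
EVAL_CONFIG = {
    "prompting": "zero-shot",   # no in-context chain-of-thought exemplars
    "temperature": 0.0,         # fix sampling parameters for reproducibility
    "top_p": 1.0,
    "num_seeds": 20,            # at least 20 independent seeds per game and model
    "max_steps": 100,           # episode cap matching the interaction protocol
    "normalization": "capability-dimension aggregated mean",
}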

7. Limitations and Future Research Directions

KORGym’s current scope provides robust, multi-modal, and RL-capable evaluation of LLM reasoning disentangled from background knowledge. Identified limitations and proposed avenues include:

  • Extending to games with rich social interaction or explicit opponent modeling, while controlling for extraneous variability.
  • Incorporating dynamic difficulty adjustment mechanics to extend the diagnostic power of failure threshold probing.
  • Advancing toward human-in-the-loop assessment for negotiation, collaboration, and open-ended interactive tasks.

KORGym thus forms a reproducible, extensible benchmark that addresses the fundamental deficits of prior LLM and VLM assessment methodologies and offers a comprehensive empirical yardstick for evolving reasoning research and model development (Shi et al., 20 May 2025).
