Real-Time Reasoning Gym
- Real-Time Reasoning Gym is a structured evaluation environment that tests agents' ability to perform complex reasoning under strict time constraints.
- It integrates asynchronous MDP formulations and modular architectures to optimize both logical accuracy and reaction latency.
- The gym benchmarks various agent paradigms across diverse tasks—from abstract logic to visual reasoning—enabling dynamic curriculum and performance evaluation.
A real-time reasoning gym is a structured evaluation and training environment designed to test agents’ ability to perform complex reasoning tasks under active environmental dynamics and explicit temporal constraints. The paradigm generalizes the classic reinforcement learning (RL) gym concept by incorporating temporal, perceptual, and cognitive requirements essential for modeling agent behavior in high-stakes, fast-evolving settings. Such environments play a central role in benchmarking LLMs, vision-LLMs (VLMs), model-free and model-based RL agents, and symbolically driven planners across a spectrum of reasoning domains.
1. Formalization and Core Principles
Real-time reasoning gyms are built on asynchronous or strictly time-stepped Markov decision process (MDP) formulations. The defining feature is that environment state advances according to a fixed schedule—measured either in wall-clock time or by surrogate indicators such as the number of tokens generated by an LLM—irrespective of whether agent computation has completed. Formally, at each discrete time $t$, the environment is in state $s_t$ and executes the agent's proposed action only if the agent responds within a pre-defined budget; otherwise, a default action is imposed:

$$
a_t = \begin{cases} \hat{a}_t & \text{if the agent responds within the budget } B, \\ a_{\text{default}} & \text{otherwise}, \end{cases}
$$

where $\hat{a}_t$ is the agent's proposed action and $a_{\text{default}}$ is applied if the agent fails to respond within its computational allowance (e.g., $B$ tokens or seconds).
This structure mandates simultaneous optimization of logical soundness and reaction latency. The gym’s step function enforces token or time budgets, and persistent logs are maintained for all action-observation pairs and reward assignments (Wen et al., 7 Nov 2025).
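To make the budget-enforcement contract concrete, the following is a minimal Python sketch of such a step function. The class names (`BudgetedEnv`, `StepLog`) and the inner-environment interface (`observe()`, `step()`) are illustrative assumptions, not the actual API of any cited benchmark:

```python
from dataclasses import dataclass, field

@dataclass
class StepLog:
    """Persistent record of one action-observation-reward interaction."""
    state: object
    action: object
    reward: float
    tokens_used: int
    timed_out: bool

@dataclass
class BudgetedEnv:
    """Hypothetical wrapper enforcing a per-step token budget with a default action."""
    inner_env: object       # assumed Gym-style environment exposing observe()/step()
    token_budget: int       # computational allowance B per step
    default_action: object  # a_default, imposed when the agent exceeds the budget
    logs: list = field(default_factory=list)

    def step(self, agent):
        state = self.inner_env.observe()
        # The agent reports both its proposed action and the tokens it spent.
        proposed_action, tokens_used = agent.act(state, budget=self.token_budget)
        timed_out = tokens_used > self.token_budget
        action = self.default_action if timed_out else proposed_action
        reward, done = self.inner_env.step(action)
        self.logs.append(StepLog(state, action, reward, tokens_used, timed_out))
        return reward, done
```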
2. Architectures and Systems
Most real-time reasoning gym platforms implement modular architectures with standardized interfaces to facilitate agent–environment coordination under time pressure. Common system modules include:
| Module | Role | Example Frameworks |
|---|---|---|
| Environment Wrapper | Enforces token/time budgets, default actions, event stepping | RealTimeEnv (RealTime Reasoning Gym), KORGym Game Interaction Module |
| Inference/Policy | Manages model inference under budget, supports batching/streaming | AgileThinker, Inference Modules in KORGym |
| Task Generator | Procedural environment and instance creation, dynamic difficulty | KORGym, Reasoning Gym, KnotGym |
| Evaluation Module | Aggregation of per-episode results, normalization, leaderboard output | KORGym, Reasoning Gym, gr-libs |
Agents can be instantiated in reactive (bounded compute, single-step) or planning (multi-step, deliberative, multi-action) modes—or hybridized through dual-threaded execution as in AgileThinker (Wen et al., 7 Nov 2025). Games and tasks are standardized to allow batched evaluation, online RL loops, and cross-method comparison (Shi et al., 20 May 2025, Stojanovski et al., 30 May 2025).
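The reactive/planning distinction can be illustrated roughly as follows; the `ReactiveAgent`/`PlanningAgent` classes and the `model.complete`/`model.plan` calls are assumed interfaces for the sketch, not code from the cited frameworks:

```python
class ReactiveAgent:
    """Bounded-compute agent: one action per step, within the per-step budget."""
    def __init__(self, model, max_tokens):
        self.model = model
        self.max_tokens = max_tokens

    def act(self, state, budget):
        # Single short completion; token usage is capped by the tighter of the two limits.
        limit = min(budget, self.max_tokens)
        action, tokens = self.model.complete(state, max_tokens=limit)
        return action, tokens


class PlanningAgent:
    """Deliberative agent: builds a multi-step plan, then executes it action by action."""
    def __init__(self, model, plan_tokens):
        self.model = model
        self.plan_tokens = plan_tokens
        self.plan = []

    def act(self, state, budget):
        tokens = 0
        if not self.plan:
            # Long chain-of-thought rollout; may exceed the per-step budget,
            # in which case the wrapper above imposes the default action.
            self.plan, tokens = self.model.plan(state, max_tokens=self.plan_tokens)
        return self.plan.pop(0), tokens
```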
3. Task Domains and Complexity Scaling
Real-time reasoning gyms expose a wide variety of benchmark tasks, each emphasizing distinct reasoning skills, perceptual load, and action complexities:
- Abstract Reasoning and Logic: Algebra, propositional logic, graph theory, circuit logic, syllogisms (Reasoning Gym (Stojanovski et al., 30 May 2025))
- Mathematical and Statistical Computations: Polynomial equations, prime factorization, arithmetic puzzles (KORGym, RG)
- Spatial and Geometric Planning: Polygon area, maze navigation, knot tying/manipulation (KORGym, KnotGym (Chen et al., 23 May 2025))
- Game-Theoretic and Multi-Agent Scenarios: 2048, N-point, Evolution of Trust, Overcooked partner coordination (KORGym, RealTime Reasoning Gym)
- Temporal Control and Hazard Avoidance: Freeway, Snake (RealTime Reasoning Gym (Wen et al., 7 Nov 2025))
- Visual and Multimodal Reasoning: Jigsaw puzzle, Visual Sokoban, ARC-style cognition (KORGym, KnotGym)
Difficulty is adjusted continuously via task parameterization: expanding action sets, increasing state space dimensionality (e.g., more grid cells, more possible knot crossings), increasing horizon length, or compounding distractor structures. Dynamic curricula, produced on-the-fly based on performance, are supported (Stojanovski et al., 30 May 2025).
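As a rough illustration of how such parameterized difficulty scaling might be expressed (the field names and scaling constants below are assumptions, not parameters from any cited gym):

```python
from dataclasses import dataclass
import random

@dataclass
class TaskConfig:
    grid_size: int = 5        # state-space dimensionality (e.g., number of grid cells)
    num_actions: int = 4      # size of the action set
    horizon: int = 20         # episode length
    num_distractors: int = 0  # compounding distractor structures

def scale_difficulty(cfg: TaskConfig, level: float) -> TaskConfig:
    """Monotonically harden a task instance as `level` grows from 0 to 1."""
    return TaskConfig(
        grid_size=cfg.grid_size + int(10 * level),
        num_actions=cfg.num_actions + int(4 * level),
        horizon=cfg.horizon + int(80 * level),
        num_distractors=int(5 * level),
    )

def sample_instance(cfg: TaskConfig, seed: int) -> dict:
    """Procedurally generate a concrete task instance from the configuration."""
    rng = random.Random(seed)
    return {"start": (rng.randrange(cfg.grid_size), rng.randrange(cfg.grid_size)),
            "config": cfg}
```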
4. Evaluation Protocols, Metrics, and Tooling
Evaluation is structured around “episodes,” typically multi-step, terminating on goal completion, failure, or time exhaustion. Key metrics include:
| Metric | Definition and Significance |
|---|---|
| Success Rate | (# successful episodes) / (# episodes); measures final task completion |
| Cumulative Score | Sum or normalized reward across a trajectory; e.g., merged tiles in 2048 |
| Latency | Wall-clock or token-count per agent action; central for real-time compliance |
| Response Length | Tokens per output, correlated with reasoning depth but tied to time pressure |
| Rank/Accuracy | For model selection over candidate goals; e.g., recognition accuracy in goal recognition (GR) |
Normalization across heterogeneous games uses log transformation followed by min-max scaling to produce a “Capability Dimension Aggregated Mean” for model-to-model and task-to-task fairness (Shi et al., 20 May 2025). Controlled experiments systematically vary cognitive load, action set size, and time budgets. Visualization/debug utilities (e.g., trajectory overlays, Matplotlib plots, token-timestamp logs) are standard (Stojanovski et al., 30 May 2025, Matan et al., 27 Sep 2025).
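A plausible form of this aggregation is sketched below; the exact transformation used by KORGym may differ, and the function name is illustrative:

```python
import math

def normalize_scores(raw_scores_by_game: dict[str, dict[str, float]]) -> dict[str, float]:
    """Log-transform raw scores per game, min-max scale them to [0, 1],
    then average per model across games to obtain an aggregated capability score."""
    per_model: dict[str, list[float]] = {}
    for game, model_scores in raw_scores_by_game.items():
        logged = {m: math.log1p(s) for m, s in model_scores.items()}
        lo, hi = min(logged.values()), max(logged.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        for m, v in logged.items():
            per_model.setdefault(m, []).append((v - lo) / span)
    return {m: sum(vs) / len(vs) for m, vs in per_model.items()}
```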
Procedural metrics—such as deductive efficiency (first appearance of correct hypothesis), consistency (Spearman’s coefficient), and hopping penalty—are increasingly common in environments that require step-wise reasoning such as GameArena (Hu et al., 9 Dec 2024).
5. Methodological Innovations and Agent Paradigms
Agent paradigms in real-time reasoning gyms are often differentiated by their decision-time architecture:
- Reactive Agents: Produce a single action within strict per-step time or token budget, modeled as bounded policy rollout. Perform robustly when tasks are simple or time budgets are large, but degrade rapidly as complexity or time pressure increases.
- Planning Agents: Engage in open-ended or code-generative chain-of-thought reasoning to construct a multi-step plan, then execute sequentially. Excel in high-complexity/low time-pressure settings but are fragile under tight real-time constraints.
- Hybrid Agents (e.g., AgileThinker): Simultaneously run long-horizon planning and rapid reactive computation in parallel, dividing the per-step budget between a planning thread and a reactive thread and switching control at tick boundaries; a schematic sketch follows this list. This architecture maintains higher overall scores across both high-cognitive-load and limited-time regimes, as demonstrated in Freeway, Snake, and Overcooked, where AgileThinker remains robust as planning and reactive baselines both collapse under increased demands (Wen et al., 7 Nov 2025). The approach requires careful calibration of the budget split, typically set at a chosen percentile of reactive token usage.
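The following schematic shows one way the dual-threaded budget split could be orchestrated; the `HybridAgent` class, the thread-pool mechanics, and the default split are assumptions for illustration rather than AgileThinker's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

class HybridAgent:
    """Runs a planning thread and a reactive thread concurrently, splitting
    the per-step budget between them, and prefers the plan when it fits."""

    def __init__(self, reactive, planner, planning_fraction=0.7):
        self.reactive = reactive
        self.planner = planner
        self.planning_fraction = planning_fraction  # assumed split, tuned per task

    def act(self, state, budget):
        planning_budget = int(budget * self.planning_fraction)  # share for the planner
        reactive_budget = budget - planning_budget               # share for the fast thread
        with ThreadPoolExecutor(max_workers=2) as pool:
            plan_future = pool.submit(self.planner.act, state, planning_budget)
            react_future = pool.submit(self.reactive.act, state, reactive_budget)
            plan_action, plan_tokens = plan_future.result()
            react_action, react_tokens = react_future.result()
        # Prefer the deliberative plan when it respected its share of the budget;
        # otherwise fall back to the fast reactive answer.
        if plan_tokens <= planning_budget:
            return plan_action, plan_tokens + react_tokens
        return react_action, plan_tokens + react_tokens
```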
Policy architectures include direct sequence prediction, goal-conditioned RL, deep metric learning for goal recognition (GRAML (Matan et al., 27 Sep 2025)), and chain-of-thought prompting for VLMs (KnotGym, GPT-4.1-nano). For reinforcement learning agents, on-policy (e.g., PPO, DAPO, VAPO) and curriculum-based schemes are available (e.g., policy-gradient updates with reward shaping and trajectory sharing, as in KORGym and RG (Shi et al., 20 May 2025, Stojanovski et al., 30 May 2025)).
6. Empirical Findings and Benchmark Results
Empirical studies across high-diversity benchmarks report the following:
- Modality and Benchmark Breadth: Text-based agents typically outperform their visual-modality counterparts on the same game instance. Some closed-source VLMs (Gemini-2.5-pro, GPT-4o) demonstrate narrower modality-induced gaps (Shi et al., 20 May 2025).
- Model Family Consistency: Within-model family performance is highly conserved across tasks—so-called “strength–weakness profiles”—and variants fine-tuned for explicit “thinking” outperform those relying solely on instruction tuning (Shi et al., 20 May 2025).
- Real-Time Latency Effects: As token/time budgets shrink, planning-only architectures’ scores fall precipitously, while reactive agents plateau at lower-mean performance. Hybrid two-threaded agents (e.g., AgileThinker) achieve significant and statistically validated robustness under both increased complexity and severe time budgets (Wen et al., 7 Nov 2025).
- Response Length Correlations: A positive correlation is observed between response token count and normalized score, though saturation is evident beyond roughly 300 tokens per output (Shi et al., 20 May 2025).
- Reasoning Paradigm Ablation: Disabling explicit mathematical paradigms within LLM prompting induces the most substantial performance loss, especially for non-cutting-edge models. Stronger models display greater paradigm robustness (Shi et al., 20 May 2025).
- Task-Specific Difficulties and Generalization: Games requiring long-horizon planning, memory, or rapid stateful adaptation (e.g., Overcooked) expose weaknesses in current agents. In KnotGym, higher knot complexity (more crossings) dramatically degrades the performance of all methods save for advanced model-based agents (DreamerV3) (Chen et al., 23 May 2025).
7. Implications, Limitations, and Future Directions
Real-time reasoning gyms expose vital dimensions for next-generation AI systems:
- Temporal Generalization: They probe agents’ ability to balance depth of reasoning with responsiveness—a critical criterion for deployment in dynamic, human-facing, or physically situated environments.
- Knowledge Orthogonality: Platforms like KORGym systematically eliminate static factual dependencies, ensuring that agent success reflects genuine reasoning and adaptation, with near-zero reliance on pretrained facts.
- Scaling and Curriculum: On-the-fly procedural generation allows for unlimited difficulty scaling and supports curriculum learning or adaptation to agent performance in real time (Stojanovski et al., 30 May 2025); see the sketch after this list.
- Extensions: Challenges persist in incorporating true multi-agent reasoning (e.g., negotiation, adversarial play), human-in-the-loop evaluation for interpretability and difficulty calibration, and expanding benchmark domains via automated game generation—particularly important for out-of-distribution generalization.
- Practical Limitations: Real-time environments can underutilize advanced model interaction potential in purely zero-shot or single-turn settings. Current VLMs remain bottlenecked by perceptual latency and low-level control in highly granular environments (e.g., KnotGym).
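A minimal sketch of the performance-driven curriculum loop referenced above; the window size, target success rate, and step size are assumed parameters, not values from the cited work:

```python
from collections import deque

class AdaptiveCurriculum:
    """Adjusts a difficulty level in [0, 1] from a rolling window of episode outcomes."""

    def __init__(self, window=20, target_success=0.6, step=0.05):
        self.outcomes = deque(maxlen=window)  # 1.0 = success, 0.0 = failure
        self.target_success = target_success
        self.step = step
        self.level = 0.0

    def update(self, success: bool) -> float:
        self.outcomes.append(1.0 if success else 0.0)
        rate = sum(self.outcomes) / len(self.outcomes)
        # Harden tasks when the agent is comfortably above target, ease them otherwise.
        if rate > self.target_success:
            self.level = min(1.0, self.level + self.step)
        elif rate < self.target_success:
            self.level = max(0.0, self.level - self.step)
        return self.level
```

The returned level can be fed to a procedural generator (such as the `scale_difficulty` sketch above) after every episode, closing the loop between evaluation and task generation.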
A plausible implication is that future research must address the design and evaluation of agents that can fluidly interleave reactive and deliberative reasoning, learn from both procedural and chain-of-thought feedback, and adaptively calibrate resource commitment to task structure and environmental volatility.
In summary, the real-time reasoning gym construct formalizes the rigorous, temporally grounded evaluation of computational reasoning, offering a foundation for developing agents capable of logical and timely decision-making under evolving and uncertain conditions. These platforms now underpin much of the critical experimental work in RL, LLM evaluation, and multimodal reasoning, and continue to drive innovation in agent design and theory.