KAGE-Bench: Visual Generalization in RL

Updated 21 January 2026

KAGE-Bench is a diagnostic benchmark that isolates individual visual factors (e.g., backgrounds, filters, lighting) impacting pixel-based RL performance.
It systematically employs factorized rendering axes to reveal failure modes and measure generalization gaps using controlled train–eval configuration pairs.
Its vectorized, JAX-based implementation enables large-scale, reproducible experiments with high throughput, supporting robust RL research.

KAGE-Bench is a high-throughput visual generalization benchmark for pixel-based reinforcement learning (RL) that isolates individual visual factors impacting agent performance. Built atop the KAGE-Env JAX-native 2D platformer, it enables controlled, reproducible evaluation across a set of systematically factorized observation axes, providing precise diagnostics of visual robustness and failure modes under distribution shift (Cherepanov et al., 20 Jan 2026).

1. Formalization: KAGE-Env and Observation Factorization

KAGE-Env defines a latent Markov Decision Process (MDP) with state space $S$ (positions of platforms, agents, and non-player characters), discrete action space $A \equiv \{0, \ldots, 7\}$ (the eight on/off combinations of Left, Right, and Jump, encoded as a bitmask), and a fixed transition kernel $P(s'|s,a)$ implementing 2D platformer physics. The reward function

$r_t = \alpha_1 \cdot \max(0, x_{t+1} - x_t^{\max}) - \left[ \alpha_2\,\mathbb{I}_{\{\text{Jump}(a_t)\}} + \alpha_3 + \alpha_4\,\mathbb{I}_{\{\text{idling}(x_t, x_{t+1})\}} \right]$

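rewards new forward progress beyond $x_t^{\max}$, the furthest position reached so far, and penalizes jumping, idling, and each elapsed step (the constant $\alpha_3$). Episodes run for a fixed length of $T=500$ steps and terminate only by timeout.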
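The reward can be sketched in a few lines of Python. This is an illustration rather than the KAGE-Env implementation: the bit layout of the action mask, the coefficient values, and the idling test (no horizontal movement) are all assumptions, since the text above does not specify them.

```python
from dataclasses import dataclass

# Hypothetical bit layout for the 3-bit action mask; the text only says that
# actions 0..7 combine Left, Right, and Jump via bitmasking.
LEFT, RIGHT, JUMP = 1, 2, 4


@dataclass(frozen=True)
class RewardCoeffs:
    # Placeholder values, not taken from the paper.
    alpha1: float = 1.0   # weight on new forward progress
    alpha2: float = 0.1   # jump penalty
    alpha3: float = 0.01  # constant per-step cost
    alpha4: float = 0.1   # idling penalty


def decode_action(a: int) -> dict:
    """Split an action id in {0, ..., 7} into its three button flags."""
    if not 0 <= a <= 7:
        raise ValueError("action must be in 0..7")
    return {"left": bool(a & LEFT), "right": bool(a & RIGHT), "jump": bool(a & JUMP)}


def step_reward(x_prev: float, x_next: float, x_max: float, a: int,
                c: RewardCoeffs = RewardCoeffs()) -> float:
    """r_t = a1 * max(0, x_{t+1} - x_t^max) - [a2 * 1{jump} + a3 + a4 * 1{idle}]."""
    progress = max(0.0, x_next - x_max)
    jumped = decode_action(a)["jump"]
    idle = x_next == x_prev  # assumed idling test: no horizontal movement
    return c.alpha1 * progress - (c.alpha2 * jumped + c.alpha3 + c.alpha4 * idle)
```

Because the progress term is measured against $x_t^{\max}$ rather than $x_t$, moving back and forth over ground already covered earns nothing, while the per-step cost keeps accumulating.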

Observation rendering is parameterized by a configuration vector $\xi = (A_1, ..., A_K)$, where each element $A_k$ controls a specific visual axis (e.g., background, agent sprites, photometric filters). The rendered pixel observation $o \in \{0,\ldots,255\}^{H\times W \times 3}$ is produced by a deterministic or randomized function $o = f(s; A_1, ..., A_K)$. In the deterministic case the observation kernel is $O_\xi(o|s) = \delta(o - f(s; \xi))$. In either case, the axis controls affect only visual appearance and leave the latent dynamics and rewards unchanged.
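A toy renderer makes the factorization concrete. The sketch below is not KAGE-Env code: the three axes, the `RenderConfig` fields, the frame size, and the one-dimensional dynamics are hypothetical stand-ins chosen to show that $\xi$ enters only through $f$.

```python
from dataclasses import dataclass, replace

H, W = 4, 6  # tiny frame for illustration


@dataclass(frozen=True)
class RenderConfig:
    """Hypothetical xi with three axes: background, agent colour, brightness."""
    background: tuple = (0, 0, 0)
    agent_color: tuple = (255, 255, 255)
    brightness: float = 1.0


def render(state: tuple, xi: RenderConfig) -> list:
    """Deterministic f(s; xi): an H x W x 3 frame with values in 0..255."""
    row, col = state  # latent state: agent grid position only
    frame = [[list(xi.background) for _ in range(W)] for _ in range(H)]
    frame[row][col] = list(xi.agent_color)
    # Photometric filter applied last; clip to the valid 8-bit range.
    return [[[min(255, max(0, round(ch * xi.brightness))) for ch in px]
             for px in line] for line in frame]


def latent_step(state: tuple, dx: int) -> tuple:
    """Toy dynamics that never read xi, so visual axes cannot change them."""
    row, col = state
    return (row, min(W - 1, max(0, col + dx)))


xi_train = RenderConfig()
xi_eval = replace(xi_train, background=(40, 40, 200))  # shift one axis only

s = latent_step((2, 1), dx=1)
o_train, o_eval = render(s, xi_train), render(s, xi_eval)
```

Here `o_train` and `o_eval` differ in their background pixels while depicting the same latent state `s`, which is the situation a policy faces under a single-axis shift.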

A pixel policy $\pi(a|o)$ induces a state-conditional policy under visual parameterization $\xi$: $\pi_\xi(a|s) = \int_{\mathcal{O}} \pi(a|o)\,O_\xi(do|s)$. Running $\pi_\xi$ in the latent MDP $M = (S, A, P, r)$ therefore yields the same joint law on $(s_t, a_t)$ as running $\pi$ in $M_\xi = (S, A, P, r, O_\xi, ...)$. Because $P$ and $r$ are shared across configurations, any difference in expected return $J(\pi; M_\xi) - J(\pi; M_{\xi'})$ is attributable to the rendering change alone and quantifies the policy’s axis-specific visual generalization.
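The reduction can be written out in one step. This is a sketch that assumes an undiscounted return over the fixed horizon $T$ and an initial-state distribution $\rho_0$ shared by all configurations; $\rho_0$ is a symbol introduced here for illustration and does not appear in the source.

$J(\pi; M_\xi) = \mathbb{E}_{s_0 \sim \rho_0,\; a_t \sim \pi_\xi(\cdot|s_t),\; s_{t+1} \sim P(\cdot|s_t, a_t)}\left[ \sum_{t=0}^{T-1} r_t \right] = J(\pi_\xi; M)$

$J(\pi; M_\xi) - J(\pi; M_{\xi'}) = J(\pi_\xi; M) - J(\pi_{\xi'}; M)$

Both terms on the right are evaluated in the same latent MDP, so the gap measures only how far the induced policies $\pi_\xi$ and $\pi_{\xi'}$ differ in behavior.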

2. Known-Axis Benchmark Suite Construction

KAGE-Bench specifies six families of visual axes, each parameterizing distinct, independently controllable components of the rendering pipeline:

Agent Appearance: geometric shape (circle, line), color, or sprite skin.
Background: solid colors, random noise, curated images.
Distractors: animated or static extraneous objects (e.g., NPC skeletons, agent-shaped imposters).
Lighting/Effects: dynamic effects such as global/point lights, varying intensity or falloff parameters.
Filters: photometric operations (brightness, contrast, gamma, hue, Gaussian noise, pixelation, vignetting).
Layout: platform color palette changes (e.g., cyan palette to red palette).

For each axis, the benchmark enumerates a set of train–eval configuration pairs $(\xi^{\text{train}}, \xi^{\text{eval}})$, ensuring that only a single axis changes (all other rendering/dynamics parameters are fixed). In total, 34 such pairs are defined, enabling axis-resolved measurement of generalization gaps.
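The single-axis constraint is easy to enforce mechanically. The sketch below uses hypothetical axis names and values loosely modeled on the six families above; it is not the benchmark's actual configuration schema, and it builds only two of the 34 pairs.

```python
# Hypothetical axis names and default values, for illustration only.
BASE = {
    "agent": "circle",
    "background": "black",
    "distractors": "none",
    "lighting": "off",
    "filter": "none",
    "layout": "cyan",
}


def make_pair(axis: str, train_value: str, eval_value: str) -> tuple:
    """Build (xi_train, xi_eval) that differ on exactly one named axis."""
    if axis not in BASE:
        raise KeyError(f"unknown axis: {axis}")
    if train_value == eval_value:
        raise ValueError("train and eval values must differ")
    return {**BASE, axis: train_value}, {**BASE, axis: eval_value}


def changed_axes(xi_train: dict, xi_eval: dict) -> list:
    """List the axes on which two configurations disagree."""
    return [k for k in xi_train if xi_train[k] != xi_eval[k]]


pairs = [
    make_pair("background", "black", "noise"),
    make_pair("layout", "cyan", "red"),
]
# Every pair in a known-axis suite should isolate a single axis.
assert all(len(changed_axes(tr, ev)) == 1 for tr, ev in pairs)
```

A check like `changed_axes` is useful as a guard when new pairs are added, since a pair that silently shifts two axes would no longer support axis-level attribution.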

3. Experimental Protocol and Evaluation Metrics

The canonical agent is a PPO-CNN baseline from CleanRL, deployed with 128 parallel environments and batch size $16{,}384$, employing a 3-layer convolutional encoder and fully connected policy/value heads. Each configuration pair is evaluated over 10 random seeds, recording multiple trajectory-level and return-based metrics:

Maximum Return ($J$): peak mean episodic reward over all training checkpoints.
Passed Distance ($F_{\text{dist}}$): $x_T - x_0$, absolute horizontal progress.
Progress ($F_{\text{prog}}$): normalized as fraction of required course length.
Success Rate ($F_{\text{succ}}$): proportion of episodes completing the task ($F_{\text{dist}} \geq D$).

Generalization gap per metric is defined as the absolute or relative difference between train and evaluation performance (e.g., $\Delta\text{SR} = \text{SR}_\text{train} - \text{SR}_\text{eval}$ and $\Delta\text{Dist}\% = 100{\cdot}(\text{Dist}_\text{train}-\text{Dist}_\text{eval})/\text{Dist}_\text{train}$). The “maximum-over-training” protocol avoids checkpoint-selection artifacts by using the best checkpoint for each seed.
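The metrics and gaps above translate directly into code. The following sketch uses made-up episode distances, and its aggregation order for the maximum-over-training protocol (best checkpoint per seed, then mean across seeds) is one reading of the description rather than a confirmed detail.

```python
from statistics import mean


def progress(dist: float, course_len: float) -> float:
    """F_prog: distance as a fraction of the required course length D."""
    return dist / course_len


def success_rate(dists: list, course_len: float) -> float:
    """F_succ: share of episodes with F_dist >= D."""
    return sum(d >= course_len for d in dists) / len(dists)


def max_over_training(curves: list) -> float:
    """Best checkpoint per seed, then mean across seeds (assumed order).

    curves[seed][checkpoint] holds one metric value.
    """
    return mean(max(curve) for curve in curves)


def abs_gap(train: float, evaluation: float) -> float:
    """Absolute gap, e.g. Delta SR = SR_train - SR_eval."""
    return train - evaluation


def rel_gap_pct(train: float, evaluation: float) -> float:
    """Relative gap in percent, e.g. Delta Dist% = 100 * (train - eval) / train."""
    return 100.0 * (train - evaluation) / train


# Made-up numbers, for illustration only.
D = 100.0
train_dists = [100.0, 100.0, 100.0, 100.0]
eval_dists = [100.0, 80.0, 75.0, 85.0]

delta_dist_pct = rel_gap_pct(mean(train_dists), mean(eval_dists))
delta_sr = abs_gap(success_rate(train_dists, D), success_rate(eval_dists, D))
```

With these toy numbers the distance gap is 15% while the success-rate gap is 0.75, which shows how a small loss in distance can coexist with a large loss in task completion.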

4. Analysis of Visual Generalization and Failure Modes

Axis-level aggregation yields the following core findings:

Filters and Effects: Despite only moderate drops in forward motion ($\Delta\text{Dist}\approx12$–$21\%$), success rates collapse almost entirely ($\Delta\text{SR}\approx80$–$87\%$). Policies under photometric and lighting shifts often move forward but fail to complete the course, indicating that measuring only return or distance can mask severe failures.
Background Shifts: Both progress and success rate degrade substantially (gaps of 30.5% and 53.3%, respectively). For example, switching from a black to a noise background yields $\Delta\text{SR}=98.9\%$.
Distractors and Layout: Distance gaps are minor ($<5\%$), but completion rates drop sharply ($\Delta\text{SR}=31\%$ and $63\%$, respectively). Increasing “same-as-agent” distractors progressively degrades SR while distances stay near normal.
Agent Appearance: The least harmful axis; shape or color changes yield modest SR reductions ($21.1\%$), indicating that such perturbations are comparatively benign.

The benchmark exposes instances where return is decoupled from task completion, especially under strong photometric/lighting shift: agents traverse substantial fractions of the course but rarely succeed according to the defined criteria.

5. Vectorized JAX Implementation and Throughput

KAGE-Env is fully implemented in JAX (including both simulator dynamics and rendering). All randomness is managed via explicit PRNG keys, with environment interaction (reset, step) vectorized through jax.vmap, and multi-step rollouts via jax.lax.scan. This design enables single-call batched simulation of $2^{16} = 65536$ parallel environments on one modern GPU.
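The vectorization pattern described here can be illustrated with a toy environment. The sketch below is not the KAGE-Env API: `reset`, `step`, the scalar state, and the action bit layout are hypothetical, and random actions stand in for a policy. Only `jax.vmap`, `jax.lax.scan`, `jax.jit`, and the `jax.random` calls are real JAX functions.

```python
import jax
import jax.numpy as jnp

NUM_ENVS, NUM_STEPS, NUM_ACTIONS = 8, 5, 8  # tiny sizes for illustration


def reset(key):
    """Toy reset: the state is a scalar x position (hypothetical)."""
    return jax.random.uniform(key, (), minval=0.0, maxval=1.0)


def step(state, action):
    """Toy dynamics: bit 1 of the action moves right, bit 0 moves left."""
    dx = ((action >> 1) & 1) - (action & 1)
    new_state = state + dx.astype(jnp.float32)
    reward = jnp.maximum(0.0, new_state - state)
    return new_state, reward


@jax.jit
def rollout(key):
    reset_key, act_key = jax.random.split(key)
    # One PRNG key per environment; vmap batches reset over the leading axis.
    states = jax.vmap(reset)(jax.random.split(reset_key, NUM_ENVS))

    def body(carry, step_key):
        # Random actions stand in for a policy.
        actions = jax.random.randint(step_key, (NUM_ENVS,), 0, NUM_ACTIONS)
        next_states, rewards = jax.vmap(step)(carry, actions)
        return next_states, rewards

    # scan runs NUM_STEPS batched steps inside a single compiled call.
    step_keys = jax.random.split(act_key, NUM_STEPS)
    final_states, rewards = jax.lax.scan(body, states, step_keys)
    return final_states, rewards  # shapes: (NUM_ENVS,), (NUM_STEPS, NUM_ENVS)


final_states, rewards = rollout(jax.random.PRNGKey(0))
```

Because every source of randomness is derived from the key passed to `rollout`, calling it again with the same key reproduces the same trajectories, which is what makes multi-seed sweeps repeatable.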

Reported throughput peaks at $33\times10^6$ steps/sec on NVIDIA H100 hardware with all rendering axes combined. Configurations of comparable complexity consistently yield $10$–$30$ million steps/sec, enabling large-scale ablation studies and multi-seed sweeps in minutes.
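A quick back-of-the-envelope calculation shows what these figures mean in wall-clock terms. It assumes the peak rate applies at the $2^{16}$-environment batch size, which the text does not state explicitly.

```python
NUM_ENVS = 2 ** 16          # parallel environments, from the text
PEAK_STEPS_PER_SEC = 33e6   # reported peak throughput, from the text
EPISODE_LEN = 500           # episode length T, from the text

# Assumes the peak rate holds at this batch size (not stated in the text).
batched_step_ms = 1e3 * NUM_ENVS / PEAK_STEPS_PER_SEC
full_batch_episode_s = NUM_ENVS * EPISODE_LEN / PEAK_STEPS_PER_SEC

print(f"one batched step: {batched_step_ms:.2f} ms")
print(f"one full episode in every environment: {full_batch_episode_s:.2f} s")
```

Under that assumption a single batched step takes about 2 ms, and a complete 500-step episode in all 65,536 environments finishes in about one second.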

6. Applications, Limitations, and Future Directions

KAGE-Bench provides a diagnostic platform for:

Quantitative visual generalization analysis: Its axis-isolated pairs expose policy sensitivities that are otherwise conflated in existing benchmarks.
Ablation and robustness evaluation: The high simulation throughput supports systematic method comparisons and perturbation studies at scale.
Method development: Facilitates fair, reproducible testing of augmentation, invariance, and contrastive learning techniques within a factorized visual space.

Limitations:

The current suite uses a single 2D platforming scenario; tasks such as navigation or object manipulation remain unexplored.
Selection of axes is hand-designed; auto-discovery or continuous axis interpolation is not presently addressed.
Integrating state-of-the-art robust RL methods into the KAGE-Bench framework is ongoing.

A plausible implication is that KAGE-Bench's precise factorization of visual confounds and its scalable implementation may become a reference diagnostic standard for future pixel-based RL research (Cherepanov et al., 20 Jan 2026).

References (1)

Cherepanov et al. (2026). KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning.
