KAGE-Bench: Visual Generalization in RL
- KAGE-Bench is a diagnostic benchmark that isolates individual visual factors (e.g., backgrounds, filters, lighting) impacting pixel-based RL performance.
- It systematically employs factorized rendering axes to reveal failure modes and measure generalization gaps using controlled train–eval configuration pairs.
- Its vectorized, JAX-based implementation enables large-scale, reproducible experiments with high throughput, supporting robust RL research.
KAGE-Bench is a high-throughput visual generalization benchmark for pixel-based reinforcement learning (RL) that isolates individual visual factors impacting agent performance. Built atop the KAGE-Env JAX-native 2D platformer, it enables controlled, reproducible evaluation across a set of systematically factorized observation axes, providing precise diagnostics of visual robustness and failure modes under distribution shift (Cherepanov et al., 20 Jan 2026).
1. Formalization: KAGE-Env and Observation Factorization
KAGE-Env defines a latent Markov Decision Process (MDP) with state space $\mathcal{S}$ (positions of platforms, agents, and non-player characters), a discrete action space $\mathcal{A}$ with $|\mathcal{A}| = 8$ (the eight combinations of Left, Right, and Jump, encoded as a bitmask), and a fixed transition kernel $P(s' \mid s, a)$ implementing 2D platformer physics. The reward function $r$ rewards forward progress and penalizes jumping and idling. Episodes run for a fixed length, and timeouts are the only termination condition.
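As a minimal sketch of the bitmask action encoding described above, the following decodes an integer action into its three button flags. The bit assignment (`LEFT`, `RIGHT`, `JUMP`) and the helper names are assumptions for illustration; the source states only that the eight actions are bitmask combinations of the three buttons.

```python
from typing import NamedTuple

# Hypothetical bit assignment; the actual ordering in KAGE-Env may differ.
LEFT, RIGHT, JUMP = 1, 2, 4


class Buttons(NamedTuple):
    left: bool
    right: bool
    jump: bool


def decode_action(action: int) -> Buttons:
    """Decode an integer action in [0, 7] into its three button flags."""
    if not 0 <= action <= 7:
        raise ValueError(f"action must be in [0, 7], got {action}")
    return Buttons(bool(action & LEFT), bool(action & RIGHT), bool(action & JUMP))


def encode_action(left: bool, right: bool, jump: bool) -> int:
    """Pack three button flags back into a single integer action."""
    return (LEFT if left else 0) | (RIGHT if right else 0) | (JUMP if jump else 0)


# Enumerate all eight actions and the buttons each one presses.
for a in range(8):
    print(a, decode_action(a))
```

Under this assumed assignment, action 6 presses Right and Jump together, and action 0 is the idle action.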
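Observation rendering is parameterized by a configuration vector $\xi$, each element of which controls a specific visual axis (e.g., background, agent sprites, photometric filters). The rendered pixel observation $o$ is produced by a deterministic or randomized rendering function of the latent state $s$ and $\xi$. The resulting observation kernel is $O_\xi(o \mid s)$, so all axis controls affect only visual appearance, without altering latent dynamics or rewards.
A pixel policy $\pi(a \mid o)$ induces a state-conditional policy under visual parameterization $\xi$: $\pi_\xi(a \mid s) = \int \pi(a \mid o)\, O_\xi(\mathrm{d}o \mid s)$. Running $\pi$ on observations rendered under $\xi$ thus yields the same joint law on latent state-action trajectories as running $\pi_\xi$ directly in the latent MDP, and a shift from $\xi_{\text{train}}$ to $\xi_{\text{eval}}$ amounts to replacing $\pi_{\xi_{\text{train}}}$ with $\pi_{\xi_{\text{eval}}}$ in that MDP. Therefore, any difference in expected return quantifies the policy's axis-specific visual generalization.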
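The induced state-conditional policy can be illustrated for a finite toy case, where marginalizing the pixel policy over the observation kernel becomes a plain sum. This is a sketch under stated assumptions: the function name, the dictionary representation, and all probabilities are hypothetical and not taken from the benchmark.

```python
from typing import Dict

Dist = Dict[str, float]  # maps an outcome to its probability


def induced_state_policy(pixel_policy: Dict[str, Dist],
                         obs_kernel: Dict[str, Dist]) -> Dict[str, Dist]:
    """Compute sum_o pixel_policy(a | o) * obs_kernel(o | s) for every state s."""
    induced: Dict[str, Dist] = {}
    for state, obs_dist in obs_kernel.items():
        action_probs: Dist = {}
        for obs, p_obs in obs_dist.items():
            for action, p_act in pixel_policy[obs].items():
                action_probs[action] = action_probs.get(action, 0.0) + p_obs * p_act
        induced[state] = action_probs
    return induced


# Toy pixel policy: moves right on a "clean" frame, mostly idles on a "noisy" one.
pixel_policy = {
    "clean": {"right": 0.9, "idle": 0.1},
    "noisy": {"right": 0.2, "idle": 0.8},
}
# Two renderings of the same latent state "s0" (illustrative kernels).
train_kernel = {"s0": {"clean": 1.0}}
eval_kernel = {"s0": {"clean": 0.25, "noisy": 0.75}}

pi_train = induced_state_policy(pixel_policy, train_kernel)
pi_eval = induced_state_policy(pixel_policy, eval_kernel)
print(pi_train["s0"], pi_eval["s0"])
```

The latent state is identical in both cases; only the rendering differs, yet the induced action distribution at that state changes. That change is what the return gap between the two configurations measures.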
2. Known-Axis Benchmark Suite Construction
KAGE-Bench specifies six families of visual axes, each parameterizing distinct, independently controllable components of the rendering pipeline:
- Agent Appearance: geometric shape (circle, line), color, or sprite skin.
- Background: solid colors, random noise, curated images.
- Distractors: animated or static extraneous objects (e.g., NPC skeletons, agent-shaped imposters).
- Lighting/Effects: dynamic effects such as global/point lights, varying intensity or falloff parameters.
- Filters: photometric operations (brightness, contrast, gamma, hue, Gaussian noise, pixelation, vignetting).
- Layout: platform color palette changes (e.g., cyan palette to red palette).
For each axis, the benchmark enumerates a set of train–eval pairs $(\xi_{\text{train}}, \xi_{\text{eval}})$ of rendering configurations, ensuring that only a single axis changes while all other rendering and dynamics parameters stay fixed. In total, 34 such pairs are defined, enabling axis-resolved measurement of generalization gaps.
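The single-axis constraint on train–eval pairs can be expressed as a small check. The sketch below assumes a flat configuration with one field per axis family; the class `RenderConfig`, its field names, and its default values are hypothetical and do not reflect the benchmark's actual configuration schema.

```python
from dataclasses import dataclass, fields, replace
from typing import List, Tuple


@dataclass(frozen=True)
class RenderConfig:
    # Hypothetical fields, one per axis family named in the text.
    agent: str = "circle"
    background: str = "black"
    distractors: int = 0
    lighting: str = "none"
    photometric_filter: str = "none"
    layout: str = "cyan"


def changed_axes(train: RenderConfig, evaluation: RenderConfig) -> List[str]:
    """Return the names of all fields that differ between the two configs."""
    return [f.name for f in fields(train)
            if getattr(train, f.name) != getattr(evaluation, f.name)]


def make_pair(train: RenderConfig, **shift) -> Tuple[RenderConfig, RenderConfig]:
    """Build an eval config from `train` and verify exactly one axis changed."""
    evaluation = replace(train, **shift)
    if len(changed_axes(train, evaluation)) != 1:
        raise ValueError("a known-axis pair must change exactly one axis")
    return train, evaluation


train_cfg, eval_cfg = make_pair(RenderConfig(), background="noise")
print(changed_axes(train_cfg, eval_cfg))
```

Rejecting pairs that change zero or several axes keeps every measured gap attributable to one visual factor.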
3. Experimental Protocol and Evaluation Metrics
The canonical agent is a PPO-CNN baseline from CleanRL, run with 128 parallel environments and built from a 3-layer convolutional encoder with fully connected policy and value heads. Each configuration pair is evaluated over 10 random seeds, recording multiple trajectory-level and return-based metrics:
- Maximum Return: peak mean episodic reward over all training checkpoints.
- Passed Distance: absolute horizontal progress.
- Progress: passed distance normalized as a fraction of the required course length.
- Success Rate (SR): proportion of episodes that complete the task.
Generalization gap per metric is defined as the absolute or relative difference between train and evaluation performance (e.g., $\Delta M = M_{\text{train}} - M_{\text{eval}}$ and $\Delta M / M_{\text{train}}$ for a metric $M$). The “maximum-over-training” protocol avoids checkpoint-selection artifacts by using the best checkpoint for each seed.
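A minimal sketch of the gap computation under the maximum-over-training protocol follows. It assumes the best checkpoint is selected independently for the train and eval curves of each seed and that per-seed maxima are then averaged; the function names and all numbers are illustrative, not results from the benchmark.

```python
from statistics import mean
from typing import Sequence, Tuple

Curves = Sequence[Sequence[float]]  # curves[seed][checkpoint] -> metric value


def max_over_training(curves: Curves) -> float:
    """Average over seeds of each seed's best checkpoint value."""
    return mean(max(curve) for curve in curves)


def generalization_gap(train_curves: Curves, eval_curves: Curves) -> Tuple[float, float]:
    """Return (absolute gap, relative gap) between train and eval performance."""
    train_best = max_over_training(train_curves)
    eval_best = max_over_training(eval_curves)
    absolute = train_best - eval_best
    relative = absolute / train_best if train_best != 0 else float("nan")
    return absolute, relative


# Illustrative success-rate curves: 2 seeds x 3 checkpoints (made-up values).
train_sr = [[0.2, 0.8, 0.9], [0.1, 0.7, 0.7]]
eval_sr = [[0.1, 0.5, 0.3], [0.0, 0.3, 0.2]]

abs_gap, rel_gap = generalization_gap(train_sr, eval_sr)
print(f"absolute gap: {abs_gap:.2f}, relative gap: {rel_gap:.2f}")
```

With these made-up curves the per-seed maxima average to 0.8 on train and 0.4 on eval, giving an absolute gap of 0.4 and a relative gap of 0.5.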
4. Analysis of Visual Generalization and Failure Modes
Axis-level aggregation yields the following core findings:
- Filters and Effects: Despite only moderate drops in forward motion, success rates collapse almost entirely. Policies under photometric and lighting shifts often move forward but fail to complete the course, indicating that measuring only return or distance can mask severe failures.
- Background Shifts: Large degradations in both progress (30.5%) and success rate (53.3%). For example, switching from a black to a noise background induces a drop.
- Distractors and Layout: Minor distance gaps (5%) but major drops in completion rate. Increasing the number of “same-as-agent” distractors incrementally degrades SR while maintaining near-normal distances.
- Agent Appearance: The least harmful axis; shape or color changes yield modest SR reductions, confirming such perturbations are comparatively benign.
The benchmark exposes instances where return is decoupled from task completion, especially under strong photometric/lighting shift: agents traverse substantial fractions of the course but rarely succeed according to the defined criteria.
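The decoupling of progress from completion can be made concrete with a toy calculation. The sketch assumes an episode succeeds only when the passed distance reaches the full course length and that progress is clipped to the unit interval; both choices, the function names, and the distances are hypothetical.

```python
from typing import Sequence, Tuple


def progress(distance: float, course_length: float) -> float:
    """Fraction of the course covered, clipped to [0, 1]."""
    return min(max(distance / course_length, 0.0), 1.0)


def summarize(distances: Sequence[float], course_length: float) -> Tuple[float, float]:
    """Return (mean progress, success rate) over a batch of episodes."""
    mean_progress = sum(progress(d, course_length) for d in distances) / len(distances)
    # Assumed success criterion: the agent covers the whole course.
    success_rate = sum(d >= course_length for d in distances) / len(distances)
    return mean_progress, success_rate


# Illustrative final distances on a course of length 100 (made-up values).
in_distribution = [100.0, 100.0, 95.0, 100.0]
shifted = [80.0, 90.0, 85.0, 75.0]

print(summarize(in_distribution, 100.0))
print(summarize(shifted, 100.0))
```

In this toy batch the shifted agent still covers over 80% of the course on average while its success rate is zero, which is the pattern that distance-only or return-only reporting would hide.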
5. Vectorized JAX Implementation and Throughput
KAGE-Env is fully implemented in JAX (including both simulator dynamics and rendering). All randomness is managed via explicit PRNG keys, with environment interaction (reset, step) vectorized through jax.vmap, and multi-step rollouts via jax.lax.scan. This design enables single-call batched simulation of parallel environments on one modern GPU.
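The vectorization pattern described above can be sketched with a toy environment. The sketch uses only standard JAX calls (`jax.random.split`, `jax.vmap`, `jax.lax.scan`, `jax.jit`); the `reset`, `step`, and `rollout` functions, the one-dimensional state, and the action encoding are hypothetical stand-ins, and rendering is omitted entirely.

```python
import functools

import jax
import jax.numpy as jnp


def reset(key):
    """Toy latent state: horizontal position with a random start in [0, 1)."""
    return {"x": jax.random.uniform(key, (), minval=0.0, maxval=1.0)}


def step(state, action):
    """Toy dynamics: bit 0 moves left, bit 1 moves right (assumed encoding)."""
    dx = (((action >> 1) & 1) - (action & 1)).astype(jnp.float32)
    return {"x": state["x"] + dx}, dx  # reward equals forward displacement


def rollout(key, num_steps):
    """Roll out one environment with random actions via jax.lax.scan."""
    reset_key, action_key = jax.random.split(key)
    state = reset(reset_key)
    actions = jax.random.randint(action_key, (num_steps,), 0, 8)
    final_state, rewards = jax.lax.scan(step, state, actions)
    return final_state["x"], rewards.sum()


@functools.partial(jax.jit, static_argnums=(1, 2))
def batched_rollout(key, num_envs, num_steps):
    """Run num_envs independent rollouts in a single compiled call."""
    keys = jax.random.split(key, num_envs)  # one explicit PRNG key per environment
    return jax.vmap(lambda k: rollout(k, num_steps))(keys)


final_x, returns = batched_rollout(jax.random.PRNGKey(0), 1024, 64)
print(final_x.shape, returns.shape)
```

Because every source of randomness flows from the key passed in, calling `batched_rollout` twice with the same key reproduces the same trajectories, which is the property that makes multi-seed sweeps reproducible.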
Performance measurements on NVIDIA H100 hardware report peak sustained throughput with all rendering axes combined, and configurations of comparable complexity consistently yield $10$–$30$ million steps/sec, enabling large-scale ablation studies and multi-seed sweeps in minutes.
6. Applications, Limitations, and Future Directions
KAGE-Bench provides a uniquely diagnostic platform for:
- Quantitative visual generalization analysis: Its axis-isolated pairs expose policy sensitivities that are otherwise conflated in existing benchmarks.
- Ablation and robustness evaluation: The high simulation throughput supports systematic method comparisons and perturbation studies at scale.
- Method development: Facilitates fair, reproducible testing of augmentation, invariance, and contrastive learning techniques within a factorized visual space.
Limitations:
- The current suite uses a single 2D platforming scenario; tasks such as navigation or object manipulation remain unexplored.
- Selection of axes is hand-designed; auto-discovery or continuous axis interpolation is not presently addressed.
- Integrating state-of-the-art robust RL methods into the KAGE-Bench framework is ongoing.
A plausible implication is that KAGE-Bench's precise factorization of visual confounds and its scalable implementation may become a reference diagnostic standard for future pixel-based RL research (Cherepanov et al., 20 Jan 2026).