KAGE-Bench: Visual Generalization in RL

Updated 21 January 2026
  • KAGE-Bench is a diagnostic benchmark that isolates individual visual factors (e.g., backgrounds, filters, lighting) impacting pixel-based RL performance.
  • It systematically employs factorized rendering axes to reveal failure modes and measure generalization gaps using controlled train–eval configuration pairs.
  • Its vectorized, JAX-based implementation enables large-scale, reproducible experiments with high throughput, supporting robust RL research.

KAGE-Bench is a high-throughput visual generalization benchmark for pixel-based reinforcement learning (RL) that isolates individual visual factors impacting agent performance. Built atop the KAGE-Env JAX-native 2D platformer, it enables controlled, reproducible evaluation across a set of systematically factorized observation axes, providing precise diagnostics of visual robustness and failure modes under distribution shift (Cherepanov et al., 20 Jan 2026).

1. Formalization: KAGE-Env and Observation Factorization

KAGE-Env defines a latent Markov Decision Process (MDP) with state space $S$ (positions of platforms, agents, and non-player characters), discrete action space $A \equiv \{0, \ldots, 7\}$ (the 8 combinations of Left, Right, and Jump, encoded via bitmasking), and a fixed transition kernel $P(s'|s,a)$ implementing 2D platformer physics. The reward function

$$r_t = \alpha_1 \cdot \max(0, x_{t+1} - x_t^{\max}) - \left[ \alpha_2\,\mathbb{I}_{\{\text{Jump}(a_t)\}} + \alpha_3 + \alpha_4\,\mathbb{I}_{\{\text{idling}(x_t, x_{t+1})\}} \right]$$

rewards forward progress, penalizes jumping and idling, and applies a constant per-step cost $\alpha_3$. Episodes have a fixed length $T=500$ and terminate only by timeout.
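The reward above can be sketched in JAX as follows. This is a minimal illustration, not the benchmark's code: the coefficient values, the `idle_eps` threshold, and the reading of "idling" as a near-zero change in horizontal position are all assumptions made for the example.

```python
import jax.numpy as jnp

def reward(x_t, x_next, x_max, jumped,
           alphas=(1.0, 0.1, 0.01, 0.05), idle_eps=1e-3):
    """Sketch of r_t; alphas and idle_eps are hypothetical values."""
    a1, a2, a3, a4 = alphas
    x_t = jnp.asarray(x_t, dtype=jnp.float32)
    x_next = jnp.asarray(x_next, dtype=jnp.float32)
    x_max = jnp.asarray(x_max, dtype=jnp.float32)
    # Progress term: only movement beyond the furthest x reached so far counts.
    progress = jnp.maximum(0.0, x_next - x_max)
    # Indicator of a jump action (boolean -> 0/1).
    jump_ind = jnp.asarray(jumped, dtype=jnp.float32)
    # Assumed idling test: horizontal position barely changed.
    idle_ind = (jnp.abs(x_next - x_t) < idle_eps).astype(jnp.float32)
    penalty = a2 * jump_ind + a3 + a4 * idle_ind
    return a1 * progress - penalty

# Moving from x=1.0 to x=2.0 with previous best 1.5, while jumping.
r_move = reward(1.0, 2.0, 1.5, True)
# Standing still behind the previous best, without jumping.
r_idle = reward(1.0, 1.0, 1.5, False)
```

Because the progress term is measured against the running maximum $x_t^{\max}$, moving back and forth over already-covered ground earns nothing while still paying the per-step cost.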

Observation rendering is parameterized by a configuration vector $\xi = (A_1, ..., A_K)$, each element $A_k$ controlling a specific visual axis (e.g., background, agent sprites, photometric filters). The rendered pixel observation $o \in \{0,\ldots,255\}^{H\times W \times 3}$ is produced by a deterministic or randomized function $o = f(s; A_1, ..., A_K)$. In the deterministic case the observation kernel is $O_\xi(o|s) = \delta(o - f(s; \xi))$, ensuring that all axis controls affect only visual appearance, without altering latent dynamics or rewards.
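To make the factorization concrete, here is a toy deterministic renderer in JAX. It is only a sketch under stated assumptions: the latent state is reduced to an agent pixel position, and the configuration keys `background_rgb`, `agent_rgb`, and `brightness` are hypothetical stand-ins for the benchmark's axes.

```python
import jax.numpy as jnp

H, W = 32, 32  # toy resolution, not the benchmark's

def render(state, xi):
    """Toy f(s; xi): draws a 3x3 agent at pixel (x, y) on a solid background."""
    x, y = state
    rows = jnp.arange(H)[:, None]
    cols = jnp.arange(W)[None, :]
    agent_mask = (jnp.abs(rows - y) <= 1) & (jnp.abs(cols - x) <= 1)
    bg = jnp.asarray(xi["background_rgb"], dtype=jnp.float32)
    agent = jnp.asarray(xi["agent_rgb"], dtype=jnp.float32)
    # Broadcast the (H, W, 1) mask against the (3,) colors -> (H, W, 3).
    img = jnp.where(agent_mask[..., None], agent, bg)
    # A simple photometric filter axis.
    img = img * xi["brightness"]
    return jnp.clip(img, 0, 255).astype(jnp.uint8)

state = (10, 5)  # the same latent state under both configurations
xi_train = {"background_rgb": (0, 0, 0), "agent_rgb": (255, 0, 0), "brightness": 1.0}
xi_eval = {"background_rgb": (40, 40, 40), "agent_rgb": (255, 0, 0), "brightness": 1.0}

obs_train = render(state, xi_train)
obs_eval = render(state, xi_eval)
```

The two observations differ only in background pixels, while `state` is untouched: the configuration enters the renderer and nothing else, which is the property the observation kernel formalizes.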

A pixel policy $\pi(a|o)$ induces a state-conditional policy under visual parameterization $\xi$: $\pi_\xi(a|s) = \int_{\mathcal{O}} \pi(a|o)\,O_\xi(do|s)$, and hence the joint law on $(s_t, a_t)$ is preserved under the shift from $M_\xi = (S, A, P, r, O_\xi, ...)$ to $M = (S, A, P, r)$ by running $\pi_\xi$ in the latter. Therefore, any difference in expected return $J(\pi; M_\xi) - J(\pi; M_{\xi'})$ quantifies the policy’s axis-specific visual generalization.

2. Known-Axis Benchmark Suite Construction

KAGE-Bench specifies six families of visual axes, each parameterizing distinct, independently controllable components of the rendering pipeline:

  1. Agent Appearance: geometric shape (circle, line), color, or sprite skin.
  2. Background: solid colors, random noise, curated images.
  3. Distractors: animated or static extraneous objects (e.g., NPC skeletons, agent-shaped imposters).
  4. Lighting/Effects: dynamic effects such as global/point lights, varying intensity or falloff parameters.
  5. Filters: photometric operations (brightness, contrast, gamma, hue, Gaussian noise, pixelation, vignetting).
  6. Layout: platform color palette changes (e.g., cyan palette to red palette).

For each axis, the benchmark enumerates a set of train–eval configuration pairs $(\xi^{\text{train}}, \xi^{\text{eval}})$, ensuring that only a single axis changes (all other rendering/dynamics parameters are fixed). In total, 34 such pairs are defined, enabling axis-resolved measurement of generalization gaps.
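The single-axis constraint on a train–eval pair is easy to check mechanically. The sketch below assumes, purely for illustration, that a configuration is a flat dictionary with one entry per axis family; the keys and values are hypothetical and do not reflect the benchmark's actual configuration schema.

```python
def changed_axes(xi_train, xi_eval):
    """Return the sorted names of axes whose values differ between two configs."""
    if xi_train.keys() != xi_eval.keys():
        raise ValueError("configurations must define the same axes")
    return sorted(k for k in xi_train if xi_train[k] != xi_eval[k])

def is_single_axis_pair(xi_train, xi_eval):
    """True when exactly one axis changes between train and eval."""
    return len(changed_axes(xi_train, xi_eval)) == 1

# Hypothetical configurations, one entry per axis family.
xi_train = {
    "agent": "circle_red",
    "background": "black",
    "distractors": 0,
    "lighting": "none",
    "filters": "none",
    "layout": "cyan",
}
xi_eval = dict(xi_train, background="noise")        # only the background shifts
xi_mixed = dict(xi_eval, filters="gaussian_noise")  # two axes shift at once

valid = is_single_axis_pair(xi_train, xi_eval)
invalid = is_single_axis_pair(xi_train, xi_mixed)
```

A pair like `xi_mixed` would confound two sources of shift, so any measured gap could not be attributed to one axis.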

3. Experimental Protocol and Evaluation Metrics

The canonical agent is a PPO-CNN baseline from CleanRL, deployed with 128 parallel environments and a batch size of 16,384, employing a 3-layer convolutional encoder and fully connected policy/value heads. Each configuration pair is evaluated over 10 random seeds, recording multiple trajectory-level and return-based metrics:

  • Maximum Return ($J$): peak mean episodic reward over all training checkpoints.
  • Passed Distance ($F_{\text{dist}}$): $x_T - x_0$, absolute horizontal progress.
  • Progress ($F_{\text{prog}}$): normalized as fraction of required course length.
  • Success Rate ($F_{\text{succ}}$): proportion of episodes completing the task ($F_{\text{dist}} \geq D$).

The generalization gap for each metric is defined as the absolute or relative difference between train and evaluation performance (e.g., $\Delta\text{SR} = \text{SR}_\text{train} - \text{SR}_\text{eval}$ and $\Delta\text{Dist}\% = 100{\cdot}(\text{Dist}_\text{train}-\text{Dist}_\text{eval})/\text{Dist}_\text{train}$). The “maximum-over-training” protocol avoids checkpoint-selection artifacts by using the best checkpoint for each seed.
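As a worked example of these gap definitions, the sketch below takes the per-seed maximum over checkpoints and then averages across seeds before differencing. That aggregation order is an assumption made for illustration, as are the array layout (seeds × checkpoints) and the toy numbers.

```python
import jax.numpy as jnp

def best_checkpoint_mean(metric):
    """Per-seed maximum over checkpoints, then mean over seeds (assumed order)."""
    metric = jnp.asarray(metric, dtype=jnp.float32)  # shape (seeds, checkpoints)
    return jnp.mean(jnp.max(metric, axis=1))

def absolute_gap(train_metric, eval_metric):
    """E.g. delta SR = SR_train - SR_eval."""
    return best_checkpoint_mean(train_metric) - best_checkpoint_mean(eval_metric)

def relative_gap_percent(train_metric, eval_metric):
    """E.g. delta Dist% = 100 * (Dist_train - Dist_eval) / Dist_train."""
    train = best_checkpoint_mean(train_metric)
    return 100.0 * (train - best_checkpoint_mean(eval_metric)) / train

# Toy data: 2 seeds x 2 checkpoints (made-up numbers).
sr_train = [[0.5, 1.0], [0.8, 0.6]]
sr_eval = [[0.1, 0.2], [0.0, 0.4]]
dist_train = [[100.0, 200.0], [150.0, 200.0]]
dist_eval = [[100.0, 150.0], [170.0, 120.0]]

delta_sr = absolute_gap(sr_train, sr_eval)
delta_dist_pct = relative_gap_percent(dist_train, dist_eval)
```

With these toy numbers the success-rate gap is 0.9 − 0.3 = 0.6 and the relative distance gap is 100 · (200 − 160) / 200 = 20%.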

4. Analysis of Visual Generalization and Failure Modes

Axis-level aggregation yields the following core findings:

  • Filters and Effects: Despite only moderate drops in forward motion ($\Delta\text{Dist}\approx12$–$21\%$), success rates collapse almost entirely ($\Delta\text{SR}\approx80$–$87\%$). Policies under photometric and lighting shifts often move forward but fail to complete the course, indicating that measuring only return or distance can mask severe failures.
  • Background Shifts: Large degradations in both progress (30.5%) and success rate (53.3%). For example, switching from a black to a noise background induces a $\Delta\text{SR}=98.9\%$ drop.
  • Distractors and Layout: Minor distance gaps ($<$5%) but major drops in completion rate ($\Delta\text{SR}=31\%,\,63\%$). Increasing “same-as-agent” distractors incrementally degrades SR while maintaining near-normal distances.
  • Agent Appearance: The least harmful axis; shape or color changes yield modest SR reductions ($21.1\%$), confirming such perturbations are comparatively benign.

The benchmark exposes instances where return is decoupled from task completion, especially under strong photometric/lighting shift: agents traverse substantial fractions of the course but rarely succeed according to the defined criteria.

5. Vectorized JAX Implementation and Throughput

KAGE-Env is fully implemented in JAX (including both simulator dynamics and rendering). All randomness is managed via explicit PRNG keys, with environment interaction (reset, step) vectorized through jax.vmap, and multi-step rollouts via jax.lax.scan. This design enables single-call batched simulation of $2^{16} = 65536$ parallel environments on one modern GPU.
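The vmap-plus-scan pattern described above can be illustrated with a toy environment. The sketch below is not KAGE-Env's API: `reset`, `step`, the one-dimensional state, the random policy, and the particular bit assignment for Left and Right are all hypothetical, chosen only to show how explicit PRNG keys, `jax.lax.scan`, and `jax.vmap` fit together.

```python
from functools import partial

import jax
import jax.numpy as jnp

def reset(key):
    # Toy latent state: a scalar x-position with a random start.
    return jax.random.uniform(key, (), minval=0.0, maxval=1.0)

def step(x, action):
    # Assumed bitmask: bit 0 = Left, bit 1 = Right (bit 2, Jump, is ignored here).
    left = action & 1
    right = (action >> 1) & 1
    x_next = x + (right - left).astype(jnp.float32)
    reward = jnp.maximum(0.0, x_next - x)
    return x_next, reward

def rollout(key, num_steps):
    reset_key, policy_key = jax.random.split(key)
    x0 = reset(reset_key)
    # Random policy over the 8 discrete actions, driven by an explicit key.
    actions = jax.random.randint(policy_key, (num_steps,), 0, 8)
    # lax.scan threads the state through time and stacks per-step rewards.
    x_final, rewards = jax.lax.scan(step, x0, actions)
    return x_final, rewards.sum()

@partial(jax.jit, static_argnames="num_steps")
def batched_rollout(keys, num_steps):
    # vmap maps the single-environment rollout over a batch of keys.
    return jax.vmap(lambda k: rollout(k, num_steps))(keys)

keys = jax.random.split(jax.random.PRNGKey(0), 1024)
x_final, returns = batched_rollout(keys, num_steps=50)
```

Since every source of randomness is derived from `keys`, rerunning `batched_rollout` with the same keys reproduces the same trajectories, and the whole batch compiles into a single jitted call. The batch size here is kept small for illustration.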

Performance measurements report peak throughput of up to $33\times10^6$ steps/sec on NVIDIA H100 hardware, with all rendering axes combined. Configurations of comparable complexity consistently yield $10$–$30$ million steps/sec, enabling large-scale ablation studies and multi-seed sweeps in minutes.

6. Applications, Limitations, and Future Directions

KAGE-Bench provides a uniquely diagnostic platform for:

  • Quantitative visual generalization analysis: Its axis-isolated pairs expose policy sensitivities that are otherwise conflated in existing benchmarks.
  • Ablation and robustness evaluation: The high simulation throughput supports systematic method comparisons and perturbation studies at scale.
  • Method development: Facilitates fair, reproducible testing of augmentation, invariance, and contrastive learning techniques within a factorized visual space.

Limitations:

  • The current suite uses a single 2D platforming scenario; tasks such as navigation or object manipulation remain unexplored.
  • Selection of axes is hand-designed; auto-discovery or continuous axis interpolation is not presently addressed.
  • Integrating state-of-the-art robust RL methods into the KAGE-Bench framework is ongoing.

A plausible implication is that KAGE-Bench's precise factorization of visual confounds and its scalable implementation may become a reference diagnostic standard for future pixel-based RL research (Cherepanov et al., 20 Jan 2026).
