
KAGE-Env: JAX Testbed for Visual RL

Updated 21 January 2026
  • KAGE-Env is a specialized 2D platformer environment that factors the visual observation process into independent axes for precise analysis of RL agents.
  • It features six axis-isolated benchmark suites with 34 train–evaluation pairs to systematically diagnose failure modes under varied visual perturbations.
  • Its efficient JAX-based implementation enables parallel simulation of up to 2^16 environments, achieving high throughput on modern GPUs for scalable research.

KAGE-Env is a JAX-native 2D side-scrolling platformer environment purpose-built to enable rigorous and disentangled evaluation of visual generalization in pixel-based reinforcement learning (RL). It exposes a formal and practically implemented factorization of the observation process into independently controllable visual axes, enabling systematic study of visual distribution shifts while holding all underlying control and reward dynamics fixed. The associated benchmark, KAGE-Bench, defines six axis-isolated suites comprising 34 precisely constructed train–evaluation configuration pairs, making KAGE-Env a canonical testbed for diagnosing and quantifying failure modes of current RL algorithms under controlled visual perturbations (Cherepanov et al., 20 Jan 2026).

1. Observation Process Factorization and Environment Formalism

KAGE-Env implements a parameterized family of visual partially observable Markov decision processes (POMDPs), $\mathcal{M}_\xi = (\mathcal{S}, \mathcal{A}, P, r, \Omega, O_\xi, \rho_0, \gamma)$, where:

  • $\mathcal{S}$: latent physics state (e.g., agent position, velocity),
  • $\mathcal{A} = \{0, \ldots, 7\}$: 3-bit discrete action space (bit flags for Left/Right/Jump),
  • $P(s_{t+1} \mid s_t, a_t)$ and $r(s_t, a_t)$: fixed, axis-invariant dynamics and reward,
  • $\Omega = \{0, \ldots, 255\}^{H \times W \times 3}$: RGB observation images,
  • $O_\xi(o \mid s)$: observation kernel (renderer) parameterized by visual configuration $\xi$.

The crucial design property is that only $O_\xi$ depends on $\xi$, while $P$ and $r$ remain fixed. $O_\xi$ is further architected as a composition of orthogonal, independently configurable "visual axes," each corresponding to a deterministic or stochastic function in the JAX rendering pipeline:

$$o = O_\xi(s), \qquad O_\xi = f_{\mathrm{background}}(\,\cdot\,; \xi_{\mathrm{bg}}) \circ f_{\mathrm{character}}(\,\cdot\,; \xi_{\mathrm{ch}}) \circ f_{\mathrm{npc}}(\,\cdot\,; \xi_{\mathrm{npc}}) \circ f_{\mathrm{distractors}}(\,\cdot\,; \xi_{\mathrm{dist}}) \circ f_{\mathrm{filters}}(\,\cdot\,; \xi_{\mathrm{filt}}) \circ f_{\mathrm{effects}}(\,\cdot\,; \xi_{\mathrm{eff}}) \circ f_{\mathrm{layout}}(\,\cdot\,; \xi_{\mathrm{lay}})$$
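The stage-wise composition of the renderer can be illustrated with a minimal Python sketch. This is not the KAGE-Env implementation; the stage constructor, the per-axis parameter dictionaries, and the stand-in "frame" (a list recording stage order) are all hypothetical and serve only to show how axis-specific functions compose into a single observation kernel.

```python
from functools import reduce

# Hypothetical sketch: each stage is a pure function (frame, state) -> frame,
# parameterized by its own slice of the visual configuration xi.
def make_stage(name, params):
    def stage(frame, state):
        # A real stage would draw sprites or apply filters to a pixel buffer;
        # here we just record the stage order to illustrate the composition.
        return frame + [(name, params)]
    return stage

def render(state, xi):
    # Stage names mirror the composition above; only xi changes the output,
    # never the latent state or the dynamics.
    stages = [
        make_stage("background", xi["bg"]),
        make_stage("character", xi["ch"]),
        make_stage("npc", xi["npc"]),
        make_stage("distractors", xi["dist"]),
        make_stage("filters", xi["filt"]),
        make_stage("effects", xi["eff"]),
        make_stage("layout", xi["lay"]),
    ]
    return reduce(lambda frame, f: f(frame, state), stages, [])

xi = {k: {} for k in ["bg", "ch", "npc", "dist", "filt", "eff", "lay"]}
frame = render(state={"x": 0.0}, xi=xi)
print([name for name, _ in frame])
```

Because each stage reads only its own slice of $\xi$, swapping one stage's parameters leaves every other stage's output rule untouched, which is what makes axis-isolated intervention possible.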

Each axis admits granular control over aspects such as background palette/type, photometric filters (e.g., brightness, contrast, color shifts), lighting effects, agent and NPC sprite attributes, and spatial layout. For a given state $s$, an observation $o \sim O_\xi(\cdot \mid s)$ is rendered, producing the pixel input to the RL policy $\pi(a \mid o)$. Marginalizing over rendered images induces a state-conditional action distribution:

$$\pi_\xi(a \mid s) = \int_{\Omega} \pi(a \mid o)\, O_\xi(do \mid s)$$

All effects of visual variation are thus mediated by axis-specific changes in $\pi_\xi(a \mid s)$, with the ground-truth physics and objectives unchanged.
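The marginalization above can be estimated by Monte Carlo: render many observations for a fixed latent state and average the policy's action probabilities. The sketch below assumes stand-in `render` and `policy_probs` functions (they are illustrative placeholders, not KAGE-Env API) but the estimator structure is generic.

```python
import jax
import jax.numpy as jnp

def render(key, state, xi):
    # Stand-in stochastic renderer O_xi(. | s): a noisy "image" derived
    # from the state, with xi controlling the noise scale.
    return state + xi * jax.random.normal(key, (8,))

def policy_probs(obs):
    # Stand-in policy pi(a | o) over the 8 discrete actions.
    logits = obs.sum() * jnp.linspace(-1.0, 1.0, 8)
    return jax.nn.softmax(logits)

def pi_xi(key, state, xi, n_samples=256):
    # Monte Carlo estimate of pi_xi(a | s): sample o ~ O_xi(. | s),
    # evaluate pi(a | o), and average over samples.
    keys = jax.random.split(key, n_samples)
    obs = jax.vmap(lambda k: render(k, state, xi))(keys)
    return jax.vmap(policy_probs)(obs).mean(axis=0)

probs = pi_xi(jax.random.PRNGKey(0), state=jnp.ones(8), xi=0.1)
print(probs.shape, float(probs.sum()))
```

Comparing such estimates under $\xi^{\mathrm{train}}$ versus $\xi^{\mathrm{eval}}$ for the same state directly exposes how a visual shift perturbs the induced action distribution.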

2. Known-Axis Benchmark Suites and Configuration Structure

KAGE-Bench operationalizes axis-level control via six suites, each consisting of paired train/eval configurations that differ exclusively along one visual axis while fixing all others. Each suite encodes one dimension of visual shift:

| Suite | Axis | Train–Eval Pairs | Representative Variations |
| --- | --- | --- | --- |
| Agent Appearance | $\xi_{\mathrm{ch}}$ | 5 | Shape (circle/line), color (teal/pink), sprite IDs |
| Background | $\xi_{\mathrm{bg}}$ | 10 | Solid color, random images, parallax, frequency |
| Distractors | $\xi_{\mathrm{dist}}$ | 6 | None / same-as-agent / increasing number of distractors |
| Effects | $\xi_{\mathrm{eff}}$ | 3 | Point light intensity, count, falloff |
| Filters | $\xi_{\mathrm{filt}}$ | 9 | Brightness, contrast, hue, noise, pixelate, etc. |
| Layout | $\xi_{\mathrm{lay}}$ | 1 | Platform color, step height distribution |

In every pair $(\xi^{\mathrm{train}}, \xi^{\mathrm{eval}})$, only one axis is changed, ensuring any measured generalization gap ($\Delta F$) is uniquely attributable to that axis. The distributions $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{eval}}$ are degenerate (singleton) on a YAML-specified configuration. Detailed configuration IDs and manipulations are enumerated in Table 2 of (Cherepanov et al., 20 Jan 2026).
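An axis-isolated pair can be pictured as two configurations that agree everywhere except on one axis. The field names below are hypothetical (real KAGE-Bench configurations are specified in YAML files with their own schema); the point is only the single-axis difference and its sanity check.

```python
# Illustrative axis-isolated train/eval pair: identical except for "filters".
base = {
    "background": {"type": "solid", "color": "gray"},
    "character": {"shape": "circle", "color": "teal"},
    "distractors": {"count": 0},
    "effects": {"lights": 0},
    "filters": {"brightness": 1.0, "contrast": 1.0},
    "layout": {"platform_color": "brown"},
}

xi_train = base
xi_eval = {**base, "filters": {"brightness": 0.5, "contrast": 1.5}}

# Sanity check: exactly one axis differs, so any measured generalization
# gap is attributable to that axis alone.
changed = [k for k in base if xi_train[k] != xi_eval[k]]
print(changed)
```

Running the check before an experiment guards against accidentally confounded pairs, which is exactly the failure mode KAGE-Bench is designed to exclude.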

3. JAX-Based Implementation and Parallelism

KAGE-Env is implemented entirely in JAX, leveraging jax.jit for compilation and jax.vmap for high-throughput data-parallel simulation. Core environment functions (reset, step) are JIT-compiled and vectorized, such that up to $2^{16}$ independent environments can be executed in parallel on a single GPU without Python-side loops. Using an H100 GPU, observed throughput reaches 33 million environment steps per second even under maximal visual complexity.

Example initialization and execution code:

import jax
from kage_bench import KAGE_Env, load_config_from_yaml

config = load_config_from_yaml("custom_config.yaml")
env = KAGE_Env(config)

# JIT-compile and vectorize reset/step across a batch of environments.
reset_vec = jax.jit(jax.vmap(env.reset))
step_vec  = jax.jit(jax.vmap(env.step))

N = 2**16
key = jax.random.PRNGKey(0)
key, action_key = jax.random.split(key)

# One independent PRNG key per environment instance.
reset_keys = jax.random.split(key, N)
obs, info = reset_vec(reset_keys)
states = info["state"]

# Sample one of the 8 discrete actions for each environment.
actions = jax.random.randint(action_key, (N,), 0, 8)
obs, rews, terms, truncs, info = step_vec(states, actions)

This architectural approach enables exhaustive, reproducible sweeps over visual axes at scale and facilitates precise diagnosis of axis-specific learning and generalization phenomena.
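A throughput figure like the one quoted above can be measured by timing a JIT-compiled, vmapped step loop after a warm-up call. The sketch below is self-contained with a trivial stand-in `step` (the real `env.step` would be used in practice); the timing pattern, including `block_until_ready` to account for JAX's asynchronous dispatch, is the standard one.

```python
import time
import jax
import jax.numpy as jnp

N = 2 ** 16  # number of parallel environments

def step(state, action):
    # Stand-in for env.step's state update; real KAGE-Env steps do far more.
    return state + action

step_vec = jax.jit(jax.vmap(step))
states = jnp.zeros((N,))
actions = jnp.ones((N,))

# Warm-up call so compilation time is excluded from the measurement.
step_vec(states, actions).block_until_ready()

n_iters = 100
t0 = time.perf_counter()
for _ in range(n_iters):
    states = step_vec(states, actions)
# Wait for asynchronous dispatch to finish before stopping the clock.
states.block_until_ready()
elapsed = time.perf_counter() - t0
print(f"{N * n_iters / elapsed:.3e} env steps/sec")
```

Without the warm-up and the final `block_until_ready`, the measurement would mostly capture compile time or merely the cost of enqueueing work, not actual simulation throughput.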

4. Evaluation Protocol, PPO-CNN Baseline, and Metrics

The standard benchmark protocol defines fixed experimental conditions:

  • Training: Policies are trained using PPO (CleanRL implementation) with a CNN encoder (three convolutional layers, a 512-unit fully connected layer, and independent actor and critic heads) for $T = 25$ million steps, aggregated from 128 vectorized environments with rollout length 128. Hyperparameters are tabulated in Table A.9.
  • Evaluation: Models are checkpointed every 300 policy updates and evaluated over 128 episodes each for both train (in-distribution) and eval (out-of-distribution) configurations for the targeted axis.
  • Aggregation: For each metric, the maximum value achieved during training per seed is averaged over 10 seeds, and then mean performance gaps are computed across all configuration pairs within each suite.
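The aggregation rule described above (best checkpoint per seed, mean over seeds, mean gap over pairs) can be sketched with synthetic arrays. The shapes and data here are illustrative, not benchmark results.

```python
import numpy as np

# Synthetic per-checkpoint scores, shape (n_pairs, n_seeds, n_checkpoints).
rng = np.random.default_rng(0)
train = rng.uniform(0.5, 1.0, size=(5, 10, 20))  # in-distribution, e.g. success rate
evals = rng.uniform(0.0, 0.6, size=(5, 10, 20))  # out-of-distribution

# Per seed, take the maximum value achieved during training, then average
# over the 10 seeds, for each configuration pair.
best_train = train.max(axis=2).mean(axis=1)
best_eval = evals.max(axis=2).mean(axis=1)

# Mean performance gap across all configuration pairs within the suite.
suite_gap = (best_train - best_eval).mean()
print(round(float(suite_gap), 3))
```

Taking the per-seed maximum before averaging reports each run at its best, so the resulting gap reflects generalization failure rather than incomplete training.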

All metrics are computed directly from the latent physics state via info["state"]:

  • $F_{\mathrm{dist}} = x_T - x_{\mathrm{init}}$: absolute progress,
  • $F_{\mathrm{prog}} = (x_T - x_{\mathrm{init}})/D$: normalized completion,
  • $F_{\mathrm{succ}} = \mathbf{1}\{x_T - x_{\mathrm{init}} \ge D\}$: success/failure,
  • $G = \sum_{t=0}^{T-1} \gamma^t r(s_t, a_t)$: discounted return.
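The four metrics reduce to a few lines of array arithmetic over a single episode's latent state trajectory. The function below is an illustrative sketch, with `x` the agent's horizontal position per step, `D` the level length, and synthetic trajectory data; variable names are ours, not KAGE-Env's.

```python
import numpy as np

def episode_metrics(x, rewards, D, gamma=0.99):
    # x: per-step horizontal positions; rewards: per-step rewards.
    f_dist = x[-1] - x[0]                   # absolute progress
    f_prog = f_dist / D                     # normalized completion
    f_succ = float(f_dist >= D)             # success indicator
    discounts = gamma ** np.arange(len(rewards))
    g = float(np.sum(discounts * rewards))  # discounted return
    return f_dist, f_prog, f_succ, g

x = np.linspace(0.0, 12.0, 50)   # synthetic trajectory covering 12 units
rewards = np.full(49, 0.1)       # synthetic per-step rewards
print(episode_metrics(x, rewards, D=10.0))
```

Because these quantities are read from info["state"] rather than from pixels, they remain comparable across any visual configuration $\xi$.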

Generalization gap for a metric FF is quantified as

$$\Delta F = \mathbb{E}[F]_{\mathrm{train}} - \mathbb{E}[F]_{\mathrm{eval}}$$

with axis-level summaries for $\Delta\mathrm{Dist}$, $\Delta\mathrm{Prog}$, $\Delta\mathrm{SR}$ (success rate), and $\Delta\mathrm{Return}$ reported in Table 1.

5. Empirical Findings and Axis-Dependent Failure Modes

Analysis of PPO-CNN performance reveals severe and axis-dependent generalization failures:

  • Filters: success rate collapses from 0.83 (train) to 0.11 (eval), a $\Delta\mathrm{SR} = 86.8\%$ gap.
  • Effects: 0.82 → 0.16 ($\Delta\mathrm{SR} = 80.5\%$).
  • Background: 0.90 → 0.42 ($\Delta\mathrm{SR} = 53.3\%$).
  • Layout: 0.86 → 0.32 ($\Delta\mathrm{SR} = 62.8\%$).
  • Distractors: 0.81 → 0.56 ($\Delta\mathrm{SR} = 30.9\%$).
  • Agent Appearance: 0.76 → 0.60 ($\Delta\mathrm{SR} = 21.1\%$).

The raw distance gap is smaller (3–30%), indicating that agents often retain basic locomotion but fail at task completion under photometric or lighting shift. Figure 1 (left) shows monotonic degradation in success as background complexity increases, while Figure 1 (right) demonstrates collapse as the number of agent-like distractors increases. Per-configuration failures are detailed in Table 2 and Appendix D.

A plausible implication is that standard return or distance-based RL metrics may obscure latent generalization failures, as agents can appear functional in terms of movement yet become incapable of completing tasks under altered visual conditions.

6. State-Conditional Action Distributions and Attribution of Visual Effects

By construction, KAGE-Env enforces that all train–evaluation generalization gaps under axis variation are mediated by changes in the induced state-conditional action distribution

$$\pi_\xi(a \mid s) = \int \pi(a \mid o)\, O_\xi(do \mid s),$$

where the environment's dynamics and rewards are unaffected by $\xi$. For every timestep $t$,

$$\Pr(a_t = a \mid s_t) = \pi_\xi(a \mid s_t).$$

This reduction justifies that isolating a single axis in $\xi$ directly attributes any difference in control policy, and thus any generalization gap, to that specific aspect of the visual renderer.

This suggests that KAGE-Env is especially well-suited to causal analysis of visual failure modes in vision-based RL, as confounding state dynamics and reward shifts are categorically excluded.

7. Context, Significance, and Application Scope

KAGE-Env and KAGE-Bench address known limitations of prior RL generalization benchmarks, which have entangled multiple covariates and precluded unambiguous attribution of failures to distinct visual dimensions. Through precise renderer factorization, axis-isolated suites, and scalable, reproducible simulation, KAGE-Env enables a new standard for evaluating robustness and transferability of pixel-based RL agents facing controlled visual distribution shifts. The observed strong axis-dependence of generalization failures underscores the necessity of axis-specific analysis for policy design and diagnostics (Cherepanov et al., 20 Jan 2026).
