Pixel-Based Reinforcement Learning
- Pixel-based reinforcement learning is a framework where agents learn directly from high-dimensional pixel data, enabling focused studies on visual generalization.
- KAGE-Env and KAGE-Bench decompose visual observations into approximately 90 controllable axes to isolate and evaluate specific visual shifts.
- The integration of JAX parallelization and vectorization achieves high simulation throughput, facilitating rapid empirical testing and robust RL algorithm development.
Pixel-based reinforcement learning (RL) refers to RL frameworks in which agents perceive their environment directly through high-dimensional pixel observations, rather than structured or low-dimensional state representations. Addressing the visual generalization challenge—where agents must maintain performance across visually distinct but functionally identical scenarios—has become a central focus. KAGE-Env and KAGE-Bench exemplify state-of-the-art efforts to systematically decompose and rigorously evaluate the visual generalization capacity of pixel-based RL agents by factorizing the observation process into independently controllable visual axes, with a strictly fixed latent control problem (Cherepanov et al., 20 Jan 2026).
1. Observation Factorization and Visual POMDP Formulation
The KAGE-Env paradigm frames the environment as a family of visual partially observable Markov decision processes (POMDPs) indexed by a vector of visual parameters θ ∈ Θ. The core objects are:
- Latent state space S with states s (e.g., position, velocity).
- Action space A, encoding core action primitives as a bitmask.
- Observation space O of RGB images (the default observation type).
- Visual parameter space Θ (background, filters, lighting, sprites, distractors, layout, etc.).
- Rendering process (observation kernel) mapping from S × Θ to O, concretely implemented as a deterministic rendering function g_θ : S → O.
Crucially, the latent transition kernel and reward function are independent of θ, ensuring that the underlying MDP remains fixed while the entire visual channel is modifiable. This allows any change in agent performance under visual shifts to be attributed unambiguously to visual generalization, not to confounding changes in dynamics or rewards.
A reactive pixel policy π(a | o) acting in environment θ induces a state-conditional policy in the latent MDP. Theorem 1 of (Cherepanov et al., 20 Jan 2026) demonstrates formal equivalence at the level of trajectories and returns.
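The factorization above can be sketched in plain Python (hypothetical names and toy dynamics; the actual KAGE-Env internals differ): the latent transition and reward never read the visual parameters θ, while only the rendering function does, so two visually distinct parameterizations yield identical latent trajectories and returns.

```python
from dataclasses import dataclass

@dataclass
class LatentState:
    x: float  # horizontal position
    v: float  # horizontal velocity

def transition(s, a, dt=0.1):
    # Latent dynamics: depends only on (state, action), never on theta.
    v = s.v + (1.0 if a == 1 else -1.0) * dt
    return LatentState(x=s.x + v * dt, v=v)

def reward(s, s_next):
    # Reward also ignores theta: pay only for forward progress.
    return max(0.0, s_next.x - s.x)

def render(s, theta):
    # Only the observation channel reads the visual parameters theta.
    # Here: a toy 1-D "image" whose background value comes from theta.
    width = 8
    img = [theta["bg"]] * width
    agent_px = min(width - 1, max(0, int(s.x)))
    img[agent_px] = theta["agent_color"]
    return img

s0 = LatentState(x=2.0, v=0.0)
s1 = transition(s0, a=1)
# Two visually distinct thetas yield different observations...
obs_a = render(s1, {"bg": 0, "agent_color": 255})
obs_b = render(s1, {"bg": 7, "agent_color": 128})
# ...but the same latent trajectory and the same reward.
```

Because `transition` and `reward` take no θ argument at all, any performance drop under a θ-shift must come from the policy's visual processing, which is exactly the attribution argument of the paper.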
2. Visual Axes: Controllable Visual Dimensions
KAGE-Env exposes approximately 90 rendering and observation parameters as independent axes, which can be precisely manipulated. The principal axes are:
| Axis | Key Modes/Controls | Example Range/Description |
|---|---|---|
| Background | "black"/"color"/"noise"/"image", parallax, tiling, switch frequency | Single color, uniform random, 128 photos |
| Agent | 27 animated sprites, 9 shapes × 21 colors, animation fps, rotation | Skin, color, shape, pose |
| NPCs/Dist. | Sprite dirs, per-episode NPCs, sticky NPCs, distractor count/types/colors/kinematics | Moving shapes, world-fixed/sticky |
| Photometric | Brightness, contrast, gamma, saturation, hue, color temp, noise, blur, pixelation, pop filters | Stochastic and deterministic transforms |
| Lighting | Point light count (1–5), intensity, radius, falloff, color | Dynamic relighting |
| Layout | Level length, ground height, step size/prob, stair width, color palette | Procedural geometries |
| Physics | Gravity, move_speed, jump_force, friction, air resistance, max fall speed (optional) | Dynamical system |
Each train–eval pair isolates a shift along a single axis, ensuring causal attribution of downstream generalization failures or successes to the manipulated visual property.
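The single-axis train–eval construction can be sketched as follows (field names and values are illustrative, not the actual KAGE-Bench schema): the evaluation config is a copy of the training config with exactly one axis shifted.

```python
train_cfg = {
    "background": {"mode": "black"},
    "agent": {"sprite": 0, "color": "red"},
    "lighting": {"point_lights": 1},
    "physics": {"gravity": -9.8, "move_speed": 1.0},
}

# Evaluation config: copy the training config, then shift ONE axis.
eval_cfg = {**train_cfg, "background": {"mode": "noise"}}

def shifted_axes(a, b):
    # The axes on which two configs disagree.
    return sorted(k for k in a if a[k] != b[k])
```

A sanity check of the form `shifted_axes(train_cfg, eval_cfg) == ["background"]` makes the causal-attribution property explicit: any generalization gap measured on this pair can only come from the background axis.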
3. Level Construction, Sampling, and Reward Specification
Levels are generated procedurally as side-scrolling layouts (sequences of horizontal segments and staircases) with parameters directly controlling step size, frequency, ground height, and platform thickness. The world is "push-scrolling," with the camera following the agent's progress and exposing new geometry incrementally.
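A toy version of such a procedural layout generator, under assumed parameter names (`step_prob`, `step_size`, `ground_height`), might look like the following; the actual KAGE-Env generator is more elaborate, but the control flow is analogous.

```python
import random

def generate_layout(length, step_prob=0.3, step_size=1, ground_height=3, seed=0):
    # Returns one ground height per horizontal segment of the level.
    rng = random.Random(seed)  # deterministic given the seed
    heights = [ground_height]
    for _ in range(length - 1):
        h = heights[-1]
        if rng.random() < step_prob:
            # Step up or down by one step_size, clamped to stay above the floor.
            h += rng.choice([-step_size, step_size])
        heights.append(max(1, h))
    return heights

layout = generate_layout(20, seed=0)
```

Seeded generation keeps the layout reproducible across the train and eval configs of a pair, so layout only changes when the layout axis itself is the one being shifted.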
At episode reset, the visual parameter vector θ is sampled either deterministically or from a distribution over Θ, defined by KAGE-Bench to fix all but one axis per experiment.
Each observation is rendered in real time via a custom JAX shader pipeline, producing uint8 RGB frames per step.
The per-step reward pays for new forward progress:

r_t = c_pos · max(0, x_t − m_{t−1}) − c_time,

where x_t is the horizontal position, m_{t−1} is the maximum position reached so far, and c_pos, c_time are configurable reward/penalty coefficients. Episodes terminate only at a fixed time horizon.
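This progress-based reward can be sketched as a pure function that also threads the running maximum through the episode (coefficient names `c_pos` and `c_time` are assumed for illustration, not taken from the paper):

```python
def progress_reward(x_t, m_prev, c_pos=1.0, c_time=0.01):
    # Reward only NEW forward progress beyond the running maximum,
    # minus a small per-step time penalty.
    r = c_pos * max(0.0, x_t - m_prev) - c_time
    m_new = max(m_prev, x_t)
    return r, m_new

r_fwd, m = progress_reward(x_t=1.0, m_prev=0.0)  # new progress: rewarded
r_back, m = progress_reward(x_t=0.5, m_prev=m)   # backtracking: only the time penalty
```

Note that retreading old ground earns nothing: re-crossing a position below the running maximum yields only the time penalty, which is what makes forward progress, rather than oscillation, the optimal behavior.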
4. JAX-Parallel Environment API and Vectorization
KAGE-Env is implemented entirely in JAX and designed for high-throughput simulation and reproducibility:
- Python API: Environments are instantiated from YAML configurations and support `reset(key)` and `step(state, action)` methods. Internal state is a JAX pytree, ensuring functional purity.
- Vectorization: Using JAX primitives (`jit`, `vmap`, `lax.scan`), bulk operations across large batches of parallel environments can be fused for maximal GPU utilization, avoiding Python overhead.
- Rendering: Observation generation (rendering) is a pure JAX function, so offline batch video rendering via `jit(vmap(env.render))` is direct.
- Performance: On an NVIDIA H100 (80 GB), throughput reaches up to 33M environment steps/sec with simple visual settings, and 10–15M steps/sec at maximal axis complexity. On an Apple M3 Pro CPU with 1024 envs: 1–2M steps/sec.
Such throughput enables comprehensive studies: 10 seeds × 34 axis-configuration pairs × 25M steps per run execute in a few GPU-hours.
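The struct-of-arrays batching pattern behind this throughput can be illustrated without JAX: state lives in arrays with a leading batch dimension, so a single step call advances every environment at once (in KAGE-Env the analogous pure function is wrapped in `jit(vmap(...))`). A NumPy sketch with toy dynamics:

```python
import numpy as np

def batched_step(xs, vs, actions, dt=0.1):
    # xs, vs, actions: arrays of shape (N_env,). One call = N_env env steps,
    # with no Python loop over environments.
    vs = vs + np.where(actions == 1, 1.0, -1.0) * dt
    xs = xs + vs * dt
    rewards = np.maximum(0.0, vs * dt)  # toy forward-progress reward
    return xs, vs, rewards

N_env = 4
xs = np.zeros(N_env)
vs = np.zeros(N_env)
actions = np.array([1, 1, 0, 0])
xs, vs, rewards = batched_step(xs, vs, actions)
```

Because every operation is elementwise over the batch axis, the same function body vectorizes to thousands of environments; under JAX, `jit` additionally fuses these array operations into a single compiled kernel.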
5. Benchmarking Visual Generalization: KAGE-Bench Protocol
KAGE-Bench comprises six “known-axis” benchmark suites, each with 34 train–eval configuration pairs corresponding to isolated visual axis modifications. In each suite, only a single visual property is changed at evaluation, allowing for sharp tests of axis-dependent generalization.
Empirical results with a standard PPO-CNN baseline indicate pronounced axis-dependence: background and photometric shifts commonly induce failure (collapsing success rate), while agent-appearance shifts are more benign. Some visual shifts break task completion while preserving forward motion, demonstrating that cumulative return alone may obscure subtle failures of generalization.
A plausible implication is that fine-grained or axis-specific evaluation protocols are necessary to fully characterize and stress-test the visual robustness of pixel-based RL agents (Cherepanov et al., 20 Jan 2026).
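An axis-resolved report of this kind reduces to comparing train and eval success rates per axis; a minimal sketch (all numbers illustrative, chosen only to mirror the qualitative pattern described above):

```python
def generalization_gap(results):
    # results: {axis: (train_success, eval_success)}, rates in [0, 1].
    return {axis: train - evl for axis, (train, evl) in results.items()}

results = {
    "background": (0.95, 0.10),        # illustrative: near-total collapse
    "photometric": (0.95, 0.30),       # illustrative: severe degradation
    "agent_appearance": (0.95, 0.88),  # illustrative: benign shift
}
gaps = generalization_gap(results)
worst = max(gaps, key=gaps.get)
```

Reporting the per-axis gap rather than a single aggregate return is exactly what lets a shift that preserves forward motion but breaks task completion show up as a failure instead of being averaged away.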
6. Integration with RL Algorithms
The KAGE-Env API supports plug-and-play integration with standard RL training loops, such as PPO. Example protocol:
```python
from jax import jit, vmap, random
from kage_bench import KAGE_Env, load_config_from_yaml

config = load_config_from_yaml("train_bg_black.yaml")
env = KAGE_Env(config)
N_env = 256

# Vectorize and compile reset/step over the batch of environments.
reset = jit(vmap(env.reset))
step = jit(vmap(env.step))

keys = random.split(random.PRNGKey(seed), N_env)
obs, info = reset(keys)
states = info["state"]

for update in range(num_updates):
    for t in range(T_hor):
        logits = policy_apply(params, obs / 255.0)
        actions = sample_actions(logits, keys)
        obs2, rews, dones, truncs, info2 = step(states, actions)
        states = info2["state"]
        obs = obs2
    params = ppo_update(params, buffer)
```
Hyperparameters adhere to common baselines (CleanRL PPO-CNN: 128 envs, 128 rollout steps, 3 epochs, clip coefficient 0.2; learning rate and batch size as reported in the paper). See the Appendix of (Cherepanov et al., 20 Jan 2026) for precise experimental details, configuration files, and learning curves.
7. Significance and Implications for RL Research
Systematic visual generalization evaluation, as instantiated in KAGE-Bench, enables the field to decouple and rigorously attribute generalization failures to specific visual axes, free from confounding changes in underlying behavior or reward. The KAGE paradigm both accelerates empirical iteration (by orders of magnitude) and clarifies the visual robustness landscape for pixel policy learning. It provides a clean testbed for architectural, algorithmic, and augmentation proposals to close the visual generalization gap in RL, and supports both fast and reproducible sweeps over high-dimensional visual parameter spaces (Cherepanov et al., 20 Jan 2026).