Zero-Shot Cross-Game Generalization
- The paper introduces a framework for zero-shot cross-game generalization based on spatial–temporal decomposition and contrastive representation learning, enabling immediate policy transfer across varying environments.
- It outlines innovative methodologies including action embeddings, relational inference, and game-invariant vision to handle variations in visual style, layout, reward structures, and dynamics.
- Empirical results show significant improvements in transfer efficiency and reduced sample complexity compared to traditional RL methods, validating the robustness of these approaches.
Zero-shot cross-game generalization is the capability of an intelligent agent to learn in one or several games and immediately perform well in unseen, potentially distinct games without further interaction, data collection, or fine-tuning. This property transcends conventional generalization in reinforcement learning (RL), demanding invariance to visual style, layout, reward structure, action semantics, and dynamics, often requiring architectural, representational, and training innovations that isolate transferable structure from non-transferable particulars. The field leverages concepts from spatial–temporal decomposition, relational inference, action embedding, context learning, and contrastive adversarial representation learning, resulting in a growing corpus of methods with empirical evidence for zero-shot transfer across games and domains.
1. Formal Problem Statements and Generalization Regimes
Zero-shot cross-game generalization is typically formalized in families of Markov decision processes (MDPs) where structure (physics, atomic elements, action set, reward semantics) is shared but instantiation (layout, visual style, object types) varies per game. The agent receives exploratory trajectories, sparse rewards, or off-task action observations in the source domains, then is evaluated directly in a novel target domain without additional interaction.
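To make this evaluation protocol concrete, the following minimal Python sketch shows zero-shot evaluation in a held-out game, assuming a Gym-style `reset()`/`step()` interface; `make_game`, `train_agent`, and the `agent` object are hypothetical placeholders, not any paper's API. The defining constraint is that the frozen agent never interacts with or updates on the target game before evaluation.

```python
# Minimal sketch of the zero-shot cross-game evaluation protocol.
# `make_game` and `train_agent` are hypothetical helpers; the key constraint is
# that the target game is never touched before evaluation.

def evaluate_zero_shot(agent, target_game_id, make_game, episodes=10):
    """Run the frozen agent in an unseen game with no fine-tuning or extra data."""
    env = make_game(target_game_id)           # target MDP: same family, new instantiation
    returns = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs)           # no gradient updates, no replay collection
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)

# Typical protocol: train on source games only, then evaluate directly on a held-out game.
# agent = train_agent(source_game_ids=["game_A", "game_B"])
# score = evaluate_zero_shot(agent, "game_C", make_game)
```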
Distinct regimes are studied:
- Trajectory-based transfer: The agent receives only trajectory-level experience (states/actions and a sparse terminal reward), with no dense reward or online environment interaction during training; at test time, the policy is deployed zero-shot with no further data collection (Xu et al., 2019).
- Relational state alignment: Transfer is achieved via analogical mapping of explicit relational structure learned unsupervised from object-level statistics (Doumas et al., 2019).
- Action-centric generalization: Policy is designed to operate over arbitrary (and previously unseen) action sets, using auxiliary action embeddings from side observations (Jain et al., 2020).
- Contextual inference: Robust context representations are integrated with policy/value functions and learned jointly for zero-shot extrapolation to unseen environmental parameters (Ndir et al., 15 Apr 2024).
- Game-invariant vision: Visual encoders are trained to remove game-specific style cues, yielding embeddings that facilitate downstream transfer for novel games (Kline, 22 May 2025).
- Cross-trajectory SSL: Encoders are encouraged to cluster behaviorally similar state/action trajectories, reducing reward-overfitting and isolating transferable "situations" (Mazoure et al., 2021).
Each formalism places unique constraints on the class of environments, supervision, and transfer evaluation protocol.
2. Core Methodologies for Zero-Shot Transfer
Several architecture and training paradigms underpin zero-shot cross-game generalization:
2.1 Spatial–Temporal Reward Decomposition (SAP) (Xu et al., 2019)
- Contingency-aware local observations: The global state is cropped to an egocentric window around the controllable agent and further partitioned into smaller local regions.
- Score model: A neural scoring function maps local features and actions to fine-grained pseudo-rewards, which are aggregated over regions and timesteps and trained so that the aggregate matches the sparse terminal reward.
- Forward dynamics model: A learned model predicts the next local observation from the current local observation and action, enabling planning in unseen environments.
- MPC planning: Candidate action sequences are rolled out with the dynamics model, scored by their cumulative pseudo-reward under the score model, and the first action of the highest-scoring sequence is executed (see the sketch after this list).
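The planning step can be sketched as random-shooting MPC over the learned models. In the sketch below, `score_model` and `dynamics_model` are hypothetical stand-ins for SAP's trained networks; the paper's exact cropping and aggregation details are omitted.

```python
import numpy as np

def mpc_plan(local_obs, score_model, dynamics_model, action_space,
             horizon=10, n_candidates=256, rng=None):
    """Random-shooting MPC over learned models, in the spirit of SAP.

    score_model(obs, action)    -> scalar pseudo-reward for a local observation
    dynamics_model(obs, action) -> predicted next local observation
    Both are assumed to be already trained; their parameterization is a sketch.
    """
    rng = rng or np.random.default_rng()
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        seq = rng.choice(action_space, size=horizon)   # sample a candidate action sequence
        obs, total = local_obs, 0.0
        for a in seq:
            total += score_model(obs, a)               # accumulate predicted pseudo-rewards
            obs = dynamics_model(obs, a)               # roll the sequence forward
        if total > best_return:
            best_return, best_first_action = total, seq[0]
    return best_first_action                           # execute only the first action (receding horizon)
```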
2.2 Relational Representation and Analogical Mapping (Doumas et al., 2019)
- Predicate extraction: Objects and their spatial/temporal relations are encoded as explicit predicate units via unsupervised comparison; dynamic binding enables flexible role-filler association.
- RL over relational states: Policy network consumes sets of predicate-role-filler triples as input; TD updates as in standard RL frameworks.
- Analogical transfer: Structured alignment matches relational graphs from source and target games; role bindings and action schemas are mapped, yielding zero-shot policy activation in the target.
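A drastically simplified, brute-force sketch of the analogical mapping step follows: states are represented as sets of `(predicate, role, filler)` triples, and target fillers are mapped to source fillers by maximizing shared predicate-role structure. The real system uses learned, distributed predicate representations and structured inference rather than this exhaustive search; all names here are illustrative.

```python
from collections import Counter
from itertools import permutations

def analogical_mapping(source_triples, target_triples):
    """Toy structural alignment between two relational states.

    Each state is a set of (predicate, role, filler) triples, e.g.
    ("above", "arg1", "paddle"). Returns the filler mapping that maximizes
    the number of shared (predicate, role) signatures -- a heavily
    simplified stand-in for analogical inference.
    """
    def signature(triples, filler):
        return Counter((p, r) for p, r, f in triples if f == filler)

    src_fillers = sorted({f for _, _, f in source_triples})
    tgt_fillers = sorted({f for _, _, f in target_triples})

    best_score, best_map = -1, {}
    for perm in permutations(tgt_fillers, len(src_fillers)):
        mapping = dict(zip(src_fillers, perm))
        score = sum(sum((signature(source_triples, s) &
                         signature(target_triples, t)).values())
                    for s, t in mapping.items())
        if score > best_score:
            best_score, best_map = score, mapping
    return best_map

# Example: objects in one game are mapped onto objects in another that play
# the same relational roles, so the source policy can be reused zero-shot.
```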
2.3 Action Embedding Framework (Jain et al., 2020)
- Hierarchical VAE: For each action, side observations of its behavior are encoded by a hierarchical variational autoencoder into a global latent action embedding.
- Action-conditioned policy: Given a variable action set, the policy computes a utility score for each available action from the state representation and that action's embedding, then applies a softmax over the available actions.
- Regularization: Episodic random subsampling of the action set, entropy bonuses, and early stopping prevent overfitting and promote generalization to unseen action subsets (see the sketch below).
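A minimal PyTorch sketch of the action-conditioned policy head: given precomputed action embeddings (assumed to come from the separately trained hierarchical VAE, not shown), utilities are dot products between the state representation and each available action's embedding, so the same policy applies to unseen action sets of any size. Class and argument names are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class ActionSetPolicy(nn.Module):
    """Policy over a variable, possibly unseen, action set."""

    def __init__(self, obs_dim, embed_dim, hidden=128):
        super().__init__()
        self.state_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, obs, action_embeddings):
        # obs: (batch, obs_dim); action_embeddings: (num_actions, embed_dim)
        query = self.state_net(obs)                     # (batch, embed_dim)
        utilities = query @ action_embeddings.T         # (batch, num_actions)
        return torch.distributions.Categorical(logits=utilities)

# At test time the action set can differ in size and content:
# dist = policy(obs, unseen_action_embeddings); action = dist.sample()
```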
2.4 Context Representation Learning (Ndir et al., 15 Apr 2024)
- Behavior-specific context encoder: Infers a low-dimensional latent context from the most recent transition tuples; the ground-truth context parameters are never revealed to the agent.
- Joint SAC loss: The policy and Q-networks are trained with the standard actor/critic objectives, whose gradients also flow into the context encoder; the context encoding is thereby tailored to the RL objective (see the sketch below).
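A minimal PyTorch sketch of the context pathway: recent transitions are encoded and pooled into a latent context that augments the observation fed to the SAC actor and critic, and in joint training the RL losses backpropagate into this encoder. The module names and the mean-pooling choice are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Infers a latent context from the K most recent transition tuples.

    Simplified stand-in: the encoder is trained jointly with the SAC
    actor/critic, so RL gradients shape the context representation.
    """
    def __init__(self, transition_dim, context_dim=8, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, context_dim),
        )

    def forward(self, transitions):
        # transitions: (batch, K, transition_dim) -> mean-pool per-transition codes
        return self.net(transitions).mean(dim=1)        # (batch, context_dim)

def augment_observation(obs, transitions, encoder):
    """Concatenate the inferred context to the raw observation for actor/critic."""
    context = encoder(transitions)
    return torch.cat([obs, context], dim=-1)

# Joint training (sketch): the SAC critic/actor losses are computed on augmented
# inputs, and loss.backward() propagates through `encoder`, tailoring the context to RL.
```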
2.5 Contrastive and Domain-Adversarial Vision (Kline, 22 May 2025)
- Contrastive objective: Maximizes agreement between augmented views of the same image via an InfoNCE loss.
- Adversarial domain classifier: A game-ID classifier is trained on the embeddings while a gradient-reversal layer forces the encoder to maximize its classification loss, suppressing game identity; the joint objective sums the contrastive and adversarial terms (sketched below).
- Evaluation: Domain classification accuracy drops to near random (10–15%) after training, and t-SNE visualizations show mixing of embeddings across games.
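A hedged PyTorch sketch of such a joint objective: an InfoNCE term over augmented views plus a domain-classification term routed through a gradient-reversal layer, so the classifier learns game identity while the encoder learns to erase it. The `encoder` and `domain_head` modules and the weighting `lam` are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, sign-flipped (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def info_nce(z1, z2, temperature=0.1):
    """Simplified InfoNCE between two augmented views of the same batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)             # positives lie on the diagonal

def joint_loss(encoder, domain_head, view1, view2, game_ids, lam=1.0):
    """Contrastive content preservation + adversarial suppression of game identity."""
    z1, z2 = encoder(view1), encoder(view2)
    contrastive = info_nce(z1, z2)
    # Gradient reversal: the domain head learns to classify games, while the
    # encoder receives the negated gradient and learns to hide game identity.
    domain_logits = domain_head(GradReverse.apply(z1, lam))
    adversarial = F.cross_entropy(domain_logits, game_ids)
    return contrastive + adversarial
```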
2.6 Cross-Trajectory SSL (CTRL) (Mazoure et al., 2021)
- Reward-free encoder training: Encoder clusters sub-trajectories via Sinkhorn-softmax and minimizes cross-cluster prediction error; pseudo-bisimulation emerges in representation space.
- Integration with PPO: Standard policy/value heads are updated in parallel; encoder is isolated from reward signals.
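A simplified PyTorch sketch of the cross-trajectory objective: sub-trajectory embeddings are softly assigned to learned prototypes, and each view is trained to predict the other's assignment. The actual method balances assignments with a Sinkhorn step and uses richer trajectory encoders; a plain softmax and linear layers are used here for brevity, and all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryClusterSSL(nn.Module):
    """Reward-free SSL over sub-trajectory embeddings, loosely following CTRL."""

    def __init__(self, traj_dim, embed_dim=64, n_clusters=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(traj_dim, 128), nn.ReLU(),
                                     nn.Linear(128, embed_dim))
        self.prototypes = nn.Linear(embed_dim, n_clusters, bias=False)
        self.predictor = nn.Linear(n_clusters, n_clusters)

    def forward(self, traj_a, traj_b, temperature=0.1):
        # Two behaviorally related sub-trajectory views (e.g. adjacent segments).
        p_a = F.softmax(self.prototypes(self.encoder(traj_a)) / temperature, dim=1)
        p_b = F.softmax(self.prototypes(self.encoder(traj_b)) / temperature, dim=1)
        # Cross-cluster prediction: each view predicts the other's soft assignment.
        loss_ab = F.cross_entropy(self.predictor(p_a), p_b.detach())
        loss_ba = F.cross_entropy(self.predictor(p_b), p_a.detach())
        return 0.5 * (loss_ab + loss_ba)
```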
3. Experimental Protocols, Benchmarks, and Performance
Experimental validation employs synthetic and game environments with held-out test domains that challenge transfer capabilities.
| Method | Setting | Train/Test Protocol | Metric | Performance Highlights |
|---|---|---|---|---|
| SAP (Xu et al., 2019) | Super Mario Bros, BlockedReacher | Train on 1 level/config, zero-shot test on disjoint layout/game | Avg distance/steps | Mario: Test 790 (vs. BC 350, MBHP 588); Reacher: Test 86–113 (vs. MBHP 102–161) |
| Relation RL (Doumas et al., 2019) | Breakout → Pong | Train relational policy in Breakout, analogical transfer to Pong | Paddle-hit rate | Zero-shot Pong 72% (vs. DQN 52%); sample complexity advantage |
| Action Embedding (Jain et al., 2020) | GridWorld, CREATE, Stacking | Train/test split over action sets, zero-shot test on unseen actions | Success % / height | Test: GridWorld 83%; CREATE Push 88%; Stacking 6.9 (vs. train 7.6) |
| Context RL (Ndir et al., 15 Apr 2024) | CARL (Cartpole, MountainCar, Ant) | Train/test over context values, no retraining | IQM normalized return | Ant extrapolation: jcpl 1.0635 (vs. predictive 0.9461) |
| Game-invariant Vision (Kline, 22 May 2025) | Bingsu dataset (images from 10 games) | All games in training, evaluate via domain classifier | Domain accuracy | Post-training domain accuracy ≈10–15% (vs. ImageNet 95%, SimCLR 40%) |
| CTRL (Mazoure et al., 2021) | Procgen (16 games) | Train/eval on disjoint level splits, within-game generalization | Mean episodic return | +15% over PPO baseline, significant on 10/16 games |
A plausible implication is that local compositionality (SAP), explicit relational state alignment, and game-invariant representation learning each independently yield substantial improvements in zero-shot generalization over traditional RL and behavior cloning. Reductions in sample complexity and direct transfer without additional interaction are also demonstrated.
4. Mechanisms Enabling Cross-Game Invariance and Transfer
Cross-game invariance is supported by several key mechanisms:
- Locality and compositionality: Decomposing observations and rewards into object-centric local windows preserves the semantics of atomic game elements and enables their recombination in new layouts (Xu et al., 2019).
- Explicit relational abstraction: Learning symbolic predicates with dynamic role binding, allowing flexible policy mapping even across games with divergent surface features (Doumas et al., 2019).
- Learned action and context embeddings: Embedding actions from side observations and contexts from historical transitions promotes transfer when semantic overlap is partial or unknown (Jain et al., 2020, Ndir et al., 15 Apr 2024).
- Representation learning via invariance: Adversarial suppression of style features combined with contrastive content preservation yields game-invariant visual encoders (Kline, 22 May 2025).
- Behavioral similarity clustering: SSL objectives grounded in trajectory clustering induce a reward-free notion of pseudo-bisimulation, circumventing overfitting and isolating transferable "situations" (Mazoure et al., 2021).
Methodologically, alignment of local rewards, structural roles, and embeddings enables transfer of policies, while MPC and analogical inference provide mechanisms for effective action selection in unfamiliar environments.
5. Limitations, Open Challenges, and Extensions
Limitations exist in scope and generalizability:
- Extent of cross-domain transfer: Many approaches validate only cross-layout or cross-context transfer within a game or physics family; true cross-game generalization (e.g., Mario → Sonic) remains challenging, often for lack of shared semantics or reward structure (Ndir et al., 15 Apr 2024, Mazoure et al., 2021).
- Action semantics and state heterogeneity: Transfer between games with mutually exclusive action sets or vastly divergent state spaces is not directly supported; meta-RL and universal encoders are posited as remedies (Jain et al., 2020).
- Task and reward variation: Most methods hold reward function fixed or sparsely defined; extension to reward-varying transfer (tasks/goals shifting) is open (Ndir et al., 15 Apr 2024).
- Embedding collapse and content loss: Pure adversarial learning degrades the informativeness of embeddings; contrastive objectives are needed to preserve relevant content (Kline, 22 May 2025).
- Temporal abstraction: Simple aggregation or mean-pooling of transitions may fail to capture temporal dependencies vital for identifying context or transferable mechanism (Ndir et al., 15 Apr 2024).
Potential extensions indicated in the literature include game-ID factored context encoding, universal action representation, meta-training across diverse games, and permutation-invariant representation learners. Some propose meta-RL "outer loops" to optimize for rapid adaptation, and contrastive regularization across games to align semantically similar actions and states.
6. Relationships to Developmental and Cognitive Transfer
Zero-shot cross-game generalization is paralleled by analogous mechanisms in human cognitive development:
- Extraction and binding of relational invariants mirrors the developmental trajectory from perceptual comparison to explicit relation reasoning in children (Doumas et al., 2019).
- Immediate analogical transfer emulates the human ability to apply abstracted rules or policies to novel tasks, reducing sample complexity by orders of magnitude.
- Compositionality and local reasoning reflect the way biological agents operate with local object affordances and easily recombine learned principles.
- Isolation from superficial cues matches cognitive suppression of irrelevant features during analogical mapping.
The fit between computational mechanisms and developmental observations supports the hypothesis that representation and inference over structured, game-invariant entities is essential for robust cross-domain generalization in both natural and artificial agents.
7. Representative Methods, Ablations, and Empirical Findings
Comprehensive ablation studies and baseline comparisons across methods underline the necessity of each major architectural component:
| Component Ablated | Empirical Effect |
|---|---|
| Spatial reward decomposition | −17% (Mario returns) |
| Temporal aggregation | Further −23% |
| Pure contrastive (no adversarial) | Residual clustering by game (not invariant) (Kline, 22 May 2025) |
| Adversarial only (no contrastive) | Embedding collapse (no content) |
| Prediction loss (CTRL) | −17% drop in average return (Mazoure et al., 2021) |
| Clustering loss (CTRL) | −21% collapse in cluster integrity |
| Action FiLM conditioning | −24% degradation |
Statistically significant gains in zero-shot return, suppression of domain classification accuracy toward chance, and improved transfer sample efficiency are consistently documented.

In summary, robust zero-shot cross-game generalization depends on deliberate decomposition, structured relational encoding, adversarial and contrastive representation learning, and joint objective optimization. By integrating these principles, recent methods achieve immediate or near-immediate performance on previously unseen games, advancing the frontier of generalizable artificial agents.