InfGen: Token-Based Scenario & Image Generation
- InfGen is a token-based approach for dynamic traffic and image generation that discretizes scenarios into sequences for autoregressive prediction.
- The framework leverages unified tokenization and transformer architectures to simulate continuous motion, agent interactions, and scene evolution.
- Evaluated with metrics such as MMD and ADE/FDE, InfGen demonstrates strong performance and robustness, supporting unbounded, dynamically evolving simulations.
InfGen refers to a family of approaches, with several independent lines of work using the same or similar names, in the fields of scenario and image generation. Notable usages of "InfGen" are found in (1) long-term traffic and scenario simulation for autonomous driving and (2) resolution-agnostic high-resolution image synthesis. The following survey organizes the key paradigms and innovations under this label.
1. Unified Next-Token Prediction for Traffic Simulation
The InfGen paradigm for traffic simulation (Yang et al., 20 Jun 2025, Peng et al., 29 Jun 2025) reformulates long-term, interactive scenario generation as a single autoregressive next-token (or next-token-group) prediction task, leveraging a transformer-based model to simulate both motion and scene dynamics.
Model Design
- Unified Tokenization: The entire traffic scenario—including road map segments, traffic signals, agent states, and agent motion vectors—is discretized into tokens. Specific schemes tokenize continuous trajectories using a k-disks method and represent new agent poses with grid-discretized coordinates and heading intervals.
- Dynamic Token Matrix: The full multi-agent, multi-timestep simulation is represented as a dynamic matrix of tokens, with rows for agents and columns for time steps. The agent population is not static but evolves as the scene unfolds via learned control tokens.
- Transformer Architecture: Separate modality-specific MLPs embed input tokens, which are then processed via multi-head self-attention (temporal) and cross-attention (agent-agent, map-agent, grid) within each of L transformer layers. The output heads predict motion, agent pose, and control signals.
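The k-disks-style tokenization in the Unified Tokenization bullet can be sketched as nearest-neighbor quantization of one-step motion vectors against a fixed vocabulary. The vocabulary below is randomly sampled for illustration; the actual InfGen vocabulary, its size, and its construction are not specified here.

```python
import numpy as np

# Hypothetical motion-token vocabulary: K candidate (dx, dy, dheading)
# displacements covering the reachable set over one timestep. The real
# InfGen vocabulary is built from data; this one is random for illustration.
rng = np.random.default_rng(0)
VOCAB = rng.uniform(low=[-1.0, -4.0, -0.3], high=[8.0, 4.0, 0.3], size=(256, 3))

def tokenize_step(dx: float, dy: float, dheading: float) -> int:
    """Map a continuous one-step motion to the nearest vocabulary token id."""
    delta = np.array([dx, dy, dheading])
    return int(np.argmin(np.linalg.norm(VOCAB - delta, axis=1)))

def detokenize(token: int) -> np.ndarray:
    """Recover the (quantized) motion vector for a token id."""
    return VOCAB[token]

tok = tokenize_step(3.2, 0.1, 0.02)
recovered = detokenize(tok)  # close to, but not exactly, the input motion
```

Quantization is lossy by design: `detokenize(tokenize_step(...))` returns the nearest vocabulary entry, and re-tokenizing that entry maps back to the same token id.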
Interleaved Simulation Process
InfGen alternates between:
- Temporal Motion Simulation: Closed-loop rollout over active agents, with temporal, social, and map context processed via attention.
- Scene Generation (Agent Insertion): After each motion step, scene generation mode predicts whether new agents need to be added, specifying insertion pose and heading. Learnable agent queries and occupancy grids guide spatial placements.
All switches between these phases are determined by internally predicted control tokens (e.g., `<ADD AGENT>`, `<REMOVE AGENT>`, `<BEGIN MOTION>`, `<KEEP AGENT>`).
2. Auto-Regressive Scenario Generation as Token Group Prediction
An extended InfGen framework (Peng et al., 29 Jun 2025) further decomposes the scenario into sequential groups of tokens for each simulation step, reflecting both environmental (map, traffic lights) and agent-centric (start-of-agent, type, map anchor, position/heading residuals, motion) aspects.
- Structured Token Groups: Each time step is built as a sequence of sub-tokens: traffic light status, agent state (a four-token ordered set), and motion token. This enables injection/removal of agents at any timestep.
- Autoregressive Decoding: Given all previous tokens, the model predicts the next group, enabling both teacher-forcing (for logging-based replay) and sampling (for scenario densification or bootstrapped generation).
- Infinite Scene Generation: By autoregressively inserting new agents and behaviors and not tying agent population to any single initial condition, InfGen supports open-ended, unbounded simulation without static replay.
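The per-step token-group layout described above can be sketched as follows. The exact group composition, token ids, and marker names (e.g., `<SOA>`) are assumptions for illustration, not the paper's vocabulary.

```python
from dataclasses import dataclass

@dataclass
class AgentStateGroup:
    """Hypothetical four-token ordered agent-state group."""
    start: str = "<SOA>"            # start-of-agent marker
    agent_type: int = 0             # e.g., 0=vehicle, 1=pedestrian, 2=cyclist
    map_anchor: int = 0             # index of the nearest map segment
    pos_heading_residual: int = 0   # quantized offset from the anchor

def build_step_tokens(light_token, new_agents, motion_tokens):
    """Flatten one simulation step into its ordered token sequence:
    traffic-light token, then agent-state groups, then motion tokens."""
    seq = [light_token]
    for a in new_agents:  # agents may be injected at any timestep
        seq += [a.start, a.agent_type, a.map_anchor, a.pos_heading_residual]
    seq += motion_tokens  # one motion token per active agent
    return seq

step = build_step_tokens("<GREEN>",
                         [AgentStateGroup(agent_type=1, map_anchor=7)],
                         motion_tokens=[42, 17])
# step == ['<GREEN>', '<SOA>', 1, 7, 0, 42, 17]
```

Because agent-state groups can appear in any step's sequence, injection or removal of agents requires no change to the decoding procedure, only to which groups the model emits.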
3. Evaluation, Performance Metrics, and Downstream Utility
InfGen architectures are evaluated both as generative models (distributional closeness to real-world scenarios) and as scenario generators for downstream reinforcement learning.
Simulation Metrics
| Metric | Purpose/Definition | InfGen Finding |
|---|---|---|
| MMD (Maximum Mean Discrepancy) | Initial-state (map, agent) distribution realism | Matches or exceeds prior art |
| ADE/FDE | Average/final displacement prediction error | State-of-the-art on short-term prediction |
| Placement Counts/Distances | Realism of agent insertions/removals in rollouts | Superior long-term stability |
| Agent Count Error (ACE) | Divergence of agent population over time | Lower error, stable population |
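The displacement metrics in the table are standard and easy to state precisely: ADE is the mean per-step Euclidean distance between predicted and ground-truth trajectories, and FDE is the distance at the final step.

```python
import numpy as np

def ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Average and final displacement error between a predicted and a
    ground-truth trajectory, each of shape (T, 2)."""
    d = np.linalg.norm(pred - gt, axis=-1)  # per-step Euclidean distance
    return float(d.mean()), float(d[-1])

gt   = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 0.5], [2.0, 1.0], [3.0, 2.0]])
ade, fde = ade_fde(pred, gt)
# ade = (0 + 0.5 + 1.0 + 2.0) / 4 = 0.875, fde = 2.0
```

In multi-agent evaluation these are typically averaged over agents and, for multi-modal predictors, taken as the minimum over sampled futures (minADE/minFDE).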
Policies trained via RL on InfGen-generated scenarios show improved robustness, lower collision rates, and better generalization, both on held-out test sets and under more adversarial or diverse conditions, than policies trained only on fixed or replay-based scenario generators.
4. Mathematical Formulation
The generation process is mathematically defined by the standard autoregressive factorization over the token sequence $x_{1:T}$:

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{<t}),$$

with the model parameterizing each conditional as a categorical distribution over the token vocabulary,

$$p_\theta(x_t \mid x_{<t}) = \mathrm{softmax}\big(f_\theta(x_{<t})\big),$$

where agent states are tokenized as an ordered tuple (start-of-agent marker, agent type, map anchor, and quantized position/heading residual),

$$s^i_t = \big(\texttt{<SOA>},\ \tau^i,\ a^i_t,\ \Delta p^i_t\big),$$

and motion tokens are mapped back to physical updates, e.g., via kinematic bicycle-model equations:

$$x_{t+1} = x_t + v_t \cos\psi_t\,\Delta t,\qquad y_{t+1} = y_t + v_t \sin\psi_t\,\Delta t,\qquad \psi_{t+1} = \psi_t + \frac{v_t}{L}\tan\delta_t\,\Delta t,\qquad v_{t+1} = v_t + a_t\,\Delta t.$$
Token group attention and causal masking in the transformer ensure valid information flow and temporal order enforcement.
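The bicycle-model decoding step can be sketched directly: a decoded (acceleration, steering) command updates the physical state. Parameter values (timestep, wheelbase) are illustrative defaults, not the paper's.

```python
import math

def bicycle_step(x, y, psi, v, accel, steer, dt=0.1, wheelbase=2.8):
    """Kinematic bicycle-model update: turn a decoded (accel, steer)
    motion command back into a physical state (x, y, heading, speed)."""
    x   += v * math.cos(psi) * dt
    y   += v * math.sin(psi) * dt
    psi += (v / wheelbase) * math.tan(steer) * dt
    v   += accel * dt
    return x, y, psi, v

state = (0.0, 0.0, 0.0, 10.0)  # x, y, heading, speed
state = bicycle_step(*state, accel=1.0, steer=0.0)
# straight-line motion: x advances by v*dt = 1.0 m; speed rises to 10.1 m/s
```

Decoding through a vehicle model like this keeps generated trajectories kinematically feasible even though the model's outputs are discrete tokens.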
5. Scenario Diversity, Realism, and "Infinite" Generation
By decoupling agent initialization, allowing dynamic agent injection/removal, and modeling full closed-loop scene evolution as a token sequence, InfGen supports scenarios of arbitrary density, length, and complexity. This "infinite" property holds in the sense that new agents and routes can be injected at any point, and scene evolution is not limited by the length or composition of training data.
Comparison Table: Traditional vs. InfGen Traffic Simulation
| Property | Traditional Log-Replay | Fixed-Agent Data-Driven | InfGen (Token-based) |
|---|---|---|---|
| Agent Set | Fixed | Fixed (at init) | Dynamic / unbounded |
| Scenario | Pre-defined | Static | Autoregressive, evolving |
| Agent Insertion | Log only | Not supported | Arbitrary, on-the-fly |
| Training | Replay/supervised | Supervised | Supervised/autoregressive |
| RL Utility | Poor generalization | Mixed | Superior generalization |
6. Implementation Nuances and Future Directions
InfGen implementations highlight several challenges and active areas:
- Long-Horizon Scalability: As token sequence length and agent count grow, self-attention computation and memory scale quadratically with sequence length. Efficient transformer architectures and dynamic memory management are suggested directions.
- Closed-Loop Stability: Compounding errors in long rollouts motivate techniques such as closed-loop fine-tuning and, potentially, reinforcement learning for joint optimization of generative and policy models.
- Map and Multi-Modal Integration: Current work uses static map tokens; future work may explore on-the-fly procedural map extensions and multimodal conditioning (e.g., sensor, natural language).
- Token Granularity: Discretization levels (e.g., grid vs. continuous positions, coarse vs. fine heading) can be further optimized for performance and fidelity.
7. Significance and Impact
InfGen's paradigm for scenario and image generation enables highly scalable, dynamic, and contextually adaptive simulation and synthesis. In intelligent systems, its ability to produce open-ended, interactive environments supports robust reinforcement learning and planning for autonomous vehicles and agents. In generative modeling, its resolution-agnostic formulation for images (cf. (Han et al., 12 Sep 2025)) allows efficient synthesis at resolutions previously beyond reach for diffusion models, with significant computational advantages.
A plausible implication is that InfGen-style approaches will become the foundation for unified, scalable environment generators—potentially extending beyond autonomous driving to domains including gaming, synthetic data generation, and interactive AI benchmarking.