CityX: Controllable Procedural Urban Generation
- The paper demonstrates a unified framework leveraging multi-agent LLM orchestration and a universal Blender plugin protocol to generate detailed 3D urban scenes.
- It achieves high executability (94%) and success (83%) rates, ensuring visual fidelity and strict user-driven constraint adherence in city synthesis.
- CityX integrates multi-modal inputs such as OSM data, semantic maps, and satellite imagery to produce simulation-ready, semantically rich urban environments.
CityX: Controllable Procedural Content Generation
CityX is a unified framework for controllable procedural content generation (PCG) targeting unbounded, high-fidelity, and semantically rich 3D urban environments. Designed to meet the needs of simulation-ready cities—serving research in embodied intelligence, robotics, planning, and large-scale simulation—CityX integrates a multi-agent orchestration of LLMs with a universal PCG plugin protocol, delivered over a robust management layer that can ingest multi-modal instructions (e.g., OSM, semantic maps, satellite images) and produce visually and structurally rational cityscapes with precise user and task-driven constraints (Zhang et al., 24 Jul 2024).
1. High-Level System Architecture
CityX centers on two architectural pillars: (a) a universal PCG Management Protocol and (b) a multi-agent orchestration framework tightly coupled to Blender via Python APIs. The PCG Management Protocol encapsulates an extensible suite of Blender plugins as “action functions,” each described by a five-field signature—name, description, input, limitation, and executable function. At runtime, these encapsulations are indexed in a registry (typically JSON/YAML) and loaded into an addressable API table, supporting uniform invocation and dynamic data–format conversion for inter-plugin compatibility.
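A minimal sketch of what such an encapsulation and registry could look like in Python (the ActionFunction dataclass, register_action helper, and JSON layout are assumptions for illustration, not CityX's published interfaces):

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ActionFunction:
    """Five-field encapsulation of one Blender plugin capability."""
    name: str
    description: str
    input: dict        # expected argument names/types
    limitation: str    # preconditions and known constraints
    run: Callable      # the executable Blender-side function

REGISTRY: Dict[str, ActionFunction] = {}   # the addressable API table

def register_action(spec_path: str, fn: Callable) -> None:
    """Load a plugin's JSON spec and expose it for uniform invocation."""
    with open(spec_path) as f:
        spec = json.load(f)
    REGISTRY[spec["name"]] = ActionFunction(
        run=fn,
        **{k: spec[k] for k in ("name", "description", "input", "limitation")},
    )

def invoke(name: str, **kwargs):
    """Uniform invocation path used by the Executor."""
    return REGISTRY[name].run(**kwargs)
```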
The orchestration layer is realized by four autonomous agents operating over a shared message-pool:
- Annotator: Extracts I/O signatures and applies semantic tags to all action functions.
- Planner: Consumes user intent and labeled actions, generating an action sequence (workflow) that synthesizes the city in a stepwise manner.
- Executor: Resides in Blender, invoking encapsulated Python actions with composed arguments, maintaining scene state transitions Sₜ → Sₜ₊₁.
- Evaluator: Renders the current Blender view and employs GPT-4V to compare the intermediate output against the sub-task goals, providing iterative visual feedback for plan refinement (Zhang et al., 24 Jul 2024).
The typical workflow ingests multi-modal inputs, registers actions, plans the execution sequence, and iteratively synthesizes and evaluates the scene until all user-driven and structural constraints are satisfied.
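A minimal sketch of this loop under assumed agent interfaces (the annotator/planner/executor/evaluator objects and their methods are hypothetical, not CityX's published API):

```python
from collections import deque

def synthesize_city(instruction, inputs, agents, registry):
    """Illustrative Planner -> Executor -> Evaluator loop with critique-driven replanning."""
    annotations = agents.annotator.label(registry)        # Annotator: tag action functions
    queue = deque(agents.planner.plan(instruction, inputs, annotations))
    state = None
    while queue:
        step = queue.popleft()
        state = agents.executor.run(step)                 # Executor: S_t -> S_{t+1} in Blender
        ok, critique = agents.evaluator.judge(state.render(), step.goal)  # Evaluator: GPT-4V check
        if not ok:
            # Planner consumes the critique and injects corrective steps at the front of the queue.
            queue.extendleft(reversed(agents.planner.refine(step, critique)))
    return state
```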
2. PCG Plugin Protocol and Urban Assembly
Each Blender-side plugin is converted into an action function Sᵢ with fields {name, description, input, limitation, run}, supporting strict parameterization and API introspection. These functions are registered at load-time, and dynamic data–format conversion is performed as needed using a curated library of conversion primitives (e.g., Point_to_face_conversion, Line_to_face_conversion, Cube_generation, Asset_placement). The registry enables seamless inter-operation of plugins that expect, e.g., differing mesh or geometry representations.
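A minimal sketch of the dynamic data-format conversion step, with placeholder functions standing in for the named conversion primitives (the dispatch-table design is an assumption for illustration):

```python
def point_to_face_conversion(points):
    """Placeholder: triangulate a point set into mesh faces (real logic runs in Blender)."""
    raise NotImplementedError

def line_to_face_conversion(lines):
    """Placeholder: extrude/loft line geometry into faces."""
    raise NotImplementedError

# Dispatch table keyed by (source representation, target representation).
CONVERTERS = {
    ("point", "face"): point_to_face_conversion,
    ("line", "face"): line_to_face_conversion,
}

def coerce(value, src: str, dst: str):
    """Bridge mismatched plugin I/O types before the Executor invokes the next action."""
    if src == dst:
        return value
    try:
        return CONVERTERS[(src, dst)](value)
    except KeyError:
        raise TypeError(f"no conversion primitive registered for {src} -> {dst}")
```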
CityX’s urban assembly is governed by the following empirical procedures:
- OSM XML files are parsed to surface parcel faces and road networks (G = (V, E)).
- Polygonal city blocks are generated from these faces.
- Building assets are scattered at block centroids or along parcel edges, with density typically determined by user-driven constraints.
- CLIP-embedded matching enables semantic asset retrieval for block faces based on textual or visual descriptions. For each block, user or context descriptions are CLIP-encoded and the top-matching assets are sampled from the database (Zhang et al., 24 Jul 2024).
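The CLIP-matching step can be sketched as follows; the checkpoint, asset names, and database layout are illustrative, since the paper does not publish its exact retrieval code:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Any CLIP checkpoint works for this sketch; CityX's actual encoder is not specified here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(texts):
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def retrieve_assets(block_description, asset_names, asset_embeddings, k=3):
    """Rank database assets for one city block by CLIP similarity to its description."""
    query = embed_text([block_description])             # (1, d)
    scores = (query @ asset_embeddings.T).squeeze(0)     # cosine similarity (embeddings are unit-norm)
    top = scores.topk(k).indices.tolist()
    return [asset_names[i] for i in top]

# Usage: embed the asset library once, then query per block description.
assets = ["glass office tower", "brick residential mid-rise", "industrial warehouse"]
asset_emb = embed_text(assets)
print(retrieve_assets("dense commercial downtown block", assets, asset_emb, k=2))
```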
Road-network triangulation combines noise-driven subdivisions and global shortest-path solvers to guarantee connected urban layouts.
3. Multi-Modal Instruction Translation and Program Synthesis
CityX is designed as an instruction-to-program translation engine operating over multi-modal and semantically structured user input. Supported modalities include natural language, OSM geometry, semantic segmentation maps, and overhead satellite imagery. A multi-modal preprocessor converts these to appropriate geometric or labeled intermediate representations. For example, semantic maps are raster-to-point-cloud transformed, and OSM files yield road graphs and parcel delineations.
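A minimal sketch of the OSM-to-road-graph part of this preprocessing; the function name and the restriction to highway-tagged ways are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

def osm_to_road_graph(osm_path):
    """Parse an OSM XML export into a road graph G = (V, E).

    Only nodes and highway-tagged ways are considered here; real preprocessing
    would also extract parcel faces and building footprints.
    """
    root = ET.parse(osm_path).getroot()
    V = {n.attrib["id"]: (float(n.attrib["lat"]), float(n.attrib["lon"]))
         for n in root.findall("node")}
    E = []
    for way in root.findall("way"):
        tags = {t.attrib["k"]: t.attrib["v"] for t in way.findall("tag")}
        if "highway" not in tags:          # keep only road segments
            continue
        refs = [nd.attrib["ref"] for nd in way.findall("nd")]
        E.extend(zip(refs, refs[1:]))      # consecutive node references form edges
    return V, E
```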
The Planner implements a function mapping instruction I to an executable action workflow, where each action is parameterized and inserted based on the semantic labels from the Annotator and subject to constraint satisfaction checks. Each action’s execution is tracked, and failures or visual mismatches (as judged by the Evaluator using GPT-4V) prompt replanning, correction, or dynamic data–format conversions as required (Zhang et al., 24 Jul 2024).
4. Multi-Agent Negotiation, Optimization, and Visual Feedback
Agents interact via a persistent message-pool (in-memory or Redis-based), posting outputs (action signatures, plan steps, state summaries, critiques). The Planner’s local objective is logical coherence relative to user intent, measured as the likelihood p(W | I) of the generated workflow W given the instruction I; the Executor aims to maximize the executability rate; the Evaluator minimizes perceptual losses between rendered images and subtask objectives.
At each plan step, the Evaluator renders the active Blender viewport and queries GPT-4V with both the current image and corresponding subtask goal. If the output is unsatisfactory, a structured critique is posted back, and the Planner may backtrack or inject additional corrective steps or parametric adjustments. This tightly-coupled, human-in-the-loop feedback scheme is a crucial differentiator for CityX’s ability to achieve both high fidelity and constraint adherence (Zhang et al., 24 Jul 2024).
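A minimal sketch of this Evaluator step from inside Blender; the query_vision_llm callable and the prompt format are placeholders, since CityX's actual GPT-4V prompting is not reproduced here:

```python
import bpy  # available inside Blender's bundled Python interpreter

def render_viewport(out_path="/tmp/cityx_step.png"):
    """Render the current scene so the Evaluator can inspect it."""
    bpy.context.scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)
    return out_path

def evaluate_step(subtask_goal, query_vision_llm):
    """Ask a vision-language model (e.g., GPT-4V) whether the render meets the sub-task goal.

    `query_vision_llm(image_path, prompt) -> str` is a placeholder for whatever
    multimodal client the deployment uses.
    """
    image = render_viewport()
    prompt = (f"Does this render satisfy the sub-task '{subtask_goal}'? "
              "Answer PASS or FAIL, then give a short critique.")
    reply = query_vision_llm(image, prompt)
    return reply.strip().upper().startswith("PASS"), reply
```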
5. Fine-Grained Controllability and Constraint Satisfaction
All user- and task-driven constraints (zoning, density, typology, stylistic policies, spacing) are first-class citizens and parsed out of the initial instruction. The control variables are enforced such that, for each generated city C, every constraint cₖ(C) must be satisfied, where cₖ represents either differentiable constraints (e.g., density matches) or discrete logical requirements (e.g., style tags w.r.t. zone). When constraints are violated, corrective actions (asset removal, rescaling) are automatically interleaved in the workflow. This approach supports continuous and discrete constraint types, enabling both implicit (parametric) and explicit (action-based) control over city synthesis outcomes (Zhang et al., 24 Jul 2024).
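A compact sketch of how such checks could be expressed; the constraint objects (with kind, measure, predicate, and repair_action fields) are hypothetical, not CityX's internal representation:

```python
def check_constraints(city, constraints, tolerance=0.05):
    """Evaluate each parsed constraint c_k against the generated city C."""
    violations = []
    for c in constraints:
        if c.kind == "continuous":
            # e.g., block density within a relative tolerance of the target
            ok = abs(c.measure(city) - c.target) <= tolerance * c.target
        else:
            # discrete / logical requirement, e.g., style tag matches the zone
            ok = c.predicate(city)
        if not ok:
            violations.append(c)
    return violations

def corrective_steps(violations):
    """Map violated constraints to workflow actions (asset removal, rescaling, ...)."""
    return [v.repair_action for v in violations if v.repair_action is not None]
```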
6. Quantitative Evaluation and Results
CityX is benchmarked using:
- Executability Rate (ER@1): Fraction of action steps that successfully execute in Blender.
- Success Rate (SR@1): Fraction of executed steps approved by the Evaluator (visual criteria).
- Aesthetic and Rationality Scores: 1–5 Likert scales from human evaluators.
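Under the assumption that each plan step logs an executed flag (ran without error in Blender) and an approved flag (passed the Evaluator's visual check), the two rates reduce to:

```python
def compute_rates(step_results):
    """ER@1 = executed steps / all steps; SR@1 = Evaluator-approved steps / executed steps."""
    total = len(step_results)
    executed = [r for r in step_results if r.executed]
    approved = [r for r in executed if r.approved]
    er = len(executed) / total if total else 0.0
    sr = len(approved) / len(executed) if executed else 0.0
    return er, sr
```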
Reported metrics for CityX (GPT-4 backend): ER@1=94%, SR@1=82.98%, Aesthetic=4.30, Rationality=4.35; the nearest baseline (SceneX) achieves ER@1≈78%, SR@1≈61% (Zhang et al., 24 Jul 2024). Qualitative results (Figures 1, 4, 5) establish CityX’s capability for both unbounded city synthesis (from OSM/satellite) and fine-grained, style-constrained editing.
A plausible implication of this performance and architecture is scalability to unbounded city extents and dynamic user-driven reparametrization, contingent on the completeness of the plugin library and the strength of the LLM planning and visual-evaluation loop.
7. Context, Comparative Systems, and Research Position
CityX’s architecture is situated at the intersection of:
- LLM-based zero-shot parameterization and Actor–Critic dual-agent approaches to PCG (Her et al., 11 Dec 2025).
- Modular PCG libraries (e.g., MCG for voxel maps), whose top–down pipeline and dual low/high-level scene representations inform CityX’s plugin protocol (Pyarelal et al., 2021).
- Agent-based procedural city modeling, employing patch-based behavioral simulation, multi-scale parametric constraints, and zoning policy, which underpins CityX's constraint architecture (Lechner et al., 25 Jul 2025).
- Adversarial RL paradigms with auxiliary control for trajectory-diversity and difficulty scaling; elements of these adversarial loops can be adapted for parametrization in city-scale PCG (Gisslén et al., 2021).
- Controllable PCG via behavior trees, enabling modular, interpretable, and parameter–propagating workflow compositions; CityX’s workflow-grammar composition is isomorphic to BT-based approaches (Sarkar et al., 2021).
- Two-grammar frameworks for semantically augmented crowd and city PCG, crucial for downstream embodied intelligence research (Rogla et al., 2018).
CityX consolidates these trends by offering a universal management protocol for plugin PCG within a multi-agent, LLM-driven architecture, incorporating visual feedback loops for iterative refinement and strict user-guided constraint satisfaction, thereby establishing a new benchmark for controllable, semantically rich, and simulation-ready urban PCG (Zhang et al., 24 Jul 2024).