AgentSGEN: Multi-Agent LLM in the Loop for Semantic Collaboration and GENeration of Synthetic Data (2505.13466v1)

Published 7 May 2025 in cs.AI

Abstract: The scarcity of data depicting dangerous situations presents a major obstacle to training AI systems for safety-critical applications, such as construction safety, where ethical and logistical barriers hinder real-world data collection. This creates an urgent need for an end-to-end framework to generate synthetic data that can bridge this gap. While existing methods can produce synthetic scenes, they often lack the semantic depth required for scene simulations, limiting their effectiveness. To address this, we propose a novel multi-agent framework that employs an iterative, in-the-loop collaboration between two agents: an Evaluator Agent, acting as an LLM-based judge to enforce semantic consistency and safety-specific constraints, and an Editor Agent, which generates and refines scenes based on this guidance. Powered by LLMs' reasoning capabilities and common-sense knowledge, this collaborative design produces synthetic images tailored to safety-critical scenarios. Our experiments suggest this design can generate useful scenes based on realistic specifications that address the shortcomings of prior approaches, balancing safety requirements with visual semantics. This iterative process holds promise for delivering robust, aesthetically sound simulations, offering a potential solution to the data scarcity challenge in multimedia safety applications.

Summary

  • The paper introduces a dual-agent LLM framework that decouples semantic planning from low-level scene editing to generate safety-aligned synthetic 3D scenes.
  • It leverages context enrichment to transform natural language goals into structured constraints, ensuring precise collision checking and scene realism.
  • Both human and automatic evaluations show that AgentSGEN outperforms baseline methods in semantic fidelity and safety compliance.

AgentSGEN: Multi-Agent LLM-in-the-Loop for Semantic 3D Scene Generation and Synthetic Data

Motivation and Problem Setting

The scarcity of real-world data for safety-critical scenarios, such as blocked emergency exits in indoor environments, presents a significant challenge for training and evaluating AI systems in domains like construction safety. Ethical and logistical constraints preclude the collection of such data at scale, necessitating robust synthetic data generation frameworks. Existing 3D scene generation methods, including manual design, procedural algorithms (e.g., Infinigen), and LLM-guided pipelines (e.g., Holodeck), either lack semantic controllability or fine-grained editability, limiting their utility for simulating nuanced safety violations (Figure 1).

Figure 1: Comparison of 3D scene generation methods.

AgentSGEN addresses these limitations by introducing a cognitively inspired, multi-agent system that leverages LLMs for iterative, semantically controlled scene editing. The framework is designed to generate synthetic 3D scenes that are both visually plausible and precisely aligned with user-specified safety constraints, supporting downstream applications in computer vision and robotics.

System Architecture and Methodology

AgentSGEN operationalizes the Dual Process Theory (DPT) by decoupling high-level semantic reasoning from low-level scene manipulation. The architecture comprises two specialized agents:

  • Evaluator Agent (System 2): Implements deliberate, reasoning-intensive planning, formulating action sequences to satisfy semantic and safety constraints.
  • Editor Agent (System 1): Executes atomic scene modifications with low latency, focusing on precise object placement and manipulation.

The workflow is structured into three stages: context enrichment, semantic planning, and interactive scene editing (Figure 2).

Figure 2: Initial scene from Holodeck with corresponding scene graph and SGRender views.
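
At a high level, the workflow can be expressed as a plan-then-execute loop. The following Python sketch is purely illustrative: the `Action` and `ContextBundle` structures and the callable names are assumptions for exposition, not the paper's published implementation.

```python
from dataclasses import dataclass

@dataclass
class Action:
    obj_id: str    # target object in the scene graph
    op: str        # "move" | "rotate" | "delete"
    params: dict   # e.g. {"position": [x, y, z]} or {"yaw_degrees": 90}

@dataclass
class ContextBundle:
    scene_graph: dict   # symbolic scene graph: object id -> pose/bounds
    constraints: list   # structured constraints from context enrichment
    renders: list       # 2D/3D projections (omitted in this sketch)

def run_loop(scene_graph, constraints, plan_fn, apply_fn, satisfied_fn, max_iters=10):
    """plan_fn      : Evaluator (System 2), ContextBundle -> list[Action]
    apply_fn     : Editor (System 1), applies one atomic Action to the graph
    satisfied_fn : returns True once every constraint holds."""
    for _ in range(max_iters):
        bundle = ContextBundle(scene_graph, constraints, renders=[])
        for action in plan_fn(bundle):       # deliberate, global planning
            apply_fn(scene_graph, action)    # fast, local execution
        if satisfied_fn(scene_graph, constraints):
            break
    return scene_graph
```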

Context Enrichment and Constraint Generation

Given a high-level natural language goal (e.g., "a bedroom with doors blocked by large objects"), a lightweight LLM generates a structured set of constraints, including collision, spatial, safety, and goal-specific requirements. The initial context bundle comprises the symbolic scene graph, visual projections (2D/3D), and the constraint set, serving as input to both agents (Figure 3).

Figure 3: Multi-modal context components passed to both agents. The context bundle consists of (i) goal-specific constraint requirements, (ii) symbolic scene graph, and (iii) 2D/3D renderings from SGRender.
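
The paper does not publish its constraint schema, but for the blocked-door goal a plausible structured output from the enrichment step might look like the following; every field name here is a hypothetical illustration.

```python
# Illustrative only: a structured constraint set that the lightweight LLM
# might emit for "a bedroom with doors blocked by large objects".
# All field names are hypothetical, not the paper's schema.
constraints = {
    "goal": [
        {"type": "blocked", "target": "door_0", "by": "large_object"},
    ],
    "collision": [
        {"type": "no_overlap", "scope": "all_objects", "exceptions": ["door_0"]},
    ],
    "spatial": [
        {"type": "inside_room_bounds", "scope": "all_objects"},
        {"type": "on_floor", "scope": "furniture"},
    ],
    "safety": [
        {"type": "keep_clear", "target": "window_0"},
    ],
}
```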

Semantic Planning (Evaluator Agent)

The Evaluator Agent, instantiated with a high-capacity LLM and chain-of-thought prompting, ingests the context bundle and produces an ordered action plan. Each action specifies an object, an operation (move, rotate, delete), and parameters. The agent performs constraint satisfaction, hypothetical reasoning (e.g., collision forecasting), and global optimization for semantic alignment. The plan is then passed to the Editor Agent.
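
For the same goal, an ordered action plan could be serialized as follows; the format is a hypothetical illustration consistent with the described action space (object, operation, parameters), not the paper's exact serialization.

```python
# Hypothetical Evaluator output: an ordered plan whose steps each name an
# object, an operation (move / rotate / delete), and parameters.
plan = [
    {"obj_id": "floor_lamp_0", "op": "delete", "params": {}},  # clear space near the door
    {"obj_id": "wardrobe_1", "op": "move",
     "params": {"position": [2.1, 0.0, 0.4]}},                 # slide wardrobe toward door_0
    {"obj_id": "wardrobe_1", "op": "rotate",
     "params": {"yaw_degrees": 90}},                           # turn it flush against the door
]
```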

Interactive Scene Editing (Editor Agent)

The Editor Agent, implemented with a non-reasoning LLM, executes the Evaluator's plan stepwise, updating the scene graph and visualizations after each action. The agent operates under strict constraints, ensuring compliance with collision and spatial rules. The separation of planning and execution enables both semantic depth and operational efficiency (Figure 4).

Figure 4: AgentSGEN architecture showcasing semantic planning by the Evaluator Agent and iterative scene editing by the Editor Agent.
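
Stepwise execution reduces to applying one atomic operation at a time and re-rendering. A minimal sketch, assuming a dict-based scene graph keyed by object id (the paper's internal representation is not specified):

```python
def apply_action(scene_graph: dict, action: dict) -> dict:
    """Editor-side execution of one atomic action. Hypothetical representation:
    scene_graph maps object id -> {"position": [x, y, z], "yaw": degrees, ...}."""
    if action["op"] == "delete":
        scene_graph.pop(action["obj_id"])
        return scene_graph
    obj = scene_graph[action["obj_id"]]
    if action["op"] == "move":
        obj["position"] = action["params"]["position"]
    elif action["op"] == "rotate":
        obj["yaw"] = (obj.get("yaw", 0) + action["params"]["yaw_degrees"]) % 360
    else:
        raise ValueError(f"unsupported op: {action['op']!r}")
    return scene_graph
```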

Scene Graph as Interactive Environment

The scene graph functions as both a symbolic and interactive environment, supporting AABB-based collision checking. The system can toggle between collision-aware and collision-disabled modes, allowing controlled violations of physical realism when required by the goal (e.g., intentionally blocking a door). Feedback is provided to the agents for reactive correction.
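
AABB (axis-aligned bounding box) collision checking reduces to interval-overlap tests: two boxes intersect exactly when their extents overlap on all three axes simultaneously. A standard implementation, independent of the paper's codebase:

```python
def aabb_overlap(a_min, a_max, b_min, b_max) -> bool:
    """True iff two axis-aligned bounding boxes intersect. Each argument is an
    (x, y, z) corner; boxes overlap exactly when their intervals overlap on
    every axis simultaneously."""
    return all(a_min[i] < b_max[i] and b_min[i] < a_max[i] for i in range(3))

# Hypothetical example: a wardrobe pushed flush against a door.
wardrobe = ((1.8, 0.0, 0.0), (2.8, 2.0, 0.6))   # (min corner, max corner)
door     = ((2.5, 0.0, 0.0), (3.4, 2.1, 0.1))
assert aabb_overlap(*wardrobe, *door)  # collision-disabled mode would permit this
```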

Final Rendering and Dataset Generation

Upon completion and validation of all constraints, the finalized scene graph is rendered using a high-fidelity engine (e.g., AI2-THOR), producing RGB images, segmentation masks, depth maps, and object-level metadata. This enables the generation of large-scale, annotated synthetic datasets tailored to safety-critical scenarios (Figure 5).

Figure 5: Final rendering and synthetic dataset generation using AI2-THOR. Output includes RGB, depth, segmentation, and annotated metadata.
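
A minimal sketch of capturing these modalities with the ai2thor Python package follows. The controller flags and event attributes are part of AI2-THOR's public API (exact names can vary by version), and the scene name is a placeholder rather than an actual AgentSGEN-edited scene:

```python
# Sketch of exporting the listed modalities with AI2-THOR (pip install ai2thor).
from ai2thor.controller import Controller

controller = Controller(
    scene="FloorPlan1",                  # placeholder; AgentSGEN loads its edited scene
    renderDepthImage=True,
    renderInstanceSegmentation=True,
    width=640, height=480,
)
event = controller.step(action="Pass")   # no-op step to capture the current frame

rgb = event.frame                         # (H, W, 3) uint8 RGB image
depth = event.depth_frame                 # (H, W) float32 depth map in meters
seg = event.instance_segmentation_frame   # (H, W, 3) per-instance color mask
objects = event.metadata["objects"]       # object-level metadata (ids, poses, boxes)
controller.stop()
```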

Evaluation and Results

AgentSGEN was evaluated on 53 indoor scene types from the MIT Indoor Scenes dataset, with the primary goal of generating rooms where doors are blocked by large objects. The evaluation compared three configurations: Holodeck baseline, AgentSGEN with collision checking, and AgentSGEN without collision checking.

Human Evaluation

Two tasks were conducted: binary task completion (preference for goal satisfaction) and goal-oriented Likert-scale scoring (effectiveness, arrangement, scale appropriateness). Inter-annotator agreement (Cohen’s kappa = 0.406) indicated moderate reliability.
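
Cohen's kappa corrects raw agreement for agreement expected by chance. With two annotators' binary preference labels it can be computed directly with scikit-learn; the label vectors below are invented for illustration, not the study's data:

```python
# Illustrative only: Cohen's kappa for two annotators' binary preferences.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical labels
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 = perfect, 0.0 = chance-level
```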

  • Binary Task Completion: The collision-aware AgentSGEN was preferred in 38 out of 53 trials, while Holodeck was selected only 6 times, demonstrating a strong advantage in goal satisfaction (Figure 6).

    Figure 6: Binary task completion for both the human evaluation and automatic evaluation setups.

    Figure 7: Confusion matrix of binary preference judgments under two collision settings. With collision checking enabled, Cohen's kappa of 0.406 indicates moderate agreement, while disabling collision checking reduces agreement to κ = 0.151, suggesting that physical plausibility improves annotator consistency.

  • Likert-Scale Scores: AgentSGEN with collision checking achieved the highest average scores (>4.5 on the 7-point scale) across all semantic criteria, while Holodeck consistently scored near the minimum (Figure 8).

    Figure 8: Average Likert scores (1–7 scale) for three goal-oriented evaluation questions. Our collision-aware method outperforms both the collision-disabled version and the Holodeck baseline across all metrics.

Automatic LLM-Based Evaluation

GPT-4.1 and Gemini 2.5 Pro were used as judges in discrete choice experiments; both consistently preferred AgentSGEN-edited scenes, especially with collision checking enabled (38/53 for GPT-4.1). However, the LLMs showed a tendency to favor Holodeck on arrangement and scale, likely due to visual regularity biases (Figure 9).

Figure 9: Confusion matrix of LLM judgments under two collision settings.

Figure 10: DCE results from GPT-4.1 and Gemini for the three goal-oriented questions.
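
A discrete choice experiment (DCE) with an LLM judge amounts to a pairwise forced-choice prompt over two rendered scenes. The sketch below uses the openai package; the prompt wording, image handling, and file names are assumptions, not the paper's exact protocol:

```python
# Hypothetical pairwise forced-choice prompt for an LLM judge
# (not the paper's exact protocol; requires OPENAI_API_KEY).
import base64
from openai import OpenAI

def as_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                "Goal: a bedroom with doors blocked by large objects. "
                "Which scene better satisfies the goal? Answer only 'A' or 'B'.")},
            {"type": "image_url", "image_url": {"url": as_data_url("scene_a.png")}},
            {"type": "image_url", "image_url": {"url": as_data_url("scene_b.png")}},
        ],
    }],
)
print(resp.choices[0].message.content)  # expected: "A" or "B"
```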

Analysis

AgentSGEN demonstrates that embedding semantic reasoning and constraint enforcement within the generation loop yields scenes that are both functionally aligned with safety objectives and visually plausible. The dual-agent design enables fine-grained, context-aware editing, outperforming procedural and one-shot LLM-based methods in both human and LLM-based evaluations. The results highlight that semantic and perceptual objectives can be jointly optimized without a trade-off, provided that symbolic reasoning and reactive control are tightly integrated.

Limitations and Future Directions

While AgentSGEN achieves strong performance, the evaluation is limited by the number of human annotators and the scope of scene types. LLM-based evaluation, while promising, exhibits biases toward visual regularity and may not fully capture goal-conditioned semantics. Future work should explore scaling the framework to more complex environments, integrating richer physical simulation, and leveraging multi-modal LLMs for enhanced visual-semantic alignment. The modular architecture supports extension to other domains, such as robotics, urban planning, and healthcare.

Conclusion

AgentSGEN introduces a cognitively grounded, multi-agent LLM-in-the-loop framework for semantically controlled synthetic 3D scene generation. By decoupling high-level reasoning from low-level execution, the system achieves fine-grained spatial control and robust satisfaction of complex safety constraints, outperforming existing procedural and LLM-based baselines. The approach is validated through rigorous human and automatic evaluation, demonstrating its potential as a foundation for scalable, human-aligned synthetic data generation in safety-critical AI applications.

Knowledge Gaps

Below is a single, actionable list of the paper’s knowledge gaps, limitations, and open questions that future work could address:

  • No geometry-based, quantitative metric for “blocked exit” verification (e.g., door occlusion percentage, clearance thresholds, navigability/path existence via navmesh or shortest-path to exits).
  • Missing formal handling of constraint conflicts (e.g., intentionally blocking exits vs. maintaining safety rules): define hard/soft constraints, priorities, and explicit conflict-resolution policies with explainability.
  • Lack of ablation studies to justify the dual-agent design: compare against single-agent baselines, different role splits, or planner-only/executor-only configurations to quantify gains in accuracy, speed, and stability.
  • Limited baseline comparisons: evaluate against scenario languages (e.g., Scenic), alternative LLM-guided 3D pipelines, and controllable generation systems (ControlNet, GLIGEN, Infinigen editing) under identical goals.
  • Scalability and cost unreported: measure and report iteration counts, token usage, runtime per scene, throughput, failure rates, and convergence statistics across scene complexity levels.
  • Generalization beyond “blocked doors” is untested: extend to diverse safety-critical scenarios (trip hazards, fall protection, hazardous materials placement, equipment collisions, structural instability, poor lighting).
  • Physics realism is minimal (AABB only): incorporate rigid-body physics, gravity, friction, stability, contact constraints, and collision resolution to avoid visually plausible but physically impossible placements.
  • Egress and human movement validation absent: assess evacuation feasibility (e.g., egress time, path widths, bottlenecks) using agent-based simulation or code-compliant path-planning.
  • Dataset release details are unclear: specify size, splits, diversity across room types, asset licenses, metadata schema, and reproducibility artifacts; commit to public availability and versioning.
  • Reproducibility is under-specified: pin LLM models/versions, prompts, seeds, temperatures, and code; provide end-to-end scripts to reproduce scenes and evaluations.
  • Robustness to prompt variation and adversarial phrasing is not assessed: perform sensitivity analyses across paraphrases, ambiguous goals, and multi-objective prompts.
  • LLM dependency risks (hallucination, bias, non-determinism) are not mitigated: add guardrails (self-consistency checks, verifier modules, tool-augmented reasoning) and audit failure/edge cases.
  • Editor action space is inconsistently described: clarify and evaluate the need for operations beyond move/rotate/delete (e.g., add objects, scale, material changes) and their impact on task success.
  • 2D projection-centric reasoning may miss 3D constraints: quantify errors from using top-down views (stacking, occlusion, vertical clearances, multi-level spaces) and explore full 3D reasoning.
  • Automatic LLM evaluation is biased toward aesthetics: devise calibrated, goal-conditioned evaluators or non-LLM metrics; study prompt-order effects and reliability across models.
  • Realism and aesthetic quality lack objective metrics: include perceptual realism measures, structural regularity metrics, or larger-scale human preference tests with varied demographics.
  • No policy for when to toggle collision checking: learn or define criteria to switch modes based on goals, ensuring realism while achieving semantic constraints.
  • Engine and asset generalization is untested: validate across AI2-THOR, Habitat, Unity, Unreal, different asset libraries, and domain gaps (e.g., BIM-linked assets).
  • Downstream utility is not demonstrated: show that the synthetic data improves training of detectors/segmenters (e.g., blocked-exit detection) with mAP/F1/IoU on real-world benchmarks.
  • Constraints are not formally guaranteed: integrate CSP/SMT solvers or optimization backends with LLM planning to provably satisfy constraints; provide correctness checks and certificates.
  • Human evaluation scale and expertise are limited: recruit more annotators, include safety/code experts, report inter-rater reliability at scale, and analyze expertise effects on judgments.
  • Failure modes are not characterized: catalog cases where scenes appear blocked but remain traversable; analyze common error patterns and implement automatic detection/remediation.
  • Token/context scaling issues are unaddressed: propose compact scene graph encodings, hierarchical prompting, or retrieval to handle large-object scenes within LLM context limits.
  • No uncertainty or confidence reporting: add quantitative confidence scores for constraint satisfaction, plan validity, and execution outcomes; expose introspective signals for auditing.
  • Ethical and cultural realism considerations are missing: assess whether generated hazards could mislead or reinforce biases; ensure responsible release and usage guidelines.
  • Temporal dynamics are ignored: extend to time-varying hazards, multi-step scenario evolution, and evacuation simulations to evaluate dynamic safety violations.
  • Annotation fidelity is assumed: validate the correctness of segmentation/masks/depth/metadata against scene graphs; measure label noise and its impact on downstream models.
  • Alternative multi-agent collaboration modes are unexplored: compare cooperation, competition, role reassignment, and team optimization (e.g., DyLAN) for performance and robustness.
  • Visual feedback design is not evaluated: measure the impact of different render modalities (top-down, multi-view, overlays, trajectories) on agent performance and error reduction.
  • Termination criteria and convergence guarantees are unspecified: define stopping rules, maximum iterations, and theoretical/empirical convergence properties under varied constraints.
  • Domain-specific safety knowledge is not formalized: integrate building codes (e.g., NFPA, OSHA), represent them symbolically, and test compliance; assess cross-region code generalization.
  • Multi-objective composition is unexplored: develop mechanisms to satisfy multiple, potentially conflicting goals (e.g., blocked main exit but clear secondary exit, accessibility constraints) with tunable trade-offs.