- The paper presents an agentic pipeline that translates free-form natural language into executable MJCF robot simulation scenes, addressing scalability and automation challenges.
- It details a four-layer architecture—comprising orchestrator, asset forge, layout architect, and MJCF bridge—that ensures efficient, cache-first asset generation and reliable spatial layout.
- Production telemetry shows that full cache hits reduce typical end-to-end latency to 30–40 s, versus a median of about 50 s for five-object scenes with full cache misses, supporting the cache-first design for scalable robot learning simulation.
Motivation and Background
SR-Platform addresses a persistent impediment in robot learning: scalable generation of diverse, physically valid simulation environments from unconstrained natural language specifications. Traditional pipelines for MuJoCo scene creation demand proficiency in 3D asset modeling, mesh conversion, MJCF definition, spatial arrangement, collision validation, and model integration, which constrains environment diversity and limits policy robustness. Previous approaches to simulation environment generation (e.g., ThreeDWorld, BEHAVIOR-1K, RoboGen) are either manually intensive, bound to fixed object libraries, or stop short of producing executable physical scenes. The remaining gap is direct synthesis of MJCF-ready scenes from free-form language without manual intervention.
LLMs have shown utility in reward design, task planning, and code synthesis for robotics, but prior systems do not realize end-to-end environment generation; they rely instead on symbolic descriptions or fixed asset sets. SR-Platform implements agentic workflows that parse the prompt, retrieve or generate assets, validate them, and physically assemble robot simulation environments directly from natural language, aiming to make scene authoring accessible at scale and suitable for industrial deployment.
System Architecture
SR-Platform decomposes scene synthesis into a structured four-layer agentic pipeline:
- L1: Orchestrator: Parses the user’s unconstrained prompt into an auditable, structured scene plan (room geometry, object labels, robot selection, anomaly flags), using LLMs for semantic interpretation while enforcing a strict JSON schema for downstream consumption.
- L2: Asset Forge: Resolves each symbolic object into a simulator-compatible mesh via cache-first semantic retrieval (Qdrant vector DB) or, upon cache miss, LLM-to-CadQuery code synthesis. Generated geometry is validated, retried if malformed (observed 11.3% retry rate), stored (MinIO), and re-indexed for future retrieval, minimizing redundant LLM invocation.
- L3: Layout Architect: Assigns spatial poses using an LLM-driven layout model, enforcing industrial constraints (electrical clearance, egress, safety, ventilation) via domain-inspired rule checking. Violations yield structured diagnostics, allowing resampling, user notification, or controlled anomaly injection as required.
- L4: Bridge (MJCF Assembly): Deterministically assembles the complete MJCF scene, integrating generated/retrieved meshes, global settings, robot models, and spawn coordinates. Scene state is cached per user and rendered via MuJoCo WASM in-browser.
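The cache-first behaviour of the L2 Asset Forge can be illustrated with a minimal sketch. It is not the platform's implementation: `AssetCache` is a hypothetical in-memory stand-in for the Qdrant index, the `embed`, `generate`, and `validate` callables are placeholders for the embedding model, the LLM-to-CadQuery step, and geometry validation, and the similarity threshold, attempt limit, and fallback name are illustrative assumptions.

```python
import math
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class AssetCache:
    """Hypothetical in-memory stand-in for a vector index of stored meshes."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # assumed similarity cut-off
        self.entries: list[tuple[list[float], str]] = []

    def lookup(self, query: list[float]) -> Optional[str]:
        """Return the best-matching mesh if it clears the threshold."""
        best_score, best_mesh = 0.0, None
        for vec, mesh in self.entries:
            score = cosine(query, vec)
            if score > best_score:
                best_score, best_mesh = score, mesh
        return best_mesh if best_score >= self.threshold else None

    def add(self, vec: list[float], mesh: str) -> None:
        self.entries.append((vec, mesh))


def resolve_asset(
    label: str,
    embed: Callable[[str], list[float]],
    generate: Callable[[str, int], Optional[str]],
    validate: Callable[[Optional[str]], bool],
    cache: AssetCache,
    max_attempts: int = 3,
    fallback: str = "box_placeholder",
) -> tuple[str, str]:
    """Return (mesh, source), where source is 'cache', 'generated' or 'fallback'."""
    vec = embed(label)
    hit = cache.lookup(vec)
    if hit is not None:
        return hit, "cache"  # no generation call needed
    for attempt in range(max_attempts):
        mesh = generate(label, attempt)
        if validate(mesh):
            cache.add(vec, mesh)  # re-index so later requests hit the cache
            return mesh, "generated"
    return fallback, "fallback"  # simplified geometry; deliberately not cached
```

In this sketch only validated meshes are re-indexed, so a failed generation never pollutes later retrievals; whether the real system follows the same policy is not stated in the source.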
The deployment utilizes a horizontally scalable, nine-service Docker stack with real-time WebSocket progress streaming, asynchronous job queuing (ARQ), telemetry/observability via InfluxDB, and persistent metadata (PostgreSQL, Redis).
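To make the deterministic assembly step of the L4 Bridge concrete, the following sketch builds a small MJCF document from a list of placed objects using only the Python standard library. The function name `assemble_mjcf` and the input dictionary keys are hypothetical, and robot models and global settings are omitted; the element names (`mujoco`, `asset`, `mesh`, `worldbody`, `body`, `geom`) are standard MJCF.

```python
import xml.etree.ElementTree as ET


def assemble_mjcf(model_name: str, objects: list[dict]) -> str:
    """Build an MJCF string from objects with 'name', 'mesh_file' and 'pos' keys."""
    root = ET.Element("mujoco", {"model": model_name})
    asset = ET.SubElement(root, "asset")
    world = ET.SubElement(root, "worldbody")

    # Sort by name so the same scene plan always yields byte-identical XML.
    for obj in sorted(objects, key=lambda o: o["name"]):
        mesh_name = f"{obj['name']}_mesh"
        ET.SubElement(asset, "mesh", {"name": mesh_name, "file": obj["mesh_file"]})
        pos = " ".join(f"{v:g}" for v in obj["pos"])
        body = ET.SubElement(world, "body", {"name": obj["name"], "pos": pos})
        ET.SubElement(body, "geom", {"type": "mesh", "mesh": mesh_name})

    ET.indent(root)  # pretty-print; requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(assemble_mjcf("demo_scene", [
        {"name": "table", "mesh_file": "table.stl", "pos": (1.0, 0.5, 0.0)},
        {"name": "bin", "mesh_file": "bin.stl", "pos": (0.0, 0.0, 0.0)},
    ]))
```

Keeping this step free of LLM calls means any nondeterminism is confined to the earlier layers, and a stored scene plan can be re-assembled identically later.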
Experimental Evaluation and Numerical Results
Telemetry from 30 days of production operation (611 successful LLM calls) yields the following operational figures:
- End-to-end latency: Five-object scenes with full cache misses synthesize in median ~50 s; eight-object scenes require ~68 s due to batch-limited worker concurrency. Full cache hits reduce typical latency to 30–40 s, validating the cache-first architectural motivation.
- Per-stage latency: L2 asset generation (LLM-to-CAD) dominates runtime, with cache-miss median call latency at 17.9 s (p95 = 65.4 s). Retry calls exhibit elevated latency (median 24.4 s). L1 orchestration + L3 layout median latency is 15.9 s (p95 = 73.9 s).
- Asset reliability: LLM-generated CadQuery code requires a retry after the first attempt in 11.3% of cases; automatic recovery, with fallback to simplified geometry, keeps the pipeline running when generation fails.
- System throughput: Worker concurrency supports five simultaneous jobs, with up to 100 jobs queued. As the asset library (semantic cache) grows, throughput increases without further scaling.
- Mesh fidelity benchmarking: On standard_100 (precision mechanical CAD), best cost-performance routing is qwen3-coder-480b (score 83.1/100, latency 4.3 s, CD_med 0.16); for abstract_45 (free-form shapes), claude-opus-4-7 leads (score 85.1, latency 6.7 s, CD_med 6.32), with open-weight models competitive but at lower geometric fidelity.
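The throughput limits reported above (five simultaneous jobs, up to 100 queued) correspond to a bounded worker pool. The sketch below reproduces that behaviour with `asyncio` from the standard library as a stand-in; the platform itself uses ARQ, whose configuration is not shown in the source. The function `run_jobs` and its return values are hypothetical.

```python
import asyncio

MAX_CONCURRENT = 5   # simultaneous jobs, as reported in the text
MAX_QUEUED = 100     # queue capacity, as reported in the text


async def run_jobs(jobs, max_concurrent=MAX_CONCURRENT, max_queued=MAX_QUEUED):
    """Run zero-argument coroutine functions with bounded concurrency.

    Returns (results_by_index, rejected_count, peak_concurrency).
    """
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_queued)
    rejected = 0
    for index, job in enumerate(jobs):
        try:
            queue.put_nowait((index, job))
        except asyncio.QueueFull:
            rejected += 1  # queue is full: the submission is refused

    results: dict[int, object] = {}
    active = 0
    peak = 0

    async def worker() -> None:
        nonlocal active, peak
        while True:
            try:
                index, job = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to do
            active += 1
            peak = max(peak, active)
            try:
                results[index] = await job()
            finally:
                active -= 1

    # Exactly max_concurrent workers drain the queue in parallel.
    await asyncio.gather(*(worker() for _ in range(max_concurrent)))
    return results, rejected, peak


async def fake_scene_job() -> str:
    """Placeholder for one scene-generation job."""
    await asyncio.sleep(0.01)
    return "ok"


if __name__ == "__main__":
    done, refused, peak = asyncio.run(run_jobs([fake_scene_job] * 120))
    print(len(done), refused, peak)
```

With a fixed worker count, the eight-object latency penalty noted earlier is consistent with jobs waiting for a free worker rather than with slower individual calls, though the source does not break this down further.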
Core Functionalities and Features
SR-Platform integrates production-oriented, user-facing capabilities:
- Asset Studio: Interactive generation, preview, and library-based reuse of text-driven 3D assets; semantic retrieval bypasses expensive generation calls.
- Prompt Refinement and NL Editing: Multi-turn chat module for prompt underspecification resolution; supports object-level scene edits via NL commands, eliminating the need for direct MJCF manipulation.
- Robot Catalog and Merging: Supports insertion of canonical robot models (e.g., TurtleBot, UR5, Franka Panda) into synthesized environments, maintaining coordinate consistency and MuJoCo compatibility.
- Anomaly Injection and Dataset Export: Controlled constraint violations for robustness training; complete export (MJCF, meshes, textures) for downstream dataset pipelines.
- Authentication, Persistence, Observability: JWT auth, RBAC, per-user scene isolation, persistent job state, and comprehensive telemetry for LLM latency, retry behavior, and throughput.
Implications and Future Directions
The SR-Platform architecture demonstrates that agentic, modular pipelines can offer scalable, accessible environment synthesis, minimizing the prerequisite expertise in geometry, mesh conversion, and spatial modeling. The cache-first asset strategy yields compounding deployment advantages—each generated mesh enriches retrievability, accelerating subsequent generation and reducing LLM dependence.
The system's reliability is bounded by LLM-to-CAD generative error rates, necessitating robust validation and retry/fallback strategies; future versions will refine error taxonomy (syntax, geometry, scale, topology, semantics) and specialize recovery routes.
Industrial constraint checking before MJCF assembly improves physical plausibility and brings scenes closer to real-world deployment constraints, but it does not substitute for formal compliance checks (NEC/NFPA/ISO/ASHRAE). Constraint visibility is prioritized for dataset realism.
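A rule check of this kind, run on the layout before assembly and returning structured diagnostics, might look like the sketch below. It is illustrative only: the `Footprint` and `Violation` types, the `min_clearance` rule name, and the clearance value used in the example are assumptions, not taken from the paper or from any standard.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Axis-aligned floor footprint of an object (metres)."""
    name: str
    x: float  # centre x
    y: float  # centre y
    w: float  # width along x
    d: float  # depth along y


@dataclass(frozen=True)
class Violation:
    """Structured diagnostic for one failed rule."""
    rule: str
    objects: tuple
    required_m: float
    actual_m: float


def gap(a: Footprint, b: Footprint) -> float:
    """Shortest distance between two rectangles; 0 if they touch or overlap."""
    dx = max(abs(a.x - b.x) - (a.w + b.w) / 2, 0.0)
    dy = max(abs(a.y - b.y) - (a.d + b.d) / 2, 0.0)
    return math.hypot(dx, dy)


def check_clearance(items: list, min_clearance_m: float) -> list:
    """Return a Violation for every pair closer than min_clearance_m."""
    violations = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            g = gap(a, b)
            if g < min_clearance_m:
                violations.append(
                    Violation("min_clearance", (a.name, b.name),
                              min_clearance_m, round(g, 3))
                )
    return violations


if __name__ == "__main__":
    layout = [
        Footprint("cabinet", 0.0, 0.0, 1.0, 1.0),
        Footprint("bench", 1.5, 0.0, 1.0, 1.0),
        Footprint("rack", 0.0, 3.0, 1.0, 1.0),
    ]
    # 0.9 m is an illustrative threshold, not a value from any code or standard.
    for v in check_clearance(layout, 0.9):
        print(v)
```

Returning a list of diagnostics rather than a single pass/fail flag lets the caller choose between resampling the layout, notifying the user, or keeping the violation on purpose for anomaly injection.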
SR-Platform currently targets indoor, semi-structured manipulation environments; organic, deformable, and visually complex objects or large-scale environments remain challenging. Further integration with image-to-3D pipelines, high-fidelity backends (Newton, Isaac), and robot-policy generation workflows is planned, with direct export to RL dataset formats (LeRobot, RLDS) in scope.
Conclusion
SR-Platform (2605.14700) implements an agentic, modular pipeline for scalable synthesis of robot simulation environments from natural language. By separating orchestration, cache-aware asset generation, constraint-driven spatial layout, and deterministic MJCF assembly, it produces executable simulation scenes for robot learning workflows from unconstrained English prompts in roughly one minute (median ~50 s for five-object scenes and ~68 s for eight-object scenes on full cache misses). The reported telemetry and benchmarks support cache-first latency reduction, recoverable asset generation, and mesh fidelity across object classes. The infrastructure and operational telemetry support practical, industrial simulation dataset pipelines, laying the foundation for future work on synthetic environment and training data generation for embodied AI and robotic policy learning.