SR-Platform: An Agentic Pipeline for Natural Language-Driven Robot Simulation Environment Synthesis

Published 14 May 2026 in cs.RO | (2605.14700v1)

Abstract: Generating robot simulation environments remains a major bottleneck in simulation-based robot learning. Constructing a training-ready MuJoCo scene typically requires expertise in 3D asset modeling, MJCF specification, spatial layout, collision avoidance, and robot-model integration. We present SR-Platform, a production-deployed agentic system that converts free-form natural language descriptions into executable, physically valid MuJoCo environments. SR-Platform decomposes scene synthesis into four stages: an LLM-based orchestrator that converts user intent into a structured scene plan; an asset forge that retrieves cached assets or generates new 3D geometry through LLM-to-CadQuery synthesis; a layout architect that assigns object poses and verifies industrial constraints; and a bridge layer that assembles the final MJCF scene and merges the selected robot model. The system is deployed as a nine-service Docker stack with WebSocket progress streaming, MinIO-backed mesh storage, Qdrant-based semantic asset retrieval, Redis job state, and InfluxDB telemetry. Using 30 days of production telemetry covering 611 successful LLM calls, SR-Platform generates five-object scenes with a median end-to-end latency of approximately 50 s, while cache-accelerated scenes complete in approximately 30-40 s. The asset forge shows an 11.3% first-attempt retry rate with automatic recovery, and cached asset retrieval removes per-object LLM calls for previously generated object types. These results show that agentic scene synthesis can reduce the manual effort required to create diverse robot training environments, enabling users to produce executable MuJoCo scenes from plain English prompts in under one minute.


Authors (4)

Summary

The paper presents an agentic pipeline that translates free-form natural language into executable MJCF robot simulation scenes, removing the manual asset modeling and scene authoring that limit how many environments can be built.
It details a four-layer architecture (orchestrator, asset forge, layout architect, and MJCF bridge) built around cache-first asset generation and constraint-checked spatial layout.
Production telemetry shows a median end-to-end latency of about 50 s for five-object scenes, dropping to 30–40 s with cache hits, and an 11.3% first-attempt retry rate in asset generation that is recovered automatically.

SR-Platform: Agentic Natural Language Scene Synthesis for Robot Simulation

Motivation and Background

SR-Platform addresses a persistent impediment in robot learning: scalable generation of diverse, physically valid simulation environments from unconstrained natural language specifications. Traditional pipelines for MuJoCo scene creation demand proficiency in 3D asset modeling, mesh conversion, MJCF definition, spatial arrangement, collision validation, and model integration. That requirement constrains environment diversity and, in turn, limits policy robustness. Previous approaches to simulation environment generation (e.g., ThreeDWorld, BEHAVIOR-1K, RoboGen) are either manually intensive, bound to fixed object libraries, or stop short of producing executable physical scenes. The gap that remains is direct synthesis of MJCF-ready scenes from free-form language without manual intervention.

LLMs have shown utility in reward design, task planning, and code synthesis for robotics, but prior systems do not realize end-to-end environment generation and instead rely on symbolic descriptions or fixed asset sets. SR-Platform runs an agentic workflow that parses the prompt, retrieves or generates assets, validates them, and physically assembles the robot simulation environment directly from natural language. The aim is to make scene authoring accessible at scale and suitable for industrial deployment.

System Architecture

SR-Platform decomposes scene synthesis into a structured four-layer agentic pipeline:

L1: Orchestrator: Parses the user’s unconstrained prompt into an auditable, structured scene plan (room geometry, object labels, robot selection, anomaly flags). It uses LLMs for semantic interpretation but enforces a strict JSON schema on the output so that downstream layers can consume it reliably.
L2: Asset Forge: Resolves each symbolic object into a simulator-compatible mesh via cache-first semantic retrieval (Qdrant vector DB) or, upon cache miss, LLM-to-CadQuery code synthesis. Generated geometry is validated, retried if malformed (observed 11.3% retry rate), stored (MinIO), and re-indexed for future retrieval, minimizing redundant LLM invocation.
L3: Layout Architect: Assigns spatial poses using an LLM-driven layout model, enforcing industrial constraints (electrical clearance, egress, safety, ventilation) via domain-inspired rule checking. Violations yield structured diagnostics, allowing resampling, user notification, or controlled anomaly injection as required.
L4: Bridge (MJCF Assembly): Deterministically assembles the complete MJCF scene, integrating generated/retrieved meshes, global settings, robot models, and spawn coordinates. Scene state is cached per user and rendered via MuJoCo WASM in-browser.
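
The L4 assembly step can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the platform's code: the `PlacedObject` type, the `assemble_mjcf` function, the floor plane, and the mesh file names are all hypothetical, and robot merging and global settings are left out. It uses only the standard library to emit MJCF elements (`asset`/`mesh`, `worldbody`/`body`/`geom`) and sorts objects by name so that the same inputs always produce the same XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class PlacedObject:
    """Hypothetical record for one object after asset resolution and layout."""
    name: str
    mesh_file: str
    pos: Tuple[float, float, float]
    quat: Tuple[float, float, float, float] = (1.0, 0.0, 0.0, 0.0)


def _fmt(values: Sequence[float]) -> str:
    # MJCF attributes take space-separated numbers.
    return " ".join(f"{v:g}" for v in values)


def assemble_mjcf(model_name: str, objects: Iterable[PlacedObject]) -> str:
    """Build a small MJCF document from already-placed objects."""
    root = ET.Element("mujoco", model=model_name)
    asset = ET.SubElement(root, "asset")
    world = ET.SubElement(root, "worldbody")
    # Assumed floor plane; the real scene plan would supply room geometry.
    ET.SubElement(world, "geom", name="floor", type="plane", size="5 5 0.1")

    # Sorting makes the output independent of input order (deterministic).
    for obj in sorted(objects, key=lambda o: o.name):
        mesh_name = f"{obj.name}_mesh"
        ET.SubElement(asset, "mesh", name=mesh_name, file=obj.mesh_file)
        body = ET.SubElement(
            world, "body", name=obj.name, pos=_fmt(obj.pos), quat=_fmt(obj.quat)
        )
        ET.SubElement(body, "geom", type="mesh", mesh=mesh_name)

    ET.indent(root)  # Python 3.9+
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    scene = [
        PlacedObject("table", "table.stl", (1.0, 0.0, 0.4)),
        PlacedObject("cabinet", "cabinet.stl", (0.0, 2.0, 0.9)),
    ]
    print(assemble_mjcf("demo_scene", scene))
```

Keeping this step free of LLM calls means any nondeterminism stays confined to the earlier layers, which is one plausible reading of why the source describes the bridge as deterministic.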

The deployment utilizes a horizontally scalable, nine-service Docker stack with real-time WebSocket progress streaming, asynchronous job queuing (ARQ), telemetry/observability via InfluxDB, and persistent metadata (PostgreSQL, Redis).

Experimental Evaluation and Numerical Results

Telemetry from 30 days of production (611 successful LLM calls) gives the following operational measurements:

End-to-end latency: Five-object scenes with full cache misses complete in a median of ~50 s; eight-object scenes take ~68 s because worker concurrency is batch-limited. Full cache hits reduce typical latency to 30–40 s, consistent with the motivation for the cache-first design.
Per-stage latency: L2 asset generation (LLM-to-CAD) dominates runtime, with cache-miss median call latency at 17.9 s (p95 = 65.4 s). Retry calls exhibit elevated latency (median 24.4 s). L1 orchestration + L3 layout median latency is 15.9 s (p95 = 73.9 s).
Asset reliability: LLM-generated CadQuery code has an 11.3% first-attempt retry rate. Failed attempts are retried automatically, with a fallback to simplified geometry, so a malformed asset does not abort the pipeline.
System throughput: Worker concurrency supports five simultaneous jobs, with up to 100 queued. As the asset library (the semantic cache) grows, more objects resolve without an LLM call, so throughput increases without further scaling.
Mesh fidelity benchmarking: On standard_100 (precision mechanical CAD), best cost-performance routing is qwen3-coder-480b (score 83.1/100, latency 4.3 s, CD_med 0.16); for abstract_45 (free-form shapes), claude-opus-4-7 leads (score 85.1, latency 6.7 s, CD_med 6.32), with open-weight models competitive but at lower geometric fidelity.
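
The summary does not define CD_med. Assuming it denotes a median Chamfer distance between generated and reference geometry, the metric can be sketched as follows. This is a pure-Python illustration on small point sets, not the benchmark's implementation: the function names are hypothetical, and the variant shown (unsquared distances, averaged in each direction and summed) is only one of several conventions in use.

```python
import math
from statistics import median
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest(p: Point, cloud: Sequence[Point]) -> float:
    # Distance from p to its closest point in cloud (brute force, O(n)).
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def median_chamfer(
    pairs: Iterable[Tuple[Sequence[Point], Sequence[Point]]]
) -> float:
    """Median Chamfer distance over (generated, reference) pairs."""
    return median(chamfer_distance(gen, ref) for gen, ref in pairs)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    shifted = [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)]
    print(chamfer_distance(reference, shifted))
```

Under this reading, lower values mean closer geometric agreement. Because the absolute scale depends on units and on how each mesh is sampled, values are only comparable within one benchmark set.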

Core Functionalities and Features

SR-Platform integrates production-oriented, user-facing capabilities:

Asset Studio: Interactive generation, preview, and library-based reuse of text-driven 3D assets; semantic retrieval bypasses expensive generation calls.
Prompt Refinement and NL Editing: Multi-turn chat module for resolving underspecified prompts; supports object-level scene edits via NL commands, eliminating the need for direct MJCF manipulation.
Robot Catalog and Merging: Supports insertion of canonical robot models (e.g., TurtleBot, UR5, Franka Panda) into synthesized environments, maintaining coordinate consistency and MuJoCo compatibility.
Anomaly Injection and Dataset Export: Controlled constraint violations for robustness training; complete export (MJCF, meshes, textures) for downstream dataset pipelines.
Authentication, Persistence, Observability: JWT auth, RBAC, per-user scene isolation, persistent job state, and comprehensive telemetry for LLM latency, retry behavior, and throughput.

Implications and Future Directions

The SR-Platform architecture demonstrates that agentic, modular pipelines can offer scalable, accessible environment synthesis, minimizing the prerequisite expertise in geometry, mesh conversion, and spatial modeling. The cache-first asset strategy has a compounding benefit in deployment: each generated mesh is indexed for later retrieval, so subsequent requests for the same object type skip generation and depend less on the LLM.
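
The cache-first behavior can be sketched in a few lines of Python. This is a simplified illustration under explicit assumptions: an in-memory dictionary with exact-match keys stands in for the Qdrant semantic index and MinIO storage, and the `AssetForge` class, its `generate` and `validate` callables, and the `box:` fallback string are hypothetical names rather than the platform's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class AssetResult:
    label: str
    mesh: str
    source: str    # "cache", "generated", or "fallback"
    attempts: int  # generation attempts used (0 on a cache hit)


@dataclass
class AssetForge:
    generate: Callable[[str], str]   # stand-in for LLM-to-CadQuery synthesis
    validate: Callable[[str], bool]  # stand-in for geometry validation
    max_attempts: int = 2
    # Exact-match dict; the real system uses semantic (vector) retrieval.
    cache: Dict[str, str] = field(default_factory=dict)

    def resolve(self, label: str) -> AssetResult:
        key = label.strip().lower()
        if key in self.cache:
            # Cache hit: no generation call at all.
            return AssetResult(label, self.cache[key], "cache", 0)

        for attempt in range(1, self.max_attempts + 1):
            try:
                mesh = self.generate(label)
            except Exception:
                continue  # treat a crash in generated code as a failed attempt
            if self.validate(mesh):
                self.cache[key] = mesh  # index for future requests
                return AssetResult(label, mesh, "generated", attempt)

        # All attempts failed: return simplified geometry, and do not cache it.
        return AssetResult(label, f"box:{key}", "fallback", self.max_attempts)


if __name__ == "__main__":
    forge = AssetForge(
        generate=lambda label: f"mesh:{label}",
        validate=lambda mesh: mesh.startswith("mesh:"),
    )
    print(forge.resolve("conveyor").source)  # first request generates
    print(forge.resolve("Conveyor").source)  # second request hits the cache
```

Fallback geometry is deliberately left out of the cache in this sketch, so a later request can retry generation instead of reusing a placeholder. Whether the platform makes the same choice is not stated in the source.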

The system's reliability is bounded by LLM-to-CAD generative error rates, necessitating robust validation and retry/fallback strategies; future versions will refine error taxonomy (syntax, geometry, scale, topology, semantics) and specialize recovery routes.
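
One way to picture a taxonomy with specialized recovery routes is a small lookup table. The five categories below come from the text; everything else in this sketch is hypothetical, including the route names, the `classify_code_error` helper, and the idea of detecting syntax errors by compiling the generated source. The source describes this as future work and does not specify a design.

```python
from enum import Enum
from typing import Dict, Optional


class ForgeError(Enum):
    """Error categories named in the text."""
    SYNTAX = "syntax"
    GEOMETRY = "geometry"
    SCALE = "scale"
    TOPOLOGY = "topology"
    SEMANTICS = "semantics"


# Hypothetical recovery routes; the names are illustrative only.
RECOVERY_ROUTES: Dict[ForgeError, str] = {
    ForgeError.SYNTAX: "regenerate_code",
    ForgeError.GEOMETRY: "regenerate_with_diagnostics",
    ForgeError.SCALE: "rescale_mesh",
    ForgeError.TOPOLOGY: "repair_or_simplify_mesh",
    ForgeError.SEMANTICS: "reprompt_with_clarification",
}


def classify_code_error(source: str) -> Optional[ForgeError]:
    """Detect only the syntax category by compiling (not running) the code."""
    try:
        compile(source, "<generated>", "exec")
    except SyntaxError:
        return ForgeError.SYNTAX
    return None  # other categories need geometry-level checks


def route(error: ForgeError) -> str:
    return RECOVERY_ROUTES[error]


if __name__ == "__main__":
    err = classify_code_error("def make_part(:")
    if err is not None:
        print(err.value, "->", route(err))
```

The point of such a table is that a cheap fix, such as rescaling, need not trigger a full LLM regeneration, whereas a single generic retry treats every failure the same way.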

Industrial constraint checking before MJCF assembly ensures physical plausibility and brings scenes closer to real-world deployment constraints, but it does not substitute for formal compliance review (NEC/NFPA/ISO/ASHRAE). Constraint visibility is prioritized for dataset realism.
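
A rule check that returns structured diagnostics can be sketched as below. It is a minimal illustration, not the platform's rule set: the `Footprint` and `Violation` types, the axis-aligned 2D footprint model, the rule name, and the 0.9 m threshold in the usage example are all assumptions made for the example and carry no regulatory meaning.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Footprint:
    """Axis-aligned floor footprint: center (x, y), width w, depth d, in meters."""
    name: str
    x: float
    y: float
    w: float
    d: float


@dataclass(frozen=True)
class Violation:
    """Structured diagnostic for one failed pairwise check."""
    rule: str
    objects: Tuple[str, str]
    required_m: float
    actual_m: float


def gap(a: Footprint, b: Footprint) -> float:
    """Shortest distance between two rectangles (0.0 if they touch or overlap)."""
    dx = max(0.0, abs(a.x - b.x) - (a.w + b.w) / 2)
    dy = max(0.0, abs(a.y - b.y) - (a.d + b.d) / 2)
    return math.hypot(dx, dy)


def check_clearance(
    items: Sequence[Footprint], min_gap_m: float, rule: str = "clearance"
) -> List[Violation]:
    """Return one Violation per pair of objects closer than min_gap_m."""
    violations = []
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            actual = gap(a, b)
            if actual < min_gap_m:
                violations.append(
                    Violation(rule, (a.name, b.name), min_gap_m, actual)
                )
    return violations


if __name__ == "__main__":
    layout = [
        Footprint("panel", 0.0, 0.0, 1.0, 1.0),
        Footprint("rack", 1.5, 0.0, 1.0, 1.0),
        Footprint("bench", 5.0, 5.0, 1.0, 1.0),
    ]
    for v in check_clearance(layout, min_gap_m=0.9):
        print(v)
```

Returning data instead of raising an exception lets the caller choose between resampling the layout, notifying the user, or keeping the violation on purpose as an injected anomaly, which are the three outcomes the layout layer is described as supporting.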

SR-Platform currently targets indoor, semi-structured manipulation environments; organic, deformable, and visually complex objects or large-scale environments remain challenging. Further integration with image-to-3D pipelines, high-fidelity backends (Newton, Isaac), and robot-policy generation workflows is planned, with direct export to RL dataset formats (LeRobot, RLDS) in scope.

Conclusion

SR-Platform (2605.14700) implements an agentic, modular pipeline for scalable robot simulation environment synthesis from natural language. By separating orchestration, cache-aware asset generation, constraint-driven spatial layout, and deterministic MJCF assembly, it produces executable simulation scenes for robot learning workflows in under one minute from unconstrained English prompts. Production telemetry supports the cache-first latency reduction (about 50 s to 30–40 s for five-object scenes) and shows automatic recovery from the 11.3% of asset generations that fail on the first attempt; separate benchmarks report mesh fidelity across object classes. The infrastructure and operational telemetry support practical, industrial simulation dataset pipelines, laying the foundation for future advances in synthetic environment and training data generation for embodied AI and robotic policy learning.