ScenarioNL: Language to Scenario Conversion

Updated 27 March 2026

ScenarioNL is a method that converts unstructured language into explicit, simulatable scenario representations applicable in domains like autonomous driving and code synthesis.
The framework employs multi-stage pipelines combining NLP preprocessing, semantic extraction, and probabilistic programming to formalize human intent into structured formats.
Applications include automated driving safety, social navigation testing, and scenario-based code synthesis, driving improved simulation fidelity and systematic evaluation.

ScenarioNL refers to the task of generating, detecting, and formalizing scenarios from natural language, with applications ranging from automated driving and robotics to code synthesis and agent evaluation. At its core, ScenarioNL derives structured, simulatable, or testable scenario representations from natural language descriptions, narratives, or specifications. This capability closes the gap between human intent or documentation and the formal symbolic or programmatic objects required for large-scale testing, benchmarking, or automation in complex technical domains.

1. Formal Definitions and Scope of ScenarioNL

ScenarioNL encompasses diverse instantiations across domains, unified by the transformation of natural language into explicit, actionable scenario representations. In autonomous driving, ScenarioNL includes the mapping of free-text, user- or engineer-supplied driving condition requests (“I want all cut-in events where the intruding vehicle’s time-to-collision falls below 2 s”) into concrete, labeled trajectories or scene programs that can be replayed in simulators such as CarMaker or Esmini (Zhao et al., 2024). In social navigation, it involves synthesizing detailed human-robot interaction testbeds from high-level context labels and brief event sketches (Marpally et al., 2024). In code domains, it covers the extraction of real-world programming scenarios and requirements from massive corpora, synthesizing graph-structured program-generation challenges (Yao et al., 16 Sep 2025).

Generalizing, ScenarioNL can be formulated as a function

$f: \mathcal{D} \times \mathcal{M} \rightarrow \mathcal{S}$

where $\mathcal{D}$ is the space of unstructured natural language descriptions (possibly with context $\mathcal{M}$ such as scenario metadata or test constraints), and $\mathcal{S}$ is the space of structured scenario programs or labeled scene instances suitable for subsequent simulation, evaluation, or benchmarking.
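As a rough illustration, the mapping $f$ can be written down as a typed interface. This is only a sketch: the names `ScenarioContext`, `Scenario`, `ScenarioTranslator`, and `echo_translator` are hypothetical and do not come from any of the cited frameworks, and the placeholder translator stands in for what would in practice be an LLM-backed pipeline.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass(frozen=True)
class ScenarioContext:
    """Optional context M: scenario metadata or test constraints (hypothetical fields)."""
    domain: str = "driving"
    constraints: dict[str, float] = field(default_factory=dict)


@dataclass(frozen=True)
class Scenario:
    """Element of S: a structured scenario, reduced here to a format tag and program text."""
    target_format: str
    program: str


class ScenarioTranslator(Protocol):
    """Any callable implementing f: D x M -> S."""

    def __call__(self, description: str, context: ScenarioContext) -> Scenario: ...


def echo_translator(description: str, context: ScenarioContext) -> Scenario:
    """Placeholder f: wraps the description in a comment instead of synthesizing a program."""
    return Scenario(target_format="text", program=f"# {context.domain}: {description}")
```

Any concrete system discussed below can be read as one implementation of `ScenarioTranslator` with a richer `Scenario` type, for example a Scenic program or an OpenSCENARIO file.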

The scope of ScenarioNL includes:

Automated generation of scenario programs (DSLs, code, or simulation input files) for autonomous systems validation (Elmaaroufi et al., 2024, Safa et al., 24 Feb 2026, Yang et al., 2023, Zhao et al., 2024).
Large-scale annotation or extraction of scenario labels (“scripts”) from narrative text for NLP and cognitive science (Wanzare et al., 2019).
Actual scenario “mining” from multimodal records (e.g., crash reports with associated sketches) (Safa et al., 24 Feb 2026).
Synthesis of code problems that authentically reflect real-world scenario complexity, driven by scenario-centric knowledge graphs (Yao et al., 16 Sep 2025).
Systematic generation of interactive natural language testing sessions to probe LLM-based agentic systems, sampling from scenario-like probabilistic state transition models (Wang et al., 19 Jul 2025).

2. Methodologies and Architectural Patterns

ScenarioNL systems involve multi-stage pipelines tailored to their target domain but generally follow a progression:

Input preprocessing: Natural language text (and optionally sketches, metadata, or other auxiliary modalities) is tokenized and normalized for downstream processing (Safa et al., 24 Feb 2026, Elmaaroufi et al., 2024).
Prompt engineering and structured LLM querying: Carefully crafted prompts guide LLMs to interpret, condense, and formalize user intent or descriptive narratives into machine-interpretable templates, classifications, or programs (Zhao et al., 2024, Yang et al., 2023, Marpally et al., 2024).
Semantic extraction and representation: Outputs are cast into structured forms such as label schemas (e.g., activity/position matrices), probabilistic DSLs (e.g., Scenic or Extended Scenic), JSON/YAML templates, or code skeletons. This step often uses few-shot or schema-constrained prompting and may include self-validation loops (Safa et al., 24 Feb 2026, Elmaaroufi et al., 2024).
Rule-based or probabilistic scenario mining/matching: For dataset-driven or scenario-mining instantiations, the system applies mathematical rules over temporal, spatial, and actor state to match candidate events to formalized scenario classes, often using metric thresholds such as Time-to-Collision (TTC) or Post-Encroachment Time (Zhao et al., 2024).
Probabilistic program synthesis: Uncertainty in reports or specifications is encoded as explicit random variables and constraints in programmatic scenario representations, e.g., using distributions in Scenic (Elmaaroufi et al., 2024, Safa et al., 24 Feb 2026, Yang et al., 2023).
Simulatable output generation: The system emits code, DSL programs, or simulator input files (OpenSCENARIO XML, Python, Scenic) for downstream evaluation (Yang et al., 2023, Zhao et al., 2024).
Closed-loop evaluation and correction: Compiler-in-the-loop, error-driven correction, and ground-truth alignment ensure syntactic and semantic validity of generated scenarios (Elmaaroufi et al., 2024).
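
The metric-threshold matching mentioned in the list above can be sketched with Time-to-Collision. The code below uses the common constant-velocity definition (gap divided by closing speed), which is an assumption here rather than the exact formulation of any cited system. The function names and the 2 s default are illustrative.

```python
import math


def time_to_collision(gap_m: float, v_follow_mps: float, v_lead_mps: float) -> float:
    """Constant-velocity TTC in seconds; infinite when the gap is not closing."""
    closing_speed = v_follow_mps - v_lead_mps
    if closing_speed <= 0.0:
        return math.inf
    return gap_m / closing_speed


def matches_critical_cut_in(ttc_series_s: list[float], ttc_threshold_s: float = 2.0) -> bool:
    """True if the minimum TTC over an already detected cut-in drops below the threshold."""
    return bool(ttc_series_s) and min(ttc_series_s) < ttc_threshold_s
```

For example, a 20 m gap closed at 10 m/s gives a TTC of 2.0 s, which does not satisfy a strict "below 2 s" condition, whereas a series dipping to 1.8 s does.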

The ScenarioNL pipeline, therefore, is typified by a sequence of NLP (LLM-based), symbolic reasoning, and verification steps that map ambiguous human input to high-fidelity, actionable scenarios.
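
The closed-loop correction step can be illustrated with a small generate, check, and repair loop. This is a minimal sketch under stated assumptions: Python's `ast.parse` stands in for a scenario-DSL compiler, and `scripted_generator` is a fake stand-in for an LLM call. None of these names come from the cited frameworks.

```python
import ast
from typing import Callable, Optional


def check_python_syntax(program: str) -> Optional[str]:
    """Stand-in compiler: return None if the program parses, else an error message."""
    try:
        ast.parse(program)
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


def generate_with_repair(
    description: str,
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]] = check_python_syntax,
    max_rounds: int = 3,
) -> tuple[Optional[str], int]:
    """Ask `generate` for a program, feeding checker errors back until it passes.

    Returns (program, rounds_used); program is None if every round failed.
    """
    error: Optional[str] = None
    for round_index in range(1, max_rounds + 1):
        program = generate(description, error)
        error = check(program)
        if error is None:
            return program, round_index
    return None, max_rounds


def scripted_generator(description: str, error: Optional[str]) -> str:
    """Fake LLM: emits a broken program first, then a fixed one once it sees an error."""
    if error is None:
        return "ego_speed = (10, 20"  # missing closing parenthesis
    return "ego_speed = (10, 20)"


program, rounds = generate_with_repair("ego drives at 10 to 20 m/s", scripted_generator)
```

The loop only guarantees syntactic validity against whatever checker is plugged in. Semantic alignment with the description would need a separate check, such as simulation or comparison with a reference.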

3. Mathematical and Computational Foundations

The translation from language to scenario is formalized via classification, probabilistic programming, syntactic program synthesis, and optimization algorithms:

Classification Models and Rule-based Matching: ScenarioNL systems for scenario extraction employ rule-based classifiers that threshold acceleration to categorize longitudinal activity $A_{\mathrm{lon}}$ and examine lane change/heading for lateral activity $A_{\mathrm{lat}}$, as in

$A_{\mathrm{lon}}(a_{\mathrm{lon}})= \begin{cases} \text{Deceleration}, & a_{\mathrm{lon}} < -a_{\mathrm{lon}}^{\mathrm{thr}} \\ \text{Acceleration}, & a_{\mathrm{lon}} > a_{\mathrm{lon}}^{\mathrm{thr}} \\ \text{Keep velocity}, & \text{otherwise.} \end{cases}$

(Zhao et al., 2024).
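
The piecewise rule for $A_{\mathrm{lon}}$ translates directly into code. In the sketch below, the default threshold of 0.3 m/s² is an illustrative assumption and not a value taken from the cited work.

```python
def longitudinal_activity(a_lon: float, a_thr: float = 0.3) -> str:
    """Classify longitudinal activity from acceleration a_lon (m/s^2).

    a_thr plays the role of the threshold a_lon^thr and must be non-negative.
    """
    if a_thr < 0.0:
        raise ValueError("a_thr must be non-negative")
    if a_lon < -a_thr:
        return "Deceleration"
    if a_lon > a_thr:
        return "Acceleration"
    return "Keep velocity"
```

Values exactly at the threshold fall into "Keep velocity", matching the strict inequalities in the formula.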

Probabilistic Scenario Representation: ScenarioNL systems rely on scenario-specific DSLs (e.g., Scenic, Extended Scenic, L2I Python/Scenic code) that support assignment of random variables and joint distributions to scene attributes. Example: speeds or positions as $v \sim \text{Uniform}(a,b)$, or $x \sim \text{Normal}(\mu, \sigma)$ (Elmaaroufi et al., 2024, Safa et al., 24 Feb 2026, Yang et al., 2023).
Scenario detection as multi-label topic segmentation and classification: Narrative-scenario detection employs unsupervised topic segmentation (TopicTiling) combined with a supervised multilayer perceptron for segment-wise scenario assignment, with performance evaluated using precision, recall, and $F_1$ across a fixed inventory of scripts (Wanzare et al., 2019).
Graph-centric code scenario sampling: In code challenge generation, a scenario-centric knowledge graph encodes applications, knowledge, skills, and techniques, with edge weights defined by document co-occurrence. Sampling scenarios involves Markovian traversals with softmax-tempered transition kernels to control feature diversity and complexity (Yao et al., 16 Sep 2025).
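
The softmax-tempered Markovian traversal in the last item can be sketched as a random walk over a weighted graph. This is a toy illustration under assumptions: the graph contents are made up, and applying the softmax directly to raw edge weights is one plausible reading rather than the confirmed formulation of the cited work.

```python
import math
import random


def softmax_transition(weights: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn outgoing edge weights into transition probabilities with a tempered softmax."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = {node: w / temperature for node, w in weights.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {node: math.exp(s - peak) for node, s in scaled.items()}
    total = sum(exps.values())
    return {node: e / total for node, e in exps.items()}


def sample_walk(
    graph: dict[str, dict[str, float]],
    start: str,
    steps: int,
    temperature: float,
    rng: random.Random,
) -> list[str]:
    """Random walk of at most `steps` transitions; stops early at nodes with no out-edges."""
    path = [start]
    for _ in range(steps):
        neighbors = graph.get(path[-1], {})
        if not neighbors:
            break
        probs = softmax_transition(neighbors, temperature)
        nodes = list(probs)
        path.append(rng.choices(nodes, weights=[probs[n] for n in nodes], k=1)[0])
    return path


# Hypothetical toy graph: edge weights stand for document co-occurrence counts.
toy_graph = {
    "web backend": {"caching": 3.0, "authentication": 1.0},
    "caching": {"LRU eviction": 2.0, "web backend": 1.0},
    "authentication": {"web backend": 1.0},
    "LRU eviction": {},
}
walk = sample_walk(toy_graph, "web backend", steps=4, temperature=1.0, rng=random.Random(0))
```

A low temperature concentrates probability on the most strongly co-occurring neighbor, while a high temperature flattens the distribution and yields more diverse feature combinations.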

4. Application Domains and Framework Instantiations

ScenarioNL is instantiated across several technical domains:

Automated Driving and Safety Validation: Frameworks such as Chat2Scenario, ScenicNL, and Extended Scenic DSL pipelines transform English scenario descriptions and crash reports into concrete, replayable simulator scenes, supporting probabilistic variation and compliance with traffic rule monitoring (Zhao et al., 2024, Elmaaroufi et al., 2024, Safa et al., 24 Feb 2026). Outputs natively target standards such as ASAM OpenSCENARIO and simulation tools including CARLA, Esmini, and CarMaker.
Social Navigation and HRI: SocRATES automates social navigation benchmarks by synthesizing multi-agent human-robot scenarios as behavior-tree programs, using LLM-guided expansion from scenario metadata and context cues to executable simulator sessions (Marpally et al., 2024).
Narrative Understanding in NLP: ScenarioNL methods segment and label narrative text corpora with scenario/script labels, providing benchmarks for measuring script-awareness and event structure inference in LLMs (Wanzare et al., 2019).
Natural-Language-Driven Virtual Road Scenes: NLD simulation and SimCopilot map language to interaction code for object-rich, high-intensity traffic scenes, supporting systematic study of compositional and geometric generalization (Yang et al., 2023).
Adversarial and QA Agent Evaluation: ScenarioNL-style frameworks such as Neo automate generation of multi-turn, scenario-driven conversations with LLM-powered systems, sampling question/intent/tone from stochastic state models and closing the loop with evaluation and feedback (Wang et al., 19 Jul 2025).
Scenario-centric Code Problem Generation: SCoGen samples scenario-grounded code problem triples from a structured knowledge graph, assembling prompts for large-scale code LLM pretraining (Yao et al., 16 Sep 2025).

A summary of core frameworks:

Framework	Domain	Scenario Representation
Chat2Scenario	Automated driving	ASAM OpenSCENARIO + CarMaker txt
ScenicNL/ExtScenic	Automated driving	Scenic/Extended Scenic DSL
SocRATES	Social navigation, HRI	BT XML, HuNavSim + ROS2
SimCopilot	Virtual road scenes	Python/Scenic code
Neo	LLM system testing	Markov dialogue tree (NL)
SCoGen	Code problem synthesis	Scenario-centric knowledge graph
TopicTiling + MLP	NLP narrative analysis	Multi-label segment assignment

5. Empirical Evaluation and Benchmarking

ScenarioNL systems are evaluated on diverse criteria, including syntax validity, scenario fidelity, scenario coverage, simulation outcomes, and user study feedback:

Driving Scenario Extraction: Chat2Scenario achieves F1 = 0.857 (following), 0.889 (cut-in), 0.919 (cut-out) when matching scenario labels to hand-annotated ground truth on highD data (Zhao et al., 2024). Exported scenarios yield high-fidelity motion traces replayable in industrial simulators.
Probabilistic Program Generation: ScenicNL’s compiler-in-the-loop pipeline attains 90% syntactic correctness and an ARE (Accuracy, Relevance, Expressiveness) score of 4.3/5 on easy crash report scenarios, outperforming zero-shot and few-shot prompting by large margins (Elmaaroufi et al., 2024). Extended Scenic pipelines improve semantic accuracy and verifiability, yielding 100% environment/road-network, 98% oracle, and 97% trajectory correspondence to human reference. Scenarios consistently trigger intended events across thousands of sampled variants (Safa et al., 24 Feb 2026).
Social Navigation Scenario Generation: SocRATES yields 73% first-cut simulability for guided natural-language sketch tasks, more than double that of naive LLM prompting. Case studies demonstrate marked time reductions for HRI evaluation, from roughly one month of hand coding to a four-day pipeline-driven study (Marpally et al., 2024).
Natural-Language-Driven Simulation Benchmarks: SimCopilot achieves substantive gains in motion translation and compositional generalization tasks, with error rates and failure cases detailed and quantified with trajectory discrepancy metrics and qualitative error analysis (Yang et al., 2023).
Scenario Detection in Narratives: TopicTiling + MLP is scored by overall $F_1$ on a 200-script, 486-document narrative corpus, with challenges attributed to closely related scripts and sparsity of rare scenarios (Wanzare et al., 2019).
Code Scenario Synthesis: SCoGen-augmented code LLMs show improvements up to +8.5% on real-world benchmarks over base models, with random sampling outperforming LLM-based selection in generating challenging scenarios (Yao et al., 16 Sep 2025).
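
Several of the results above are reported as $F_1$ scores. For reference, the metric can be computed from match counts as follows. The counts in the usage line are invented for illustration and do not correspond to any cited experiment.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts only: 8 correct matches, 2 spurious, 2 missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

Because $F_1$ is the harmonic mean of precision and recall, it is pulled toward the lower of the two. A detector cannot score well by over-predicting or under-predicting a scenario class.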

6. Limitations, Open Challenges, and Future Directions

Common limitations across ScenarioNL systems include:

Ambiguity and Data Sparsity: Handling incomplete or ambiguous input (e.g., crash sketches, low-frequency scripts) can impair extraction fidelity. This suggests a continued need for robust probabilistic modeling, self-validation, and possibly multimodal reasoning to resolve uncertainties (Safa et al., 24 Feb 2026, Elmaaroufi et al., 2024, Wanzare et al., 2019).
Domain-Specific Constraints: Each scenario domain imposes unique constraints: road topology coverage is limited (e.g., no roundabouts supported in current Extended Scenic libraries), code synthesis does not verify solution correctness post hoc, and NLP scenario labeling is challenged by script overlap and sparse annotations (Safa et al., 24 Feb 2026, Yao et al., 16 Sep 2025, Wanzare et al., 2019).
Scalability and Generalization: A plausible implication is that broadening scenario mining/translation to multi-simulator, cross-domain, or more unstructured data sources requires further advances in schema extension, toolchain portability, and data augmentation (Yang et al., 2023, Wang et al., 19 Jul 2025).
Closed-Loop and Human-in-the-Loop Refinement: ScenarioNL pipelines increasingly incorporate in-the-loop correction (e.g., compiler feedback, simulation errors). Future directions outlined in the cited work include vision–language critics for scenario rendering, reinforcement-learning-based refinement using simulation rewards, and extension of DSLs to richer scenario vocabularies (Elmaaroufi et al., 2024, Safa et al., 24 Feb 2026).

Collectively, ScenarioNL consolidates a methodological toolkit for rendering natural-language scenario knowledge as high-fidelity, verifiable, simulation- or test-ready programs across technical domains, establishing new benchmarks in language-to-simulation, narrative event detection, agentic evaluation, and code challenge synthesis (Zhao et al., 2024, Marpally et al., 2024, Wanzare et al., 2019, Safa et al., 24 Feb 2026, Elmaaroufi et al., 2024, Yang et al., 2023, Wang et al., 19 Jul 2025, Yao et al., 16 Sep 2025).