LUCID Dataset: Modular Dialogue Generation
- The LUCID dataset is a modular, LLM-driven corpus featuring 4,277 dialogues across 13 domains with rigorous, schema-based annotation.
- It employs an automated pipeline for intent and slot generation, compositional dialogue planning, and turn-by-turn LLM-based synthesis.
- The dataset systematically includes challenging phenomena such as sarcasm, corrections, and ambiguous requests for robust dialogue model evaluation.
The LUCID dataset is a modular, LLM-driven resource for complex, task-oriented dialogue generation, designed to address the need for realistic and challenging conversational data in the development and evaluation of advanced virtual assistants. It features automated schema generation, intent and slot modelling, compositional dialogue planning, turn-by-turn LLM-based utterance synthesis, and rigorous multi-stage validation, all resulting in a high-quality, large-scale corpus rich in conversational phenomena such as sarcasm, in-turn corrections, and ambiguous requests (Stacey et al., 1 Mar 2024).
1. Motivation and Objectives
The primary motivation for the LUCID dataset is the documented scarcity of high-quality, diverse dialogue data suited to training and evaluating task-oriented systems that must handle challenging user behaviors. Existing resources, while large, often lack domain breadth and adequate representation of conversational complexities (e.g., corrections, cancellations, out-of-distribution (OOD) queries), and even when such phenomena do occur, they are rarely explicitly labeled. Human annotation remains a scaling bottleneck. LUCID aims to provide a scalable solution with minimal human involvement, using LLMs to generate, annotate, and validate realistic dialogues across a wide range of domains, enabling both the bootstrapping of new domains and robust, systematic evaluation of dialogue models, including OOD generalization.
2. Data Generation Pipeline
The LUCID data generation process is partitioned into fourteen functionally distinct stages that together automate the creation of diverse, high-quality dialogues. The pipeline is summarized as follows:
- Stages 1–2 (Intent and Slot Schema Construction): An LLM receives a short human-authored intent description and generates a formal intent schema, including intent and slot names, slot data types, and mandatory/optional annotations. Slot values are generated programmatically to maintain type and domain consistency (see the schema sketch following this list).
- Stages 3–8 (Conversation Planning): The conversation planner samples intent sequences for each conversation, refines slot values for realism, and plans the appearance of "unhappy path" phenomena (sarcasm, corrections, delayed confirmations, etc.). Rules ensuring attribute consistency (e.g., time constraints between check-in and check-out) are applied.
- Stages 9–12 (Dialogue Synthesis): A simulated user agent and system agent exchange utterances, with the system agent referencing the schema and maintaining a state tracker. Each new slot value is grounded in a text span from the user utterance to avoid hallucination, and responses are rendered into naturalistic language by a separate LLM. Annotations take the form of explicit Python-style function calls over the schema (see the synthesis sketch below).
- Stages 13–14 (Rigorous Multi-LLM Validation): Multiple validation agents re-predict system responses under randomized seeds and temperatures; only dialogues on which all validators reach complete agreement are retained in full. When validation fails mid-dialogue, the prefix up to that point is kept, yielding high label consistency.
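To make stages 1–2 and the function-call annotation format concrete, the sketch below shows one plausible way an intent schema could be represented; the intent name, slot names, and helper classes are hypothetical illustrations rather than artifacts of the released corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Slot:
    """A single slot in an intent schema (hypothetical field names)."""
    name: str
    dtype: str       # e.g. "str", "int", "date"
    required: bool   # mandatory vs. optional slot

@dataclass
class IntentSchema:
    """Formal intent schema of the kind an LLM could derive from a short description."""
    name: str
    description: str
    slots: List[Slot] = field(default_factory=list)

    def to_call(self, values: Dict[str, str]) -> str:
        """Render filled slot values as a Python-style function-call annotation."""
        args = ", ".join(f"{k}={v!r}" for k, v in values.items())
        return f"{self.name}({args})"

# Hypothetical intent in the spirit of LUCID's transport/appointments domains
book_hotel = IntentSchema(
    name="book_hotel",
    description="Reserve a hotel room for the user.",
    slots=[
        Slot("city", "str", required=True),
        Slot("check_in", "date", required=True),
        Slot("check_out", "date", required=True),
        Slot("num_guests", "int", required=False),
    ],
)
print(book_hotel.to_call({"city": "Lisbon", "check_in": "2024-05-01", "check_out": "2024-05-04"}))
# -> book_hotel(city='Lisbon', check_in='2024-05-01', check_out='2024-05-04')
```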
This modular pipeline ensures fine-grained control over both the diversity and the validity of the output, supporting rapid adaptation to new domains or intent ontologies.
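The span-grounding constraint in stages 9–12 can be viewed as an admission test on the state tracker: a predicted slot value enters the dialogue state only if it is supported by the user's own words. The sketch below is a minimal illustration of that idea, with a hypothetical `grounded_state_update` helper and a plain substring match standing in for span alignment; it is not the released pipeline code.

```python
from typing import Dict

def grounded_state_update(
    state: Dict[str, str],
    user_utterance: str,
    predicted_slots: Dict[str, str],
) -> Dict[str, str]:
    """Accept a predicted slot value only if it is grounded in the user's text.

    Mirrors the anti-hallucination constraint described above: slot values must
    correspond to spans of the user utterance before they enter the state tracker.
    """
    updated = dict(state)
    for slot, value in predicted_slots.items():
        if value.lower() in user_utterance.lower():
            updated[slot] = value
        # otherwise the value is rejected and the system agent must re-ask
    return updated

# Hypothetical single turn, including an in-turn correction
state: Dict[str, str] = {}
utterance = "I need a room in Lisbon from the 1st of May, actually make that the 2nd."
state = grounded_state_update(state, utterance, {"city": "Lisbon", "check_in": "the 2nd"})
print(state)  # {'city': 'Lisbon', 'check_in': 'the 2nd'}
```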
3. Dataset Composition and Properties
LUCID’s seed release consists of 4,277 dialogues (92,699 utterances), encompassing:
- 100 unique intents distributed across 13 application domains, such as transport, appointments, lists, and reviews.
- 501 distinct slot types, with programmatically generated, type-consistent values used throughout the corpus.
- Challenging conversational features, systematically modelled and labelled, including:
  - Sarcasm
  - Overheard utterances
  - Self-correction and revision
  - Slot value mismatches and clarifications
  - Cancellations and out-of-distribution (OOD) requests
Additionally, the dataset offers both seen-intent test sets and OOD splits for robust evaluation of generalization.
| Property | Value/Description | Notes |
|---|---|---|
| Dialogues | 4,277 | 92,699 utterances in total |
| Unique Intents | 100 | Distributed across 13 domains |
| Slot Types | 501 | Automatic, type-safe generation and grounding |
| OOD Test Sets | Yes | Unseen-intent/generalization evaluation |
| Challenging Phenomena | Sarcasm, corrections, OOD, etc. | Systematically included and annotated |
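Because the OOD evaluation targets intents unseen during training, constructing such a split amounts to holding out whole intents rather than individual dialogues. The sketch below assumes a simple list-of-dicts representation with an `intents` field; the format and field names are hypothetical, not the released file layout.

```python
import random
from typing import List, Tuple

def split_by_intent(
    dialogues: List[dict],
    holdout_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[dict], List[dict]]:
    """Hold out whole intents to form an OOD split (hypothetical sketch).

    A dialogue lands in the OOD split if any of its intents was held out;
    otherwise it stays in the seen-intent pool.
    """
    all_intents = sorted({i for d in dialogues for i in d["intents"]})
    rng = random.Random(seed)
    held_out = set(rng.sample(all_intents, max(1, int(len(all_intents) * holdout_fraction))))
    seen = [d for d in dialogues if not set(d["intents"]) & held_out]
    ood = [d for d in dialogues if set(d["intents"]) & held_out]
    return seen, ood

# Toy usage with made-up dialogues
toy = [
    {"id": 0, "intents": ["book_hotel", "set_reminder"]},
    {"id": 1, "intents": ["order_taxi"]},
    {"id": 2, "intents": ["set_reminder"]},
]
seen, ood = split_by_intent(toy, holdout_fraction=0.34)
print(len(seen), len(ood))
```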
4. Automated Validation and Quality Assurance
To ensure label and utterance quality, LUCID incorporates redundant, LLM-based validators at the final pipeline stages. The system uses an "if in doubt, discard" policy, with multiple independent LLM calls (under varied temperatures) producing label and response predictions for each turn. Prefix truncation preserves only reliably validated data. Human evaluation on 200 samples found an annotation error rate of only ~1% and confirmed that planned intent-slot values are faithfully reflected in the dialogues. This strict validation, combined with schema-driven generation, minimizes annotation noise and ensures that only high-confidence dialogues are retained.
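In effect, the policy is: every validator must reproduce the stored annotation for each turn, and the first disagreement truncates the dialogue to its validated prefix. The sketch below is a minimal paraphrase of that rule, with generic callables standing in for LLM validators sampled at different seeds and temperatures; it is an illustrative simplification, not the authors' implementation.

```python
from typing import Callable, List, Sequence

def keep_validated_prefix(
    turns: Sequence[dict],
    validators: List[Callable[[dict], str]],
) -> List[dict]:
    """Apply the 'if in doubt, discard' policy turn by turn.

    Each validator re-predicts the annotation for a turn (in the real pipeline,
    an LLM call under a different seed/temperature). A turn is kept only if all
    validators agree with the stored annotation; the first disagreement truncates
    the dialogue to its reliably validated prefix.
    """
    validated: List[dict] = []
    for turn in turns:
        if all(v(turn) == turn["annotation"] for v in validators):
            validated.append(turn)
        else:
            break  # discard everything from the first disagreement onwards
    return validated

# Hypothetical usage with two trivial validators that echo the gold label
dialogue = [
    {"utterance": "Book a taxi to the airport", "annotation": "book_taxi(destination='airport')"},
    {"utterance": "Actually, cancel that", "annotation": "cancel_booking()"},
]
echo = lambda turn: turn["annotation"]
print(len(keep_validated_prefix(dialogue, [echo, echo])))  # 2: all turns validated
```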
5. Applications and Research Implications
The features and quality of the LUCID dataset support several research and development objectives:
- Training and Evaluation of Dialogue Models: Provides turn-level intent and slot supervision for complex, multi-intent scenarios requiring semantic parsing, state tracking, and nuanced response selection.
- OOD/Generalization Studies: OOD test splits and challenging behaviors allow direct benchmarking of model robustness.
- Rapid Domain Bootstrapping: The modular LLM pipeline allows fast, scalable adaptation for new domains or ontologies with minimal human labor.
- Research on Conversational Complexity: The explicit modeling and annotation of unhappy paths facilitate studies on error recovery, ambiguity, and conversational repair strategies.
A plausible implication is that LUCID sets a precedent for highly modular, LLM-centric data generation pipelines in language technology, where compositionality and automated validation enable both high scalability and verifiable label quality.
6. Limitations and Future Directions
Approximately 56% of generated dialogues pass all automated checks, and truncated prefixes are retained for failed validations, indicating a residual gap in fully error-free, end-to-end conversational generation. The modular pipeline can in principle be adapted to cover more domains or dialogue styles, and the methodology encourages further research on LLM-driven schema and label validation. As conversational agents continue to develop, LUCID’s framework can be extended with additional modules for dynamic knowledge grounding or multi-agent interactions.
7. Significance for NLP and Dialogue System Research
LUCID’s paradigm demonstrates that the systematic decomposition of dialogue data generation into primitive LLM-driven steps—with built-in redundancy and schema-constrained validation—yields a resource that matches or exceeds manual annotation on quality for complex conversational phenomena. The dataset enables scalable evaluation and domain adaptation in task-oriented systems, providing controlled testbeds for conversational AI under realistic, challenging, and diverse dialogic conditions. The dual release of data and generation code establishes a benchmark for reproducibility and extensibility in dialogue corpus creation (Stacey et al., 1 Mar 2024).