
ConvLab: Open-Source TOD Toolkit

Updated 22 January 2026
  • ConvLab is an open-source platform for task-oriented dialogue systems that standardizes data representation and supports diverse architectures.
  • Its modular pipeline decomposes dialogue agents into NLU, DST, Policy, and NLG components, enabling plug-and-play and end-to-end approaches.
  • Empirical findings and integrated diagnostic tools facilitate transfer learning, robust policy evaluation, and reproducibility across heterogeneous datasets.

ConvLab is an open-source research platform and toolkit suite for building, training, evaluating, and diagnosing task-oriented dialogue (TOD) systems, emphasizing modularity, extensibility, unified data interfaces, and reproducible benchmarking. ConvLab supports the pipeline paradigm—Natural Language Understanding (NLU), Dialogue State Tracking (DST), Policy Learning, Natural Language Generation (NLG)—as well as fully end-to-end neural models. Its unified framework covers classic pipeline, hybrid, and word-level/latent-action architectures, accommodates heterogeneous datasets, integrates simulation and human evaluation, and provides component-level diagnostic tools. Major releases include ConvLab (Lee et al., 2019), ConvLab-2 (Zhu et al., 2020), and ConvLab-3 (Zhu et al., 2022), each expanding in dataset coverage, model variety, diagnostic support, and unified interfacing.

1. Unified Data Representation and Dataset Integration

ConvLab-3 introduces a standardized data schema to bridge idiosyncratic dataset formats, reducing model-integration effort from O(M × N) to O(M + N) for M models and N task-oriented dialogue corpora (Zhu et al., 2022). Each dataset is represented by:

  • ontology.json: Encodes domain schemas as a JSON object with:
    • domains: maps domain name to slot definitions, where each slot has is_categorical, possible_values, and description.
    • intents: mapping from intent names to descriptions.
    • dialogue_acts: lists (categorical, non-categorical, binary) of act tuples (speaker, intent, domain, slot).
    • state_template: initializes domain slots to None.
  • dialogues.json: A list of dialogues, each with fields for dataset, split, dialogue id, domain(s), user goals (informable, requestable), and turn-level annotations:
    • For each turn: speaker, utterance, dialogue_acts, state (updated belief), database output (results after API/DB query).
  • database/ (or API wrapper): Implements the abstract BaseDatabase.query(domain, state, topk)→ List[Dict], supporting SQL, REST, or knowledge-graph interfaces.

This unified schema underpins all ConvLab-3 modules and evaluation, allowing seamless cross-dataset transfer and plug-and-play modeling. Adding a new dataset requires only conversion scripts and, if necessary, a database API adapter.
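Under this schema, a database adapter only needs to implement the query interface above. A minimal in-memory sketch, where the `BaseDatabase` stub and the hotel records are illustrative stand-ins rather than ConvLab code:

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseDatabase(ABC):
    """Stand-in for the abstract interface described by the schema."""

    @abstractmethod
    def query(self, domain: str, state: Dict, topk: int) -> List[Dict]:
        ...


class InMemoryHotelDB(BaseDatabase):
    """Illustrative adapter: matches records against belief-state constraints."""

    def __init__(self, records: List[Dict]):
        self.records = records

    def query(self, domain: str, state: Dict, topk: int = 5) -> List[Dict]:
        # Keep only slots with a concrete (non-None) value as constraints.
        constraints = {k: v for k, v in state.get(domain, {}).items() if v}
        hits = [r for r in self.records
                if all(r.get(slot) == value for slot, value in constraints.items())]
        return hits[:topk]


db = InMemoryHotelDB([{"name": "A", "area": "north"},
                      {"name": "B", "area": "south"}])
results = db.query("hotel", {"hotel": {"area": "north", "stars": None}}, topk=3)
# results -> [{"name": "A", "area": "north"}]
```

A real adapter would back `query` with SQL, REST, or a knowledge graph, as the schema allows, while keeping the same call signature.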

2. Modular System Architecture

ConvLab adopts a modular, pipeline-oriented abstraction, decomposing a dialogue agent into principal components (Lee et al., 2019, Zhu et al., 2022):

  • NLU: Maps user utterances to dialogue acts—(intent, domain, slot, value).
  • DST: Updates the belief state from prior state and recognized acts.
  • Policy: Maps the semantic state representation to system dialogue acts.
  • NLG: Translates system acts back to surface utterances.
  • Vectoriser: Encodes semantic state/features for downstream processing (e.g., embeddings, slot bag, uncertainty).
  • User Simulator: Generates user acts/utterances for training/testing via various paradigms.

Interaction is orchestrated via a turn-level flow:

user_acts = user_simulator.step(dialogue_state)
user_utt = user_nlg.generate(user_acts)
user_acts_hyp = agent_nlu.predict(user_utt)   # NLU recovers the user's acts from text
dialogue_state = agent_dst.update(dialogue_state, user_acts_hyp)
state_vector = vectoriser.encode(dialogue_state, user_acts_hyp)
sys_actions = agent_policy.predict(state_vector)
sys_acts = vectoriser.decode_actions(sys_actions)
sys_utt = agent_nlg.generate(sys_acts)

All components are configurable and swappable via YAML/JSON config files. ConvLab-2 (Zhu et al., 2020) generalizes the Agent abstraction to both system and user, supporting pipeline, hybrid, end-to-end, and even multi-party dialogue settings (e.g., self-play, role-play).
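Config-driven component swapping can be illustrated with a toy registry; the class and key names below are assumptions for illustration, not ConvLab's actual API:

```python
# Placeholder component classes standing in for real implementations.
class BERTNLU: ...
class RuleDST: ...
class PPOPolicy: ...
class TemplateNLG: ...

REGISTRY = {"bert_nlu": BERTNLU, "rule_dst": RuleDST,
            "ppo_policy": PPOPolicy, "template_nlg": TemplateNLG}

# What a parsed YAML/JSON config might look like: one named choice per stage.
CONFIG = {"nlu": "bert_nlu", "dst": "rule_dst",
          "policy": "ppo_policy", "nlg": "template_nlg"}


def build_pipeline(config):
    """Instantiate each pipeline stage from its registered class.

    Swapping a component then means editing the config, not the code.
    """
    return {stage: REGISTRY[name]() for stage, name in config.items()}


agent = build_pipeline(CONFIG)
```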

3. Reinforcement Learning and Policy Optimization

ConvLab-3 formalizes policy learning for TOD as a Markov Decision Process (MDP) (Zhu et al., 2022):

  • State space S: Comprises the belief state b_t, database result d_t, and optional uncertainty/confidence features.
  • Action space A: Enumerated as multi-hot subsets of atomic triples (intent, domain, slot).
  • Transitions P(s'|s, a): Induced by user simulators and DST updates.
  • Reward R(s, a): Typical shaping includes an end-of-dialogue task-success bonus (+20), a per-turn penalty (−1), penalties for illegal acts, and slot-inform shaping (e.g., +0.5 per newly informed slot).
  • Discount: γ = 0.99 (standard).
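The reward shaping above can be condensed into a toy reward function; the per-turn, slot-inform, and success constants are the values quoted above, while the illegal-act penalty magnitude is an assumption:

```python
def shaped_reward(dialogue_done: bool, task_success: bool,
                  newly_informed_slots: int, illegal_act: bool) -> float:
    """Toy shaped reward for one turn, following the scheme described above."""
    r = -1.0                          # per-turn penalty encourages short dialogues
    r += 0.5 * newly_informed_slots   # slot-inform shaping
    if illegal_act:
        r -= 1.0                      # illustrative magnitude (assumption)
    if dialogue_done and task_success:
        r += 20.0                     # end-of-dialogue success bonus
    return r


# A successful final turn that informs two new slots:
# shaped_reward(True, True, 2, False) -> 20.0
```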

ConvLab-3 supports multiple RL methods:

  • REINFORCE
  • PPO (Proximal Policy Optimization)
  • V-Trace (for distributed RL)
  • DDPT (continual learning)
  • CLEAR (stability via experience replay)

Objective functions follow policy gradient and actor-critic formulations:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]

Reward shaping, action masking, and experience replay stabilize training and improve sample efficiency. Empirical results show that pre-training on large source datasets and then RL fine-tuning in domain-adapted or low-resource scenarios yields better task success, efficiency, and robustness.
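The returns G_t that weight each log-probability term in the policy gradient can be computed in a single backward pass; a self-contained sketch in pure Python (not ConvLab code):

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k>=t} gamma^(k-t) * r_k in O(T) by iterating backwards."""
    running, out = 0.0, []
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))


# With gamma = 1 and a -1 per-turn penalty followed by a success turn:
returns = discounted_returns([-1.0, -1.0, 19.0], gamma=1.0)
# returns -> [17.0, 18.0, 19.0]
```

In REINFORCE, each `∇ log π(a_t|s_t)` term is then multiplied by the corresponding `returns[t]` before averaging over trajectories.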

4. User Simulators and Evaluation Methodology

ConvLab provides diverse user simulators for both training and standardized evaluation (Zhu et al., 2022):

  • ABUS: Agenda-based, rule-driven state machine.
  • TUS (Transformer User Simulator): State-conditioned neural model predicting dialogue acts.
  • GenTUS/EmoUS: Generative, seq2seq models producing full utterances, optionally with emotion labels.

Comprehensive evaluation is performed using the unified_evaluator module, providing metrics such as:

  • Task Success Rate (strict/relaxed)
  • Average Reward/Return
  • Dialogue Length (#turns)
  • Average Number of Actions per Turn
  • Intent Distribution (for policy analysis)
  • Component-Specific (e.g., NLU act F1, DST joint goal accuracy, NLG BLEU/slot error rate)
  • User simulator metrics: act accuracy and naturalness (via slot error rate)
  • Manual Inspection: Full rollouts/exported transcripts

Cross-simulator evaluation highlights policy overfitting risks—policies may attain high task success against a specific simulation style (e.g., ABUS) but degrade with others (e.g., GenTUS), indicating the need for simulator diversity in policy validation.
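The cross-simulator check can be sketched as evaluating one policy against several simulator callables; the simulators below are random stand-ins, not the real ABUS or GenTUS:

```python
import random


def evaluate(policy, simulator, n_dialogues=100, seed=0):
    """Success rate of `policy` against one simulator; `simulator` is any
    callable returning True/False task success for a rollout (stand-in)."""
    rng = random.Random(seed)
    wins = sum(simulator(policy, rng) for _ in range(n_dialogues))
    return wins / n_dialogues


# Stand-ins: a lenient rule-based style and a stricter generative style.
abus = lambda policy, rng: rng.random() < 0.9
gentus = lambda policy, rng: rng.random() < 0.6

scores = {name: evaluate("my_policy", sim)
          for name, sim in {"ABUS": abus, "GenTUS": gentus}.items()}
# A large gap between scores["ABUS"] and scores["GenTUS"] flags a policy
# overfitted to one simulation style.
```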

5. Transfer Learning and Empirical Findings

ConvLab-3 facilitates transfer learning and robust cross-domain policy development (Zhu et al., 2022):

  • Supervised Pre-training & Fine-tuning: Models pretrained on large, heterogeneous datasets (e.g., SGD, Taskmaster) and fine-tuned on MultiWOZ 2.1 show substantial gains, especially with limited target data. For instance, with pre-training, T5-DST joint goal accuracy rises from 14.5% to 35.5% when only 1% of target data is available, and from 35.5% to 52.6% at 10%.
  • End-to-End Task Scores: Soloist, SC-GPT, and T5-NLG show analogous gains for NLG BLEU and combined end-to-end metrics.
  • RL Experiments: Policies initialized from pretrained weights converge faster and exhibit better sample efficiency.
  • Uncertainty-Aware Policies: Incorporating DST uncertainty scores into the policy input vector encodes epistemic uncertainty, driving agents to prefer clarifying/requesting acts and yielding increased robustness to upstream errors and NLG/intent noise.

These results highlight ConvLab's utility for studying transferability, low-resource learning, cross-simulator generalization, and the translation of component-level uncertainty into improved policy robustness.
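The uncertainty-aware policy input described above amounts to concatenating per-slot DST confidence scores onto the belief vector; a minimal sketch, where the feature layout is an illustrative assumption rather than ConvLab's exact vectoriser:

```python
def encode_with_uncertainty(belief_vector, slot_confidences):
    """Append per-slot DST confidence scores to the policy input so the
    policy can condition on epistemic uncertainty."""
    return list(belief_vector) + list(slot_confidences)


# Binary belief features plus one confidence score per tracked slot.
state = encode_with_uncertainty([1, 0, 1], [0.95, 0.40, 0.88])
# A low confidence (0.40) gives the policy a signal to prefer a
# clarifying/requesting act for that slot instead of committing.
```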

6. Diagnostic and Analysis Tools

ConvLab-2 introduces an analysis tool and an interactive web-based debugging interface (Zhu et al., 2020):

  • Analysis Tool: Runs mass simulations (e.g., 1,000 rollouts), reporting standard metrics, NLU confusion matrices, policy error counts, NLG “hard cases,” and root causes of dialogue loops. For example, Request-Hotel-Phone appeared as the dominant loop trigger in MultiWOZ’s hotel domain.
  • Interactive Tool: Allows real-time, turn-level inspection and manual correction of component outputs (NLU, DST, Policy, NLG), supporting rapid diagnosis and targeted system modification.

These tools enable systematic bottleneck identification, targeted model/component upgrades, and validation of global improvements via component-level fixes.
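A drastically simplified version of the loop-cause analysis can be sketched by tallying the act that immediately precedes a repeated turn; treating "same act twice in a row" as a loop is a crude proxy for the tool's actual detection:

```python
from collections import Counter


def dominant_loop_triggers(rollouts):
    """Count which system act immediately repeats, per rollout act sequence.

    `rollouts` is a list of act-label sequences; a 'loop' here is simply the
    same act emitted on two consecutive turns (a simplifying assumption).
    """
    triggers = Counter()
    for acts in rollouts:
        for prev, cur in zip(acts, acts[1:]):
            if prev == cur:
                triggers[cur] += 1
    return triggers.most_common()


rollouts = [["Inform-Hotel-Area", "Request-Hotel-Phone", "Request-Hotel-Phone"],
            ["Request-Hotel-Phone", "Request-Hotel-Phone"]]
# dominant_loop_triggers(rollouts) -> [("Request-Hotel-Phone", 2)]
```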

7. Usage, Extensibility, and Experiment Management

ConvLab emphasizes experiment reproducibility, configurability, and extension (Lee et al., 2019, Zhu et al., 2020, Zhu et al., 2022):

  • Experiment Configuration: All experimental runs, component choices, and hyperparameters are specified in a single JSON/YAML file, supporting both pipeline and end-to-end settings.
  • Component Extensibility: Adding custom NLU/DST/Policy/NLG modules requires subclassing the relevant base interface and registering the new class; integration is carried out via configuration files.
  • Dataset Expansion: New datasets are integrated by writing conversion scripts for the unified schema and, if necessary, implementing a database API wrapper.
  • Reproducibility and Sharing: All experiments are launched and managed from a single entry point, using Ray and SLM Lab for distributed and hyperparameter search workflows. Every run is fully reproducible and shareable.
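The subclass-and-register pattern above can be sketched as follows; the interface and registry names are assumptions, and the keyword-spotting "model" is a deliberately trivial stand-in:

```python
from abc import ABC, abstractmethod


class NLU(ABC):
    """Stand-in for the base NLU interface; the real class name differs."""

    @abstractmethod
    def predict(self, utterance: str) -> list:
        ...


COMPONENT_REGISTRY = {}


def register(name):
    """Decorator that makes a component selectable by name from a config."""
    def deco(cls):
        COMPONENT_REGISTRY[name] = cls
        return cls
    return deco


@register("keyword_nlu")
class KeywordNLU(NLU):
    """Toy NLU: keyword spotting in place of a trained model."""

    def predict(self, utterance: str) -> list:
        acts = []
        if "phone" in utterance.lower():
            acts.append(("Request", "Hotel", "Phone", ""))
        return acts


nlu = COMPONENT_REGISTRY["keyword_nlu"]()
acts = nlu.predict("What is the hotel's phone number?")
```

Once registered, the new component can be selected purely from the configuration file, with no change to the pipeline code.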

Minimal code for a complete pipeline—including loading, agent construction, RL training, and evaluation—is provided in the documentation, streamlining onboarding for both experienced researchers and newcomers.


ConvLab, across its generations, constitutes a comprehensive platform and toolkit for systematic, reproducible TOD research, enabling rigorous model comparison, transfer learning experiments, RL training, detailed evaluation, and component-level diagnosis in a unifying framework (Lee et al., 2019, Zhu et al., 2020, Zhu et al., 2022).
