
MMWOZ: Multimodal Dialogue Dataset

Updated 23 November 2025
  • MMWOZ is a multimodal extension of MultiWOZ that reinterprets dialogues as sequences of GUI operations with corresponding browser screenshots.
  • It standardizes a web-style interface schema across five domains using an automated conversion of belief states into precise GUI interactions.
  • The dataset supports robust evaluation metrics and baseline models, enabling research on integrated text and visual dialogue system development.

MMWOZ is a multimodal extension of the MultiWOZ benchmark designed to support practical task-oriented dialogue systems that interact through both natural language and real-world web-style graphical user interfaces (GUIs), addressing the limitation that existing datasets target only backend API calls. By reinterpreting MultiWOZ 2.3 dialogues as sequences of GUI operations, MMWOZ enables research on agents that combine textual and visual reasoning, operationalizing system actions as manipulations of web elements and providing richly annotated screenshots for each interaction step. The dataset's comprehensive data schema, automated conversion methodology, and associated baseline models give developers and researchers a robust platform for multimodal agent development and evaluation (Yang et al., 16 Nov 2025).

1. Motivation and Context

Traditional task-oriented dialogue systems (TODS) are fundamentally engineered to interact with backend APIs, such as invoking functions like find_restaurant(...) or book_hotel(...) given user constraints. However, commercial deployment scenarios often lack such APIs, instead exposing front-end GUIs for entity selection, filtering, or booking. This gap between symbolic API-based datasets and GUI-centric application environments is not addressed by canonical resources such as MultiWOZ 2.1–2.4 (Ye et al., 2021), which retain an API-oriented abstraction.

MMWOZ repurposes MultiWOZ 2.3 by interpreting system-side belief states and actions as a sequence of low-level GUI operations. These operations—clicks, text inputs, menu selections—are mapped directly onto a designed web interface, and their execution is captured visually using browser snapshots. The result is a dataset facilitating multimodal agent modeling grounded in both text and the GUI context (Yang et al., 16 Nov 2025).

2. Web-Style GUI Schema and Data Annotation

MMWOZ adopts a web-style interface spanning the five domains inherited from MultiWOZ: restaurant, hotel, attraction, train, and taxi, each with a standardized set of GUI panels. The interface consists of:

  • Header bar: Displays global session information.
  • Top menu: Enables domain switching.
  • Domain panels: Each with three subpanels:
    • Finding: Filters for attributes such as price, area, or type, and a "Search" button.
    • Information: Entity list, each clickable for detailed views.
    • Booking: Reservation form with a booking button.

The annotation process is fully automated. System actions and belief state changes from MultiWOZ 2.3 are converted into sequences of pixel-precise GUI operations by scripting (Algorithm 1). At each turn, GUI manipulations are recorded, and corresponding browser screenshots are captured programmatically. Each operation record encodes the action type (click, input), bounding box coordinates, domain context, panel, and value. Complete history is stored in a detailed JSON schema, interleaving dialogue turns, operation instruction lists, and image paths (Yang et al., 16 Nov 2025).
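
To make the conversion concrete, the following is a minimal, hypothetical sketch of mapping one system turn's slot updates to pixel-level operation records; the `PANEL_LAYOUT` table, the `convert_turn` helper, and the coordinate values are illustrative assumptions and do not reproduce the released Algorithm 1 script.

```python
# Hypothetical sketch of belief-state -> GUI-operation conversion.
# PANEL_LAYOUT, convert_turn, and the coordinates are illustrative only.

# Bounding boxes for interface elements, keyed by (domain, panel, element);
# in MMWOZ the real coordinates come from the rendered web interface.
PANEL_LAYOUT = {
    ("restaurant", "finding", "area"):   [120, 210, 320, 240],
    ("restaurant", "finding", "search"): [120, 400, 200, 430],
}

def convert_turn(domain: str, slot_updates: dict) -> list:
    """Map one turn's belief-state changes to ordered GUI operation records."""
    operations = []
    for slot, value in slot_updates.items():
        operations.append({
            "type": "input",                                     # fill a filter field
            "coords": PANEL_LAYOUT[(domain, "finding", slot)],   # [x1, y1, x2, y2]
            "panel": [domain, "finding", slot],
            "value": value,
        })
    # Close the turn by clicking the panel's "Search" button.
    operations.append({
        "type": "click",
        "coords": PANEL_LAYOUT[(domain, "finding", "search")],
        "panel": [domain, "finding", "search"],
    })
    return operations

print(convert_turn("restaurant", {"area": "centre"}))
```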

3. Dataset Composition and Statistics

After preprocessing and removal of dialogues whose automation scripts could not be realized, MMWOZ comprises:

  • Dialogues: 9,849
    • Train: 7,867
    • Dev: 990
    • Test: 992
  • Total turns: 109,558 (mean ≈ 14.09 per dialogue)
  • Total annotated screenshots: ≈236,700 (>2 images per system turn)
  • Operation types: "click" and "input"
    • Clicks dominate in restaurant, hotel, attraction
    • Inputs (form filling) predominate in train, taxi
  • Modalities per turn: natural language utterance (user/system), GUI operation annotation, screenshot images

Individual system turns exhibit on average 2.28 operations, each associated with an explicit GUI-manipulation instruction and a post-operation browser screenshot (Yang et al., 16 Nov 2025).

4. Multimodal Representations and Model Integration

Each dialogue instance associates the following modalities:

  • Text: Dialogue history up to the current turn.
  • Action log: Ordered set of GUI operations executed.
  • Image: Screenshot(s) post each operation, in PNG format.
  • OCR: Raw text parsed from screenshots using Tesseract OCR.
  • Visual features: Encoded using CLIP ViT-B/16 image encoder (frozen weights), projected to task-appropriate dimensions via a learned linear mapping.
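The visual pathway above can be sketched as follows, assuming the Hugging Face transformers implementation of CLIP; the 512-dimensional projection (matching the T5-small hidden size) and the use of the pooled output are assumptions rather than the released configuration.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Frozen CLIP ViT-B/16 image encoder; only the linear projection is trained.
# The 512-dim target (matching T5-small's hidden size) is an assumption.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
projection = torch.nn.Linear(encoder.config.hidden_size, 512)

def encode_screenshot(path: str) -> torch.Tensor:
    """Return projected visual features for one post-operation screenshot."""
    pixels = processor(images=Image.open(path), return_tensors="pt").pixel_values
    with torch.no_grad():                                   # keep CLIP weights frozen
        feats = encoder(pixel_values=pixels).pooler_output  # shape (1, 768)
    return projection(feats)                                # shape (1, 512), learned
```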

The baseline MATE (Multimodal Agent for Task-oriEnted dialogue) model is configured as follows:

  • Architecture:
    • Input: concatenated text (dialogue + operations), OCR text from latest screenshot, visual features.
    • Backbone: T5-small for action sequence and natural language generation.
    • Image encoder: frozen CLIP features projected linearly.
  • Training: cross-entropy loss over action sequences and natural-language outputs; batch=16, lr=5e-4, 10 epochs. Variants tested include MATE_text (no image encoding) and MATE_image (no OCR) (Yang et al., 16 Nov 2025).
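A rough sketch of a MATE-style training step on a T5-small backbone is shown below; the source template, target serialization, and example strings are assumptions rather than the released baseline format, and fusion of the projected CLIP features is omitted (making this closer to the MATE_text variant than the full model).

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Sketch of one MATE-style training step with a T5-small backbone.
# Input template and target serialization are assumptions; visual fusion omitted.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)   # lr from the paper

def training_step(dialogue_history: str, ocr_text: str, target: str) -> float:
    """Cross-entropy step over a serialized GUI-action or NLG target."""
    source = f"dialogue: {dialogue_history} screen: {ocr_text}"
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Hypothetical example: one turn's GUI operations serialized as the target.
training_step(
    "user: I need a cheap restaurant in the centre.",
    "Finding | Area | Price | Search",
    "input restaurant finding area centre ; click restaurant finding search",
)
```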

5. Annotation Schema and Data Access

Each annotated dialogue is a self-contained JSON object. Principal fields include:

  • "dialogue_id"
  • "domains"
  • "turns": list of per-turn dictionaries, each with:
    • "turn_id"
    • "role": "user" or "system"
    • "utterance"
    • "screen_annotation": list of { "operation", "snapshot_id", "image_path" }
  • Operation encoding: { type: "click"/"input", coords: [x1,y1,x2,y2], panel: [domain, section, element], value (optional) }
  • Images: Linked via "snapshot_id" and "image_path" fields

Image data (screenshots) and annotation JSONs are provided for open benchmarking and agent development (Yang et al., 16 Nov 2025).
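
As a minimal usage sketch, the snippet below iterates one annotation file following the fields listed above; the file name is hypothetical.

```python
import json

# Iterate one annotation file; the file name below is hypothetical, and the
# field names follow the schema listed above.
with open("mmwoz_dialogue_example.json") as f:
    dialogue = json.load(f)

print(dialogue["dialogue_id"], dialogue["domains"])
for turn in dialogue["turns"]:
    print(turn["turn_id"], turn["role"], turn["utterance"][:60])
    for record in turn.get("screen_annotation", []):
        op = record["operation"]
        # op resembles {"type": "click", "coords": [x1, y1, x2, y2],
        #               "panel": [domain, section, element], "value": ...}
        print("  ", op["type"], op["panel"], record["image_path"])
```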

6. Evaluation Metrics and Results

Evaluation of multimodal agent performance encompasses several complementary metrics:

  • Action-type accuracy:

    \mathrm{Acc}_{\mathrm{type}} = \frac{\#\{\text{turns with correct GUI-ops vs. NLG decision}\}}{T}

  • Location accuracy:

    \mathrm{Acc}_{\mathrm{loc}} = \frac{\#\{\text{operations with correctly predicted coordinates}\}}{O}

  • Command accuracy:

    \mathrm{Acc}_{\mathrm{cmd}} = \frac{\#\{\text{operations with correct type, coordinates, and value}\}}{O}

  • Snapshot-joint accuracy:

    \mathrm{Acc}_{\text{snap-joint}} = \frac{1}{S} \sum_{i=1}^{S} \mathbb{I}(\text{all operations at snapshot } i \text{ correct})

  • Turn-joint accuracy:

    \mathrm{Acc}_{\text{turn-joint}} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{I}(\text{all operations in turn } t \text{ correct})

  • Entity accuracy (NLG): proportion of system outputs containing correct entity attribute (e.g. phone, address).
  • BLEU: standard n-gram match for system responses.

Here T, O, and S denote the total numbers of evaluated turns, GUI operations, and snapshots, respectively. All metrics are operationalized to allow precise quantitative comparison of agent variants and modalities (Yang et al., 16 Nov 2025).
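
A simplified sketch of the operation-level accuracies is given below; it assumes predicted and gold operations are already aligned per turn and uses exact matching, so any coordinate tolerance in the official evaluation is not modeled.

```python
# Simplified sketch of operation-level accuracies; assumes per-turn alignment
# of predicted and gold operations and exact matching of fields.
def command_correct(pred: dict, gold: dict) -> bool:
    """Type, coordinates, and (optional) value must all match."""
    return (pred["type"] == gold["type"]
            and pred["coords"] == gold["coords"]
            and pred.get("value") == gold.get("value"))

def operation_accuracies(pred_turns: list, gold_turns: list) -> dict:
    ops_total = loc_hits = cmd_hits = turn_hits = 0
    for pred_ops, gold_ops in zip(pred_turns, gold_turns):
        pairs = list(zip(pred_ops, gold_ops))
        ops_total += len(pairs)
        loc_hits += sum(p["coords"] == g["coords"] for p, g in pairs)
        correct = sum(command_correct(p, g) for p, g in pairs)
        cmd_hits += correct
        # Turn-joint accuracy: every operation in the turn must be correct.
        turn_hits += int(correct == len(gold_ops) == len(pred_ops))
    return {
        "Acc_loc": loc_hits / ops_total,
        "Acc_cmd": cmd_hits / ops_total,
        "Acc_turn_joint": turn_hits / len(gold_turns),
    }
```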

7. Applications, Limitations, and Prospects

MMWOZ constitutes a reproducible and extensible benchmark for multimodal TOD system development. Researchers may:

  • Load full dialogues with their associated images and operation logs.
  • Train multimodal transformers capable of generating both GUI actions and text responses.
  • Evaluate integrated NLU + visual grounding using standardized accuracy metrics.

All necessary scripts—for dataset conversion, GUI schema instantiation, operation actuation, browser automation, screenshot capture, and baseline model training—are published alongside the dataset.

A plausible implication is that moving from symbolic API invocation to GUI-level action execution makes agent modeling directly applicable in practical web-agent settings, particularly in domains where backend access is restricted. However, the dataset is constrained to the domains and ontologies of MultiWOZ 2.3, and further generalization requires extending the conversion and annotation machinery to new service verticals and interface paradigms. The reliance on automated conversion and rendered snapshots minimizes manual annotation, but future development may benefit from richer UI layouts and modeling of interactive or ambiguous interface elements.

MMWOZ is a multimodal dialogue benchmark bridging text, GUI, and image modalities, supporting research and development of next-generation interactive agents (Yang et al., 16 Nov 2025).
