Portal Dialogue Corpus Overview
- Portal Dialogue Corpus is a multimodal dataset capturing situated, collaborative dialogue within Portal 2 gameplay with synchronized game logs, video, and audio.
- It comprises 11.5 hours of data with 24,500 utterances, supporting studies on spatial language, clarification, repair, and ad-hoc convention formation.
- The dataset enables detailed multimodal alignment between spoken dialogue and game-state events, facilitating research on dialogue grounding and pragmatic adaptation.
The Portal Dialogue Corpus is a multimodal dataset of situated, collaborative spoken dialogue, comprising 11.5 hours of transcribed two-person gameplay within the cooperative mode of the Portal 2 video game. Designed to support fine-grained analysis of language use in complex, real-time collaborative problem-solving settings, it includes manually and automatically annotated transcripts, synchronized with game state logs, video, and audio. The corpus provides resources for studying spatial language, clarification and repair, ad-hoc convention formation, and grounding phenomena rarely observed in chitchat or task-oriented corpora (Tomlin et al., 3 Dec 2025).
1. Data Collection Protocol
The corpus was assembled from 18 dyads (36 English-speaking adults) playing Portal 2’s Cooperative Testing Initiative mode. Participants, recruited via flyers and online postings, had varied levels of prior gaming experience. Each dyad completed a pre-survey on game experience and demographics (IRB UC Berkeley CPHS/OPHS #2023-12-17020). Sessions occurred in sound-isolated rooms with matched equipment, using Discord for in-game voice chat. Each player’s screen was recorded at 1920×1200 and 30 fps, with audio at 48 kHz, via OBS. The Portal 2 game engine recorded state “ticks” (1/60 s) logging player positions, orientations, object states, and custom events. A dedicated MARK keypress produced synchronized on-screen and internal log entries for alignment.
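Because the OBS recordings and the game logs run on separate clocks, the MARK keypress serves as a shared anchor: the offset between the two timelines can be estimated once per session and applied to every logged event. The following is a minimal sketch of that offset computation in Python; the field names (mark_video_s, mark_tick) are illustrative assumptions, not the corpus's actual log schema.

```python
TICK_RATE = 60  # Portal 2 logs game state at 1/60 s per tick

def clock_offset_s(mark_video_s: float, mark_tick: int) -> float:
    """Offset (seconds) that maps game-log time onto the video/audio timeline.

    mark_video_s: time of the on-screen MARK flash in the OBS recording
    mark_tick:    tick index of the corresponding MARK entry in the game log
    """
    return mark_video_s - mark_tick / TICK_RATE

def tick_to_video_time(tick: int, offset_s: float) -> float:
    """Convert any logged tick to a timestamp on the recording timeline."""
    return tick / TICK_RATE + offset_s

# Example: MARK flashed 12.40 s into the recording and was logged at tick 318
offset = clock_offset_s(12.40, 318)      # ≈ 7.10 s
print(tick_to_video_time(600, offset))   # tick 600 → ≈ 17.10 s in the video
```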
Dyads played Chapter 1 (six levels) and, if possible, Chapter 3 (eight levels); Chapter 2 was omitted for motion sickness mitigation. In total, 11 hours 25 minutes of gameplay were captured, containing 24,500 utterances and approximately 109,000 words (mean 4.4 tokens/utterance). Per-session utterance counts ranged from 577 to 2,045 (mean 1,365).
2. Annotation Framework
Transcription combined WhisperX ASR, forced alignment with Wav2Vec, and manual revision in Adobe Premiere, with segmentation guided by temporal, intonational, and semantic cues. Anonymization scrubbed personal identifiers, and Prolific workers conducted spot checks. Five annotation layers, adapted from DAMSL, were applied on a per-utterance basis (a schematic per-utterance record is sketched after the list):
- Communicative status: Success, Abandoned, Self-correction, Uninterpretable
- Information level: World State, World Rules, Task-Related, Communication Management, Affective Evaluation, Non-Task Related
- Uncertainty: None, Hedging, Certainty, Not Enough Info
- Utterance type: Proposition, Imperative, Query, Tag Question, Exclamation/Performative, Non-Sentential, etc.
- Discursive act: Offer/Option, Directive, Request for Info, Request for Clarification, Assertion, Justification, Speculation, Commit, Confirmation/Status, Acknowledgment, Rejection, Expressive
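Conceptually, the five layers yield one categorical record per utterance. The sketch below illustrates such a record in Python; the class, field names, and example values are hypothetical, abbreviate the full label inventories, and do not reproduce the released file format.

```python
from dataclasses import dataclass
from enum import Enum

class CommunicativeStatus(Enum):
    SUCCESS = "Success"
    ABANDONED = "Abandoned"
    SELF_CORRECTION = "Self-correction"
    UNINTERPRETABLE = "Uninterpretable"

class Uncertainty(Enum):
    NONE = "None"
    HEDGING = "Hedging"
    CERTAINTY = "Certainty"
    NOT_ENOUGH_INFO = "Not Enough Info"

@dataclass
class UtteranceAnnotation:
    """One per-utterance record across the five DAMSL-adapted layers."""
    utterance_id: str
    speaker: str
    text: str
    start_s: float                 # onset in the session recording
    end_s: float                   # offset in the session recording
    communicative_status: CommunicativeStatus
    information_level: str         # e.g. "Task-Related", "World State", ...
    uncertainty: Uncertainty
    utterance_type: str            # e.g. "Imperative", "Query", ...
    discursive_act: str            # e.g. "Directive", "Acknowledgment", ...

# Hypothetical example record
example = UtteranceAnnotation(
    "s03_u0142", "P2", "put blue on the angled wall", 412.86, 414.51,
    CommunicativeStatus.SUCCESS, "Task-Related",
    Uncertainty.NONE, "Imperative", "Directive",
)
```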
Manual inter-annotator agreement was measured on 176 utterances (3 levels; 4 annotators; text-only vs. audio+video), as summarized in Table 1.
| Cohen’s κ | Comm. Status | Info. Level | Uncertainty | Utt. Type | Disc. Act |
|---|---|---|---|---|---|
| Text only | 0.68 | 0.61 | 0.39 | 0.72 | 0.58 |
| Audio + video | 0.66 | 0.62 | 0.39 | 0.72 | 0.56 |
κ scores indicate “substantial” agreement (κ > 0.60) for Communicative Status, Information Level, and Utterance Type; moderate for Uncertainty and Discursive Act.
Automatic annotation using GPT-4o labeled the full 24.5K utterance set. Agreement between GPT-4o and each human annotator is presented in Table 2.
| Cohen’s κ | Comm. Status | Info. Level | Uncertainty | Utt. Type | Disc. Act |
|---|---|---|---|---|---|
| GPT-4o | 0.48 | 0.44 | 0.30 | 0.52 | 0.28 |
These gaps suggest that automatic labeling, particularly for the Discursive Act and Uncertainty layers, has substantial room for improvement.
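The agreement figures in Tables 1 and 2 are standard Cohen's κ over paired label sequences, which can be reproduced for any layer given two aligned label lists. A minimal sketch with scikit-learn, using toy labels rather than corpus data:

```python
from sklearn.metrics import cohen_kappa_score

# Toy example: Discursive Act labels from a human annotator vs. GPT-4o
human = ["Directive", "Acknowledgment", "Assertion", "Directive", "Expressive"]
gpt4o = ["Directive", "Acknowledgment", "Directive", "Directive", "Expressive"]

# Chance-corrected agreement between the two label sequences (≈ 0.71 here)
kappa = cohen_kappa_score(human, gpt4o)
print(f"Cohen's kappa: {kappa:.2f}")
```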
Manually annotated subtask completion timestamps (Chapter 1 only) enable mapping of behavioral progress and grouping of atomic subtasks into higher-level actions.
3. Catalog of Linguistic Phenomena
The corpus documents a spectrum of interactional and pragmatic features emergent in situated collaboration:
Ad-hoc Convention Formation:
Players collectively constructed referential shorthand and shared labels for novel entities, e.g., light bridge “catch.” Metonymic usage (e.g., “blue” for “blue portal”) and procedural abstractions (e.g., “do the same trick”) are widespread.
Spatial Reference and Perspective-Taking:
Directives using deictic or relative terms (“left,” “right”) frequently require frame-of-reference negotiation, as illustrated in exchanges clarifying whether “right” is egocentric or partner-relative.
Clarification and Repair Sequences:
There are 532 Dialogue Act–tagged “Request for Clarification” utterances (other-initiation) and 1,922 “Self-Correction” utterances. These sequences reveal the dynamic management of referential uncertainty and error correction in collaborative problem-solving.
This suggests the dataset is highly suitable for research on perspective alignment, convention formation, repair, and grounding phenomena seldom encountered in more formulaic dialogue corpora.
4. Quantitative Statistics and Dialogue Structure
Utterance and Vocabulary:
24,500 utterances total; 109,000 word tokens; mean 4.4 tokens/utterance, with utterance lengths roughly following an exponential distribution. The vocabulary comprises ~6,200 unique types with a Zipfian frequency distribution.
Turn-Taking Timings (see the computation sketch after this list):
- Silence (gap): median 666 ms
- Overlap (next turn begins before end of previous): median –506 ms
- For comparison, the CANDOR corpus yields 380 ms gap, –410 ms overlap
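Gap and overlap follow the usual floor-transfer-offset convention: the incoming speaker's start time minus the outgoing speaker's end time, negative when turns overlap. A minimal sketch of that computation, assuming per-turn timings in a hypothetical (speaker, start, end) format:

```python
from statistics import median

def floor_transfer_offsets(turns):
    """turns: list of (speaker, start_s, end_s) tuples sorted by start time.

    Returns offsets in ms between consecutive turns by different speakers:
    positive = silent gap, negative = overlap.
    """
    offsets = []
    for prev, nxt in zip(turns, turns[1:]):
        if prev[0] != nxt[0]:                      # only speaker changes count
            offsets.append((nxt[1] - prev[2]) * 1000.0)
    return offsets

# Toy turns: a 650 ms gap, then a 300 ms overlap
turns = [("A", 0.00, 1.20), ("B", 1.85, 2.40), ("A", 2.10, 3.00)]
ftos = floor_transfer_offsets(turns)
gaps = [o for o in ftos if o >= 0]
overlaps = [o for o in ftos if o < 0]
print(round(median(gaps)), round(median(overlaps)))   # 650 -300
```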
Task Progress and Speed:
Mean completion time per level is 5 min 50 s (SD 2 min 58 s). Twelve dyads progressed to Chapter 3, with the fastest dyad completing 14 levels (median 8). Prior Portal experience predicted faster completion (β = –6.286; p < 0.05).
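The reported effect corresponds to a regression of completion time on prior experience, where a negative coefficient means experienced dyads finish faster. The sketch below reproduces only the shape of that analysis with statsmodels on synthetic data; the column names are hypothetical and the fitted coefficient will not match the reported β.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data only: per-dyad completion time (minutes per level) and a
# prior-Portal-experience indicator; not the corpus's actual values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"prior_portal": rng.integers(0, 2, size=18)})
df["completion_min"] = 6.0 - 2.0 * df["prior_portal"] + rng.normal(0, 1.5, size=18)

# Ordinary least squares: experience as a predictor of completion speed
model = smf.ols("completion_min ~ prior_portal", data=df).fit()
print(model.params["prior_portal"], model.pvalues["prior_portal"])
```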
Dialogue Act Distributions (automatic labels):
- Communicative Status: Success >80%, Abandoned ≈2%, Self-Correction ≈8%, Uninterpretable ≈1%
- Information Level: Task-Related ≈35%, World State ≈25%, Communication ≈20%, Affective ≈10%, Non-Task ≈10%
- Uncertainty: Hedging ≈45%, None ≈50%, Certainty ≈5%
- Utterance Type: Non-sentential backchannels ≈30%, Queries ≈15%, Imperatives ≈20%, Propositions ≈25%, Others ≈10%
- Discursive Act: Expressives ≈20%, Acknowledgments ≈15%, Directives ≈15%, Offers/Options ≈10%, Confirmations ≈10%, remainder distributed
5. Multimodal Integration and Alignment
Game engine “ticks” (1/60 s) enabled logging of:
- Player (x, y, z) positions and (yaw, pitch) orientations
- Active portals: position, surface normal, color
- Object (cubes, lasers, buttons) positions, velocities, states (active/inactive)
- “MARK” events and subtask completions
Temporal alignment is anchored by the MARK keypress (which produces matched on-screen and internal log events), in-game event triggers, and waveform matching in Adobe Premiere. Utterance-to-state mapping uses a CSV join on the nearest tick; optional interpolation provides higher temporal resolution for agent modeling. The dataset thus supports detailed temporal mapping of linguistic acts onto in-game context for grounded language analysis.
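A minimal sketch of the nearest-tick join using pandas merge_asof, assuming hypothetical column names (time_s for utterance onsets already mapped onto the game clock, tick for game-state rows); the released CSV schema may differ.

```python
import pandas as pd

TICK_RATE = 60  # game state is logged every 1/60 s

# Hypothetical frames: utterances with onset times and per-tick state rows.
utts = pd.DataFrame({
    "utt_id": ["u1", "u2"],
    "time_s": [12.34, 15.01],
    "text": ["put blue there", "okay going up"],
})
state = pd.DataFrame({"tick": range(0, 1200)})
state["x"] = 0.5 * state["tick"]          # stand-in for a logged player position
state["time_s"] = state["tick"] / TICK_RATE

# Nearest-tick join: each utterance gets the game-state row closest in time.
aligned = pd.merge_asof(
    utts.sort_values("time_s"),
    state.sort_values("time_s"),
    on="time_s",
    direction="nearest",
)
print(aligned[["utt_id", "tick", "x"]])
```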
6. Scope, Limitations, and Recommended Use Cases
Potential Biases and Gaps:
- Domain bias restricts generalizability to Portal 2 co-op mode
- Participant bias from self-selected gamers and IRB consent implications
- Annotation coverage incomplete: only Chapter 1 fully subtask-labeled
- Moderate inter-annotator agreement for Uncertainty and Discursive Act; automatic GPT-4o labels show limited consistency (κ down to 0.28)
Recommended Applications:
- Development of situated and task-oriented dialogue models, particularly those requiring multimodal grounding and negotiation
- Spatial language and reference frame research
- Modeling of conversational grounding, clarification, repair, and convention formation in dynamic environments
- Multimodal policy learning for embodied AI integrating language, vision, and symbolic state
- Pragmatic adaptation analysis: how agents dynamically develop and deploy conventions
Suggested Extensions:
- Extension of subtask annotations to later chapters; a larger dyad sample for cross-level analysis
- Domain-specific fine-tuning of automatic dialogue-act tagging
- Cross-modal state tracking models that leverage aligned utterance and game-state data
The Portal Dialogue Corpus constitutes a resource for granular analyses of collaborative language in situated, referential, and high-uncertainty contexts, and establishes a foundation for benchmarking dialogue, grounding, and pragmatic adaptation phenomena in multimodal, embodied settings (Tomlin et al., 3 Dec 2025).