Oolong-real: Long-Context Reasoning Benchmark
- Oolong-real is a benchmark that assesses long-context reasoning by evaluating models on multi-step aggregation tasks over D&D transcripts.
- It requires models to perform precise counting, distributional analysis, enumeration, and temporal aggregation using real noisy conversation data.
- The benchmark overcomes limitations of retrieval-focused evaluations by emphasizing comprehensive aggregation across extended, multi-episode inputs.
Oolong-real is a downstream benchmark assessing the long-context reasoning and aggregation capabilities of LLMs on real-world conversational data. As a component of the Oolong benchmark suite, it targets tasks that require atomic-level analysis of text units within extended transcripts, followed by non-trivial aggregation: counting, distributional analysis, enumeration, and temporal aggregation. Designed to overcome the limitations of prior long-context evaluations, which focus primarily on retrieval, Oolong-real introduces evaluation conditions that require models to attend to and aggregate over substantial portions (or the entirety) of the input, thereby probing their ability to accumulate, filter, and compose information across challenging, naturally occurring dialogue.
1. Problem Definition and Motivations
Oolong-real entails question-answering over transcripts from live-play Dungeons & Dragons (D&D) episodes. A single episode transcript averages roughly 55K tokens, and multi-episode contexts reach up to 175K tokens in the standard evaluation (and up to 1.3M tokens in principle). Models are evaluated at three context lengths, approximately 55K, 118K, and 175K tokens, corresponding to 1, 2, or 3 consecutive episodes. Questions span:
- Counting: “How many total dice rolls were made in episode 3?”
- Distributional: “What percentage of all rolls across episodes 1–3 were natural 20s?”
- Enumeration/Indexing: “What is the third spell cast in episode 2?”; “List the last spell cast in each of episodes 1–3.”
- Cumulative/Temporal: “By the end of episode k, how many spells have been cast in total?”
Each query may necessitate (a) filtering all or a relevant subset of utterances (e.g., by character), (b) performing a classification or counting operation on filtered segments, and (c) aggregating these operations (with possible temporal constraints or cross-episode reasoning).
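The following is a minimal sketch of this filter-classify-aggregate decomposition applied to a toy transcript; the `Utterance` schema, speaker tags, and the regex-based natural-20 classifier are illustrative assumptions, not the benchmark's released code (real annotation is far more careful than keyword matching).

```python
import re
from dataclasses import dataclass

@dataclass
class Utterance:
    """One speaker turn from a transcript (illustrative schema)."""
    episode: int
    speaker: str
    text: str

def count_nat20s_by_end_of(utterances, up_to_episode, character=None):
    """(a) filter utterances (by episode range and optionally by speaker),
    (b) classify each kept utterance as a natural-20 mention or not,
    (c) aggregate by counting over the filtered, classified set."""
    nat20 = re.compile(r"\bnatural 20\b", re.IGNORECASE)
    kept = [u for u in utterances
            if u.episode <= up_to_episode
            and (character is None or u.speaker == character)]
    return sum(1 for u in kept if nat20.search(u.text))

# Toy example: two episodes, one natural 20 in each.
transcript = [
    Utterance(1, "MATT", "Go ahead and roll for initiative."),
    Utterance(1, "LAURA", "That's a natural 20!"),
    Utterance(2, "SAM", "Natural 20, baby."),
]
print(count_nat20s_by_end_of(transcript, up_to_episode=1))  # -> 1
print(count_nat20s_by_end_of(transcript, up_to_episode=2))  # -> 2
```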
The benchmark targets prevalent open problems in scaling LLM context windows: determining not just whether models can retrieve isolated information, but whether they can ingest lengthy, noisy, and heterogeneously formatted data and perform fine-grained, multi-step aggregation.
2. Dataset Construction
2.1 Data Sources
The primary textual data stems from the “Critical Role Dungeons & Dragons Dataset (CRD3)” (Rameshkumar & Bailey, 2020), extracted from Campaign 1, comprising 115 episodes, each representing a contiguous, ~4–5 hour live session. Episodes yield full transcripts, including speaker turns with annotations for non-standard utterances (e.g., dice rolls, system messages).
Supporting annotations, which label all dice-roll and spell-cast events, are sourced from the "CritRoleStats" project, a fan-maintained effort that tallies, with multi-pass validation, per-episode statistics on dice roll outcomes (by value, character, roll type, natural 20/1) and spells (type, caster, level, and episode-local ordering).
2.2 Scale and Annotation Protocol
Contexts for evaluation include:
- Single-episode: 115 episodes, avg ≈ 55K tokens, max ≈ 70K.
- Multi-episode: up to 3-episode sequences (≈ 55K, 118K, 175K tokens); scalable up to 24 episodes (≈ 1.3M tokens).
Annotation involves:
- Cataloging every dice roll with associated character, player, roll type, outcome, and “naturalness.”
- Identifying all spell-cast utterances with full metadata: character, player, spell name, spell level, base level, and in-episode order.
These annotations serve as the reference (“gold”) for all downstream answer key derivation.
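A plausible shape for these gold annotation records, written as Python dataclasses; the exact field names and types in the released data may differ, so treat this as an assumed schema that mirrors the metadata listed above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiceRoll:
    """One annotated dice roll (assumed fields: character, player,
    roll type, outcome, and 'naturalness')."""
    episode: int
    character: str
    player: str
    roll_type: str            # e.g., "attack", "perception", "damage"
    outcome: Optional[int]    # total rolled value, if recorded
    natural: Optional[int]    # raw d20 face; 20 or 1 flag the "natural" cases

@dataclass
class SpellCast:
    """One annotated spell-cast utterance with its in-episode ordering."""
    episode: int
    character: str
    player: str
    spell_name: str
    cast_level: int           # level at which the spell was cast
    base_level: int           # the spell's base level
    order_in_episode: int     # 1-based position among the episode's casts
```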
3. Task Taxonomy and Evaluation Metrics
3.1 Task Categories
Tasks in Oolong-real are systematically organized by the aggregative reasoning required:
| Category | Representative Questions |
|---|---|
| Counting / Distributional | “Total number of rolls in this episode?”<br>“What percent of rolls were nat 20s?” |
| Enumeration / Indexing | “What is the nth spell cast in episode m?”<br>“List the last spell cast in each episode.” |
| Cumulative/Temporal | “By episode k, how many rolls have been made?”<br>“Total spells of type X by end of episode k?” |
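Each row of this taxonomy reduces to a small aggregation over the annotation records. The helpers below, reusing the assumed DiceRoll/SpellCast shapes sketched earlier, illustrate how gold answers for the four question styles can be derived; they are not the benchmark's reference implementation.

```python
def total_rolls(rolls, episode):
    """Counting: total number of rolls in a given episode."""
    return sum(1 for r in rolls if r.episode == episode)

def nat20_percentage(rolls, episodes):
    """Distributional: percent of rolls across the given episodes that were natural 20s."""
    pool = [r for r in rolls if r.episode in episodes]
    if not pool:
        return 0.0
    return 100.0 * sum(1 for r in pool if r.natural == 20) / len(pool)

def nth_spell(spells, episode, n):
    """Enumeration/indexing: name of the nth spell cast in an episode."""
    ordered = sorted((s for s in spells if s.episode == episode),
                     key=lambda s: s.order_in_episode)
    return ordered[n - 1].spell_name if len(ordered) >= n else None

def spells_by_end_of(spells, k):
    """Cumulative/temporal: total spells cast by the end of episode k."""
    return sum(1 for s in spells if s.episode <= k)
```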
3.2 Formal Evaluation Metrics
- Single-string or label answers (e.g., character/spell names): matched against the gold label.
- Numeric answers (counts, percentages): scored with credit that decreases as the prediction deviates from the gold value.
- List-valued answers: precision, recall, and F1 on set overlap with the gold list.
Scoring grants partial credit for close but incorrect numeric answers, while set-valued evaluations reward partial enumeration overlap.
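The exact scoring formulas are not reproduced here; the sketch below implements one reasonable reading of the description above (exact match for labels, partial credit that decays with relative numeric error, set-overlap F1 for lists) and should be treated as an assumption rather than the official scorer.

```python
def score_label(pred: str, gold: str) -> float:
    """String/label answers: exact match after light normalization."""
    return float(pred.strip().lower() == gold.strip().lower())

def score_numeric(pred: float, gold: float) -> float:
    """Numeric answers: partial credit decaying with relative error
    (assumed form; the benchmark's exact curve may differ)."""
    if gold == 0:
        return float(pred == 0)
    rel_err = abs(pred - gold) / abs(gold)
    return max(0.0, 1.0 - rel_err)

def score_list(pred: list, gold: list) -> float:
    """List-valued answers: F1 over set overlap."""
    p, g = set(map(str.lower, pred)), set(map(str.lower, gold))
    if not p or not g:
        return 0.0
    precision = len(p & g) / len(p)
    recall = len(p & g) / len(g)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```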
4. Experimental Protocol and Baselines
4.1 Input Pipeline and Prompting
Prompts are constructed as follows:
- Brief natural language description of the aggregation/statistics task, e.g., “You are given the episode transcript and statistics are to be returned in \boxed{…}.”
- Full transcript input, preserving line-level speaker tags.
- Final prompt line contains the task question, with explicit instruction to “Answer in \boxed{…}.”
- No chain-of-thought exemplars are provided; models are required to perform all intermediate reasoning internally.
Evaluation includes single- and multi-episode sequences, up to 175K tokens (with models supporting ≥200K token context lengths).
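A minimal sketch of how such a prompt might be assembled and how the final \boxed{…} answer can be parsed from a model's output; the instruction wording and function names here are placeholders, not the benchmark's released harness.

```python
import re

def build_prompt(transcript: str, question: str) -> str:
    """Assemble the zero-shot prompt: brief task description, full transcript
    with speaker tags preserved, then the question with the \\boxed{} instruction."""
    return (
        "You are given a transcript of a Dungeons & Dragons episode. "
        "Answer the statistics question about it.\n\n"
        f"{transcript}\n\n"
        f"Question: {question}\n"
        "Answer in \\boxed{...}."
    )

def extract_boxed(output: str):
    """Pull the last \\boxed{...} answer out of the model's raw output."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", output)
    return matches[-1].strip() if matches else None

print(extract_boxed(r"The count is \boxed{42}."))  # -> 42
```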
4.2 Models and Inference Settings
Nine models are benchmarked:
- Proprietary (API): GPT-5, GPT-5-mini, GPT-5-nano, o3, o4-mini, Gemini-2.5-Pro, Claude-Sonnet-4
- Open-weights: Deepseek-R1, Llama-4-Maverick
Inference is via deterministic decoding (temperature=0.0, top_p=1.0) with ample output-token budget to avoid truncation.
A random baseline is established: for each task type, output is sampled uniformly over valid answer sets, with numeric scores calculated accordingly.
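One way to realize such a baseline is to sample each prediction uniformly from a pool of valid answers for the task type, as in the sketch below; the pool construction shown is an assumption, since the paper's exact sampling procedure is only summarized here.

```python
import random

def random_baseline(task_type: str, answer_pools: dict):
    """Sample a prediction uniformly over the valid answers for this task type."""
    return random.choice(answer_pools[task_type])

# Illustrative pools of valid answers per task type (made-up values).
pools = {
    "counting": list(range(0, 400)),               # plausible per-episode roll counts
    "enumeration": ["Fireball", "Healing Word"],   # spell names observed in the data
}
random.seed(0)
print(random_baseline("counting", pools))
```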
5. Performance Analysis and Error Modes
5.1 Quantitative Performance
Empirical results demonstrate a substantial performance drop as context length increases. Table 1 summarizes model averages over three context windows:
| Model | Avg. | 55K | 118K | 175K |
|---|---|---|---|---|
| GPT-5 | 0.470 | 0.5874 | 0.4572 | 0.3653 |
| Gemini-2.5-Pro | 0.530 | 0.6012 | 0.5081 | 0.4793 |
| o3 | 0.367 | 0.5057 | 0.3357 | 0.2599 |
| GPT-5-mini | 0.346 | 0.4986 | 0.2990 | 0.2389 |
| Claude-Sonnet-4 | 0.368 | 0.5058 | 0.3298 | 0.2670 |
| o4-mini | 0.271 | 0.4169 | 0.2177 | 0.1793 |
| GPT-5-nano | 0.311 | 0.4309 | 0.2682 | 0.2323 |
| Deepseek-R1 | 0.320 | 0.4785 | 0.2735 | 0.2081 |
| Llama-4-Maverick | 0.021 | 0.0248 | 0.0211 | 0.0162 |
From 55K to 175K tokens, most models lose roughly 20–27 points, while Gemini-2.5-Pro degrades more gradually (≈ 12 points) and Llama-4-Maverick scores near zero at every length. Notably, even the leading proprietary models (Gemini-2.5-Pro, GPT-5) average no more than 0.53 across the three context lengths; at the most favorable length (55K) the best score is only ≈ 0.60, and at 175K no model exceeds 0.48.
5.2 Failure Modes
Commonly observed failure points include:
- Token-budget exhaustion: Models sometimes fail to generate answers within output limits when required to reason across long contexts.
- Speaker confusion: Errors in attributing actions to characters, particularly with frequent use of pronouns or aliases.
- Misordered enumeration: Incorrectly resolving event order (e.g., first/second spell) when conversational interleaving of events is dense.
- Over-aggregation/hallucination: Inflating counts by pattern-matching on spurious cues (e.g., interpreting "roll for damage" as a new roll event).
- Premature termination/refusal: Declining to answer on claimed grounds of excessive context or task infeasibility.
6. Interpretation of Results and Future Challenges
6.1 Insights on Real-World Aggregation
Generalization to real conversations exposes substantial limitations even in contemporary frontier models. Key insights include:
- Filtering of off-topic or irrelevant utterances remains unreliable, hindering selectivity amidst conversational noise.
- Drop-off in performance with context length is comparable to synthetic aggregation tasks, highlighting that mixed-format heterogeneity compounds pre-existing long-context challenges.
- Model strengths diverge: Gemini-2.5-Pro demonstrates robustness to output length constraints, while Deepseek-R1 shows improved relative performance on authentic conversational data compared to synthetic settings.
6.2 Ongoing and Proximal Research Directions
Persistent difficulty motivates several research avenues:
- Single-pass context chunking/retrieval: Mechanisms to isolate and aggregate only task-relevant utterances in a long, unstructured context.
- Hierarchical or memory-augmented architectures: Models that synthesize and reference intermediate summaries, entity states, and event chains during inference.
- Enhanced prompt disambiguation: Systematic injection of metadata for resolving pronouns, aliases, and conversational ambiguity.
- Augmented evaluation protocols: Incorporation of "gold" chain-of-thought annotations to localize errors (filtering, classification, pooling).
- Domain generalization: Extension to other naturalistic conversational corpora (e.g., customer service, multi-party meetings) to assess task spectrum and transfer.
Release of Oolong-real data, gold labels, and an evaluation scaffold supports ongoing development of modeling and prompting techniques targeting compositional, multi-step reasoning over extended, noisy, real-world dialogue—realizing a critical next step in long-context LLM evaluation and development.