Oolong-synth Benchmark for Long-Context LLMs
- Oolong-synth is a synthetic benchmark suite designed to rigorously test LLMs' long-context reasoning and aggregation abilities via distributed classification tasks across extensive token windows.
- It constructs evaluation instances using atomic classification subtasks combined with global aggregation questions, emulating real-world multi-step reasoning challenges.
- Empirical results show a significant drop in accuracy with increased context length, highlighting limitations in memory retention and output control among current LLMs.
Oolong-synth is a synthetic benchmark suite designed to rigorously evaluate the long-context reasoning and aggregation abilities of LLMs. Developed as part of the Oolong benchmark, Oolong-synth addresses limitations in earlier long-context evaluations, which primarily focused on retrieval tasks rather than the demanding multi-step reasoning and global aggregation seen in real-world applications that require processing and integrating information across hundreds of thousands of tokens.
1. Motivation and Targeted Reasoning Capabilities
The impetus for Oolong-synth stems from the observation that prevailing long-context evaluations often reduce to simple retrieval—locating a “needle in a haystack”—as opposed to requiring comprehensive reasoning over many local units. In realistic scenarios such as user log analysis, tracking label distributions over time, or summarizing trends, a model must:
- Identify relevant text spans distributed across a vast context
- Perform localized reasoning subtasks (e.g., sentence-level classification)
- Aggregate the results into a global answer through operations like counting or comparison
Oolong-synth formalizes such a paradigm by decomposing the problem into atomic classification subtasks and subsequent distributional aggregation, thereby directly testing a model’s ability to (i) manage per-example decisions at scale and (ii) execute accurate global aggregation, even as context-window sizes approach hundreds of thousands of tokens.
2. Task Formalism
Each Oolong-synth instance is constructed as follows. Let $\mathcal{Y}$ denote a finite label set (e.g., sentiment labels such as positive/negative) drawn from a conventional short-context classification task. The context window of $N$ examples is configured as
$$C = \big[(x_1, u_1, d_1),\ (x_2, u_2, d_2),\ \ldots,\ (x_N, u_N, d_N)\big],$$
where:
- $x_i$: the $i$-th input text
- $y_i \in \mathcal{Y}$: its gold label (hidden from the model)
- $u_i$: a synthetic user-ID (sampled to simulate realistic user distributions)
- $d_i$: a synthetic date (spanning a 40-month interval)
The model only observes $(x_i, u_i, d_i)$ for all $i$ and must infer each $y_i$ implicitly. After the context is presented, a distributional question is posed, which falls into one of three broad families:
- Counting: e.g., "Which label is most frequent in $C$?", "How many examples have label $y$?"
- User-conditioning: e.g., "Among the users in $C$, who produced the most negative examples?"
- Temporal: e.g., "Was label $y$ more common before or after date $d$?", "In how many months did label $y_a$ outnumber label $y_b$?"
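To make the formalism concrete, the following sketch lays out a toy instance in Python. The per-example user/date metadata and the trailing distributional question follow the setup above; the literal prompt template and the `render_instance` helper are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Example:
    text: str    # x_i: the input text
    user: str    # u_i: synthetic user-ID
    date: str    # d_i: synthetic date (month granularity assumed)
    label: str   # y_i: gold label, withheld from the model

def render_instance(examples: list[Example], question: str) -> str:
    """Lay out a toy Oolong-synth-style window: metadata plus text per example,
    followed by a single distributional question (template is hypothetical)."""
    lines = [f"User: {ex.user} Date: {ex.date} Text: {ex.text}" for ex in examples]
    lines.append(question)
    return "\n".join(lines)

prompt = render_instance(
    [Example("Great movie!", "u042", "2023-07", "positive"),
     Example("Terrible plot.", "u007", "2024-01", "negative")],
    "Which label is most frequent in the context: positive or negative?",
)
```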
Numeric answers (counts, percentages) are graded by a proximity-based metric that awards full credit for an exact match and partial credit that decays as the model's output moves farther from the gold answer. For categorical or comparative answers (label, user, date), exact matches are required.
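A minimal grading sketch is shown below. The linear relative-error decay in `grade_numeric` is an illustrative assumption rather than the benchmark's exact formula; `grade_categorical` simply enforces exact matching.

```python
def grade_numeric(pred: float, gold: float) -> float:
    """Proximity-based partial credit for count/percentage answers.
    The linear relative-error decay is an assumed form for illustration:
    1.0 for an exact match, falling toward 0.0 as the error grows."""
    if gold == 0:
        return 1.0 if pred == 0 else 0.0
    return max(0.0, 1.0 - abs(pred - gold) / abs(gold))

def grade_categorical(pred: str, gold: str) -> float:
    """Exact match for label / user / date / comparative answers."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0

print(grade_numeric(47, 50))   # 0.94: near-correct count earns partial credit
```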
3. Synthetic Dataset Construction
The Oolong-synth dataset is meticulously constructed to ensure both challenge and reliability:
- Source Tasks: Ten widely studied text-classification datasets (e.g., Spam, AGNews, IMDB, MultiNLI) serve as the basis, each spanning 2–10 labels. To filter out labeling noise, two mid-tier models (GPT-4.1-nano and Llama 4-Maverick) operate in zero-shot mode; any example misclassified by both is removed, excising up to approximately 0.6% of any set.
- Context Sizing: Average token length per example, including synthetic “User: ” and “Date: ” metadata, is estimated using the Llama 2 tokenizer. The maximal $N$ is selected to fill roughly 95% of a target context budget (powers of two from 1K to 4M tokens, with focused reporting from 8K to 128K tokens).
- User & Date Sampling: User-IDs are drawn from a Zipf-like distribution (80% of examples come from 20% of user-IDs) and dates are sampled uniformly over a 40-month window, both with replacement, to simulate repeated activity and naturalistic usage patterns (see the sampling sketch below).
- Question Sampling: For each context length, two independent windows per source dataset are sampled, each receiving 25 unique distributional questions, yielding 50 questions per dataset per context length.
The construction enables systematic ablation and analysis by controlling local and global reasoning difficulty.
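A minimal sketch of the user and date sampling follows; the user-pool size, Zipf exponent, and month granularity are assumed parameters chosen for illustration, since only the 80/20 concentration and the 40-month window are stated above.

```python
import random
from datetime import date

def sample_user_ids(n_examples: int, n_users: int = 100, zipf_a: float = 1.2) -> list[str]:
    """Zipf-like sampling with replacement: a small fraction of users accounts
    for most examples (roughly the 80/20 concentration described above).
    Pool size and exponent are illustrative assumptions."""
    weights = [1.0 / (rank + 1) ** zipf_a for rank in range(n_users)]
    return [f"u{random.choices(range(n_users), weights=weights, k=1)[0]:03d}"
            for _ in range(n_examples)]

def sample_dates(n_examples: int, start: date = date(2021, 1, 1), n_months: int = 40) -> list[str]:
    """Uniform sampling with replacement over a 40-month window (month granularity assumed)."""
    months = [(start.year + (start.month - 1 + m) // 12,
               (start.month - 1 + m) % 12 + 1) for m in range(n_months)]
    return ["%04d-%02d" % random.choice(months) for _ in range(n_examples)]
```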
4. Evaluation Methodology
Evaluation is performed using several rigorously defined metrics:
- Exact-match accuracy is reported for categorical, user-ID, date, and comparative outputs.
- Graded numeric scoring for counts and percentages follows the proximity-based metric defined above, granting partial credit for near-correct responses.
- Principled random baseline: incorporates (i) uniform random selection among valid labels, (ii) informed numeric guesses for count and percentage questions, and (iii) random choice among observed dates/users (see the sketch below).
- Output parsing relies on template-matching; truncated or malformed responses are scored as zero.
This methodology ensures that models are not rewarded for superficial matching or gaming the task format.
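A sketch of such a baseline is given below. The uniform-split expected count used for numeric questions is an assumed guessing rule, since the exact value is not stated above.

```python
import random

def random_baseline(question_type: str, labels: list[str], n_examples: int,
                    observed_users: list[str], observed_dates: list[str]):
    """Principled random baseline covering the three cases above.
    The numeric guess (expected count under a uniform label split) is an
    illustrative assumption, not the benchmark's documented rule."""
    if question_type == "label":          # e.g., "Which label is most frequent?"
        return random.choice(labels)
    if question_type == "count":          # e.g., "How many examples have label y?"
        return n_examples / len(labels)   # assumed: uniform-split expectation
    if question_type == "user":
        return random.choice(observed_users)
    if question_type == "date":
        return random.choice(observed_dates)
    raise ValueError(f"unknown question type: {question_type}")
```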
5. Experimental Findings at Scale
Empirical results (Table 1, Figure 1 of (Bertsch et al., 4 Nov 2025)) demonstrate that current frontier models show substantial accuracy decline as context length increases:
| Model | 8K | 32K | 64K | 128K |
|---|---|---|---|---|
| GPT-5 | 85.6% | 76.1% | 61.2% | 46.4% |
| Gemini-2.5-Pro | — | — | — | <50% |
| Claude-Sonnet-4 | — | — | — | <50% |
- No model exceeds 50% accuracy at 128K tokens.
- The random baseline scores substantially lower, confirming that the benchmark demands genuine reasoning rather than chance-level guessing; nevertheless, models scale poorly on aggregation and memory retention.
Key observed failure modes include:
- Output Budget Exhaustion: “Running out of tokens” during chain-of-thought completion, leading to premature or incomplete aggregation steps.
- Refusals and Output Truncation: Overly cautious self-doubt (especially in Deepseek-R1) or total truncation if output-length or filtering constraints are triggered (notably Gemini).
These pathologies underscore specific weaknesses in memory and output-control mechanisms within current LLM architectures.
6. Ablation Studies and Question-Type Difficulty
Systematic ablations reveal the bounds of current capabilities:
- Chain-of-thought (CoT) Intensity: Adjusting reasoning-effort parameters produces only marginal improvement at shorter contexts and offers no benefit at the longest context lengths, implicating global memory, rather than local reasoning prowess, as the primary bottleneck.
- Providing Gold Labels: Annotating each example with its true label $y_i$ (removing the need for model classification) yields at most an 11-percentage-point gain in accuracy, and typically much less. This suggests that aggregation, not classification, is the dominant challenge at scale.
- Short-context regime: Even at 1K–8K tokens (dozens of examples), no model surpasses 85% accuracy, indicating that aggregation remains non-trivial even with modest windows.
- Question-Type Breakdown (Figure 2): Temporal aggregation (“before/after date X,” monthly comparisons) emerges as the most challenging, followed by user-conditional questions; pure counting tasks are comparatively the easiest. Outputs requiring precise month- or date-level reporting show the largest gaps between leading and sub-leading models.
7. Current Limitations and Directions for Future Research
Oolong-synth exposes a clear divergence between two performance regimes:
- Local Subtask Competence: Many LLMs attain 90% accuracy on single-example labeling under standard in-context learning.
- Global Multistep Aggregation: When tasked with aggregating outputs in long contexts, performance drops precipitously, revealing a fundamental bottleneck in memory management and aggregation mechanisms.
The following strategies are proposed for advancing long-context reasoning:
- Modular or Retrieval-Augmented Decomposition: Filtering to isolate relevant examples before external aggregation (e.g., binary filters followed by summary operations).
- Memory-Efficient Joint Reasoning: Architectures or prompting schemes that process windows in 1–2K-token chunks, accumulate intermediate summaries, and combine them hierarchically to mitigate memory constraints (see the sketch after this list).
- Learned Chunk-Prioritization Policies: Allocating computation to the most promising context segments rather than exhaustively evaluating all inputs.
- Aggregation-Explicit Training: Fine-tuning protocols that explicitly incentivize correct long-range aggregation, potentially using synthetic settings analogous to Oolong-synth.
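A minimal sketch of the chunked, hierarchical aggregation idea appears below. `classify_chunk` is a hypothetical stand-in for an LLM call over a small chunk, and the chunk size is illustrative.

```python
from collections import Counter
from typing import Callable

def chunked_label_counts(examples: list[str],
                         classify_chunk: Callable[[list[str]], Counter],
                         chunk_size: int = 50) -> Counter:
    """Hierarchical aggregation: classify small chunks locally, then merge
    the intermediate label counts instead of reasoning over the full window.

    `classify_chunk` is a hypothetical stand-in for a model call that labels
    each example in a small chunk and returns a Counter of labels."""
    totals: Counter = Counter()
    for start in range(0, len(examples), chunk_size):
        chunk = examples[start:start + chunk_size]
        totals += classify_chunk(chunk)   # merge intermediate summaries
    return totals

# Distributional questions ("Which label is most frequent?") then reduce to
# operations over the merged counts, e.g. totals.most_common(1).
```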
By systematically quantifying model breakdowns across context size, question structure, and answer types, Oolong-synth establishes a robust framework for diagnosing and improving long-context reasoning and aggregation in future LLMs.