
Too Long, Didn't Model: Decomposing LLM Long-Context Understanding With Novels (2505.14925v1)

Published 20 May 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Although the context length of LLMs has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Inspired by work on computational novel analysis, we release the Too Long, Didn't Model (TLDM) benchmark, which tests a model's ability to report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest LLM developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long-context scenarios. To aid in further development we release the TLDM benchmark together with reference code and data.

Authors (4)
  1. Sil Hamilton (7 papers)
  2. Rebecca M. M. Hicke (10 papers)
  3. Matthew Wilkens (7 papers)
  4. David Mimno (44 papers)

Summary

Evaluating Long-Context Understanding in LLMs

The paper "Too Long, Didn't Model: Decomposing LLM Long-Context Understanding with Novels" presents a novel approach to evaluate the ability of LLMs to process and understand extended contexts. This investigation emerges from the observation that while LLMs can handle inputs comprising millions of tokens, their capability to integrate and utilize information from long contexts remains questionable.

TLDM Benchmark

This work introduces the Too Long, Didn't Model (TLDM) benchmark, which assesses LLMs' comprehension of the complex, lengthy narratives exemplified by novels. The benchmark comprises narrative tasks covering plot summarization, storyworld configuration, and elapsed narrative time estimation. The authors use forty English-language novels, with token counts ranging from under 32k to over 128k, to probe distinct dimensions of narrative understanding.
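To make the task structure concrete, the following is a minimal Python sketch of how a TLDM-style evaluation item and the length bucketing described above might be represented; the class, field, and function names are illustrative assumptions, not the schema of the released benchmark.

```python
from dataclasses import dataclass

# Hypothetical representation of a single TLDM-style evaluation item.
# Field names are illustrative; the released benchmark defines its own schema.
@dataclass
class TLDMItem:
    novel_id: str
    text: str        # full novel text supplied as the model's context
    task: str        # e.g. "plot_summary", "storyworld_config", "narrative_time"
    reference: str   # gold answer used for scoring the model's response


def length_bucket(num_tokens: int) -> str:
    """Bucket a novel by token count, mirroring the <32k to >128k range above."""
    if num_tokens < 32_000:
        return "<32k"
    if num_tokens < 64_000:
        return "32k-64k"
    if num_tokens < 128_000:
        return "64k-128k"
    return ">128k"
```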

Findings

The evaluation covers seven frontier LLMs, some advertising context windows of up to 10 million tokens; nevertheless, the findings indicate that stable comprehension deteriorates past 64k tokens. This marks a threshold beyond which the tested models struggle to maintain coherence and accuracy on the narrative tasks. The results also show a correlation between model size and long-context performance, with larger models integrating information across the text more reliably.

Experimental Treatments

The study applies targeted treatments, such as shuffling chapters and truncating texts, to measure their effect on model performance in long-context scenarios. Performance generally declines with longer texts and with shuffled contexts, pointing to limitations in how the models track sequential narrative structure.
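As a rough illustration, the treatments could be implemented along the following lines; the chapter segmentation, tokenization, and function names here are assumptions for exposition, not the authors' reference code.

```python
import random

def shuffle_chapters(chapters: list[str], seed: int = 0) -> str:
    """Return the novel with its chapters concatenated in a random order
    (the chapter-shuffling treatment described above)."""
    rng = random.Random(seed)
    order = list(range(len(chapters)))
    rng.shuffle(order)
    return "\n\n".join(chapters[i] for i in order)

def truncate_to_budget(tokens: list[str], budget: int) -> list[str]:
    """Keep only the first `budget` tokens (the truncation treatment),
    so the same novel can be evaluated at shorter context lengths."""
    return tokens[:budget]
```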

Theoretical and Practical Implications

The paper underscores the need for LLM developers to move beyond conventional benchmarks and build models with robust mechanisms for long-range semantic integration. Theoretically, the results bear on the design of architectures whose depth of understanding scales with input length. Practically, better long-range context processing would benefit applications involving extensive textual data, such as comprehensive document analysis and nuanced storytelling.

Speculations on Future AI Developments

Continued advancements in understanding how LLMs process long contexts will lay the groundwork for more sophisticated AI models, with enhanced narrative comprehension. These models may leverage mechanistic interpretability to emulate human-like state tracking and story integration functions, potentially transforming AI applications in domains requiring intricate and extended data assimilation.

Conclusion

This paper provides a substantive contribution to the discourse on LLMs' narrative comprehension capabilities. The TLDM benchmark represents an essential step in probing the boundaries of current model methodologies. As the field progresses, addressing the deficits in long-context understanding will be pivotal to realizing the full potential of AI's narrative processing power.
