RULER: What's the Real Context Size of Your Long-Context Language Models? (2404.06654v3)

Published 9 Apr 2024 in cs.CL

Abstract: The needle-in-a-haystack (NIAH) test, which examines the ability to retrieve a piece of information (the "needle") from long distractor texts (the "haystack"), has been widely adopted to evaluate long-context LLMs (LMs). However, this simple retrieval-based test is indicative of only a superficial form of long-context understanding. To provide a more comprehensive evaluation of long-context LMs, we create a new synthetic benchmark RULER with flexible configurations for customized sequence length and task complexity. RULER expands upon the vanilla NIAH test to encompass variations with diverse types and quantities of needles. Moreover, RULER introduces new task categories multi-hop tracing and aggregation to test behaviors beyond searching from context. We evaluate 17 long-context LMs with 13 representative tasks in RULER. Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases. While these models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K. Our analysis of Yi-34B, which supports context length of 200K, reveals large room for improvement as we increase input length and task complexity. We open source RULER to spur comprehensive evaluation of long-context LMs.

Expanding Long-Context Evaluation: Introducing RULER for Comprehensive LLM Analysis

Overview of the RULER Benchmark

Researchers have developed RULER, a synthetic benchmark designed for comprehensive evaluation of long-context LLMs (LMs). RULER advances beyond the traditional needle-in-a-haystack (NIAH) test by encompassing a wider range of tasks that evaluate not only retrieval but also multi-hop tracing, aggregation, and question answering within extended contexts. The benchmark is tailored to dissect long-context LMs' behavior in scenarios that demand nuanced understanding and manipulation of context, addressing a gap in existing evaluation methodologies.

Task Categories in RULER

RULER comprises tasks grouped into four categories, each designed to probe a different aspect of long-context LMs; a sketch of how a retrieval instance can be generated appears after the list:

  1. Retrieval: Beyond the standard NIAH test, this category assesses models' abilities to retrieve information under various complexities, including the presence of distractors and the requirement to recall multiple related items.
  2. Multi-hop Tracing: Tasks such as variable tracking evaluate models' capacity to follow coreference chains and track entities over extended texts.
  3. Aggregation: Through tasks such as common words extraction (CWE) and frequent words extraction (FWE), this category probes models' abilities to synthesize and summarize information spread across large swaths of text.
  4. Question Answering: By inserting distracting information into inputs from existing short-context QA datasets, this category examines how well models extract relevant answers from lengthy contexts.
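
To make the task design concrete, here is a minimal sketch of how a multi-key retrieval instance in the spirit of RULER's NIAH variants can be generated. The key/value templates, filler text, and function names are illustrative assumptions, not the exact templates shipped with the benchmark.

```python
import random
import string

FILLER = "The grass is green. The sky is blue. The sun is yellow."

def make_multikey_niah(num_keys=4, approx_words=2000, seed=0):
    """Build one multi-key NIAH prompt together with its expected answer."""
    rng = random.Random(seed)

    # Distinct keys, each paired with a random 8-digit "magic number".
    keys = [f"needle-key-{i}" for i in range(num_keys)]
    values = {key: "".join(rng.choices(string.digits, k=8)) for key in keys}
    needles = [f"The special magic number for {key} is {values[key]}." for key in keys]

    # Haystack of filler sentences, roughly approx_words words long.
    repeats = approx_words // len(FILLER.split()) + 1
    sentences = [s.strip() + "." for s in (FILLER * repeats).split(".") if s.strip()]

    # Drop each needle at a random position among the filler sentences.
    for needle in needles:
        sentences.insert(rng.randrange(len(sentences) + 1), needle)

    # Ask about one key; the model must retrieve its value despite the distractor keys.
    query = rng.choice(keys)
    prompt = " ".join(sentences) + f"\nWhat is the special magic number for {query}?"
    return prompt, values[query]

prompt, answer = make_multikey_niah(num_keys=4, approx_words=2000)
```

Because every instance is generated programmatically, sequence length (via the amount of filler text) and difficulty (via the number of keys and needles) can be dialed independently, which is what the benchmark's flexible configurations refer to.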

Evaluation and Insights

The evaluation covered 17 long-context LMs across RULER's 13 representative tasks. Results showed notable performance degradation on the more complex tasks as context length increased, even among models claiming context sizes of 32K tokens or greater. Only about half of the models maintained satisfactory performance at 32K, with GPT-4, Command-R, Yi-34B, and Mixtral among the stronger performers.
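
RULER-style tasks are typically scored by checking whether the expected answer strings appear in the model's output; the sketch below is a hedged approximation of such a recall-based matcher (the function name and normalization are assumptions, and the released evaluation code may differ).

```python
def match_score(model_output: str, references: list[str]) -> float:
    """Fraction of expected reference strings found verbatim in the model output."""
    if not references:
        return 0.0
    hits = sum(ref in model_output for ref in references)
    return hits / len(references)

# Example: a multi-value retrieval instance with two expected values.
print(match_score("The numbers are 12345678 and 87654321.", ["12345678", "87654321"]))  # 1.0
print(match_score("The number is 12345678.", ["12345678", "87654321"]))                 # 0.5
```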

A detailed examination of Yi-34B, which claims a context length of 200K, revealed substantial room for improvement as input length and task complexity increase. The analysis surfaced trends such as increased reliance on parametric knowledge and a tendency to copy content directly from the context in non-retrieval tasks, highlighting key areas for future work in long-context modeling.

Theoretical and Practical Implications

RULER's introduction and the findings from its application underscore the evolving trajectory of long-context understanding in LMs. The nuanced testing framework it proposes moves beyond mere retrieval, opening avenues for exploring how LMs assimilate, recall, and synthesize information across expansive texts. The benchmark's synthetic nature affords crucial advantages, including reduced dependence on pre-existing parametric knowledge and fine-grained control over sequence length and task complexity, as illustrated by the sketch below.
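
As an illustration of that configurability, a multi-hop variable-tracking instance can be parameterized by the number of hops in the coreference chain and by the amount of distractor text. The generator below is a hypothetical sketch in that spirit; the variable-naming scheme and templates are assumptions rather than the benchmark's actual implementation.

```python
import random

def make_variable_tracking(num_hops=4, num_distractors=30, seed=0):
    """Chain of variable assignments (X1 = value, X2 = X1, ...) hidden among distractors."""
    rng = random.Random(seed)
    value = str(rng.randrange(10000, 100000))

    # Coreference chain: X1 receives the value, each later variable copies the previous one.
    chain = [f"VAR X1 = {value}."]
    chain += [f"VAR X{i} = VAR X{i - 1}." for i in range(2, num_hops + 1)]

    # Unrelated assignments; more distractors means a longer, noisier context.
    distractors = [f"VAR Y{i} = {rng.randrange(10000, 100000)}."
                   for i in range(num_distractors)]

    # Interleave distractors at random positions while preserving the chain order.
    lines = list(chain)
    for d in distractors:
        lines.insert(rng.randrange(len(lines) + 1), d)

    question = f"Which variables are assigned the value {value}?"
    answer = [f"X{i}" for i in range(1, num_hops + 1)]
    return "\n".join(lines) + "\n" + question, answer

# Increasing num_hops raises task complexity; increasing num_distractors lengthens the context.
context, expected = make_variable_tracking(num_hops=5, num_distractors=100)
```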

Future Directions in AI

The insights gleaned from RULER point towards several future directions. One immediate area is optimizing models for stronger performance across the new benchmark's tasks, particularly addressing weaknesses in aggregation and multi-hop tracing. Additionally, the demonstrated need for models to handle longer contexts without resorting to copying from the context suggests an avenue for architectural innovation. Finally, evaluating non-Transformer architectures within this rigorous testing framework highlights the potential for diverse model designs to improve long-context performance.

RULER is open-sourced, encouraging further experimentation and adaptation. Its creation marks a significant step towards a more holistic understanding of long-context capabilities in LMs, promising to guide the next wave of advancements in generative AI.

Authors (8)
  1. Cheng-Ping Hsieh
  2. Simeng Sun
  3. Samuel Kriman
  4. Shantanu Acharya
  5. Dima Rekesh
  6. Fei Jia
  7. Boris Ginsburg
  8. Yang Zhang