$\infty$Bench: Extending Long Context Evaluation Beyond 100K Tokens (2402.13718v3)

Published 21 Feb 2024 in cs.CL

Abstract: Processing and reasoning over long contexts is crucial for many practical applications of LLMs, such as document comprehension and agent construction. Despite recent strides in making LLMs process contexts of more than 100K tokens, there is currently no standardized benchmark for evaluating this long-context capability. Existing public benchmarks typically focus on contexts around 10K tokens, limiting the assessment and comparison of LLMs on longer inputs. In this paper, we propose $\infty$Bench, the first LLM benchmark featuring an average data length surpassing 100K tokens. $\infty$Bench comprises synthetic and realistic tasks spanning diverse domains, presented in both English and Chinese. The tasks are designed to require an understanding of long-range dependencies in the context, so that simply retrieving a limited number of passages is not sufficient to solve them. In our experiments based on $\infty$Bench, we evaluate state-of-the-art proprietary and open-source LLMs tailored for processing long contexts. The results indicate that existing long-context LLMs still require significant advancements to process 100K+ token contexts effectively. We further present three intriguing analyses of the behavior of LLMs when processing long contexts.

Introducing ∞Bench: A New Benchmark for Evaluating Long-Context Processing in LLMs

Overview

The rapid advancement of LLMs has significantly improved their performance across a wide range of NLP tasks. Yet effectively processing and reasoning over contexts exceeding 100K tokens remains a challenge, underscoring the need for benchmarks tailored to evaluating this capability. This paper presents ∞Bench, the first benchmark whose average data length exceeds 100K tokens, featuring a diverse set of tasks across multiple domains and two languages, aimed at probing how well LLMs handle long-context information.

Existing Benchmarks and Their Limitations

Prior benchmarks have predominantly focused on contexts of around 10K tokens, limiting the evaluation of LLMs' ability to process and understand significantly longer texts. In contrast, ∞Bench stands out not only for its unprecedented average data length but also for including tasks in both English and Chinese, spanning domains such as novels, code, mathematics, and dialogue.

Evolving the Evaluation of Long-Context LLMs

∞Bench addresses the need for standardized evaluation of long-context processing by integrating tasks that challenge models to go beyond simple passage retrieval. These tasks, both synthetic and human-annotated, are designed to test LLMs' depth of understanding and reasoning over complex, lengthy contexts. The benchmark aims to simulate real-world applications that require comprehensive understanding, such as document summarization, question answering, and code debugging, all over inputs of more than 100K tokens.

Task Design and Implementation

The tasks within ∞Bench cover a wide spectrum and are categorized as realistic or synthetic. Realistic tasks draw on domains such as novel-based reasoning and dialogue analysis, while synthetic tasks probe models on artificial constructs, such as retrieving numbers hidden deep in long distractor text or carrying out long chains of sequential computation. This blend of task types is intended to stress the long-context capabilities of LLMs under varied, demanding scenarios; a minimal sketch of how such a synthetic retrieval example could be constructed is shown below.
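To make the synthetic task category concrete, the following is a minimal sketch of a number-retrieval example generator in the spirit of the benchmark's synthetic tasks: a key-value "needle" is buried in long filler text and the model must return the value. This is an illustration only, not the authors' generation code; the function name, the word-per-token heuristic, and the random-filler design are all assumptions.

```python
import random
import string

def make_number_retrieval_example(target_tokens: int = 100_000, seed: int = 0) -> dict:
    """Hide a key-value "needle" inside long filler text; the model must
    return the value given the key. Toy illustration of a synthetic
    long-context retrieval task, not the benchmark's actual generator."""
    rng = random.Random(seed)

    def filler_word() -> str:
        return "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 8)))

    # Rough sizing heuristic (assumption): about 0.75 filler words per target token.
    n_words = int(target_tokens * 0.75)
    words = [filler_word() for _ in range(n_words)]

    key = f"KEY-{rng.randint(10_000, 99_999)}"
    value = str(rng.randint(1_000_000, 9_999_999))
    needle = f"The value associated with {key} is {value}."
    words.insert(rng.randint(0, len(words)), needle)  # hide the needle at a random position

    context = " ".join(words)
    question = f"What is the value associated with {key}? Answer with the number only."
    return {"context": context, "question": question, "answer": value}

if __name__ == "__main__":
    example = make_number_retrieval_example(target_tokens=1_000)  # small size for a quick check
    print(example["question"], "->", example["answer"])
```

Because the answer is a single exact string at a random position, scoring reduces to exact match, which is one reason synthetic tasks of this kind are attractive for probing very long contexts.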

Experimental Insights

Experimental evaluations using ∞Bench reveal the current state and limitations of state-of-the-art LLMs in processing extended contexts. Performance degrades as context length increases, pointing to the need for methods that improve both long-context understanding and processing efficiency. The length-ablation and context-recalling experiments, in which inputs are shortened or models are prompted to recall specific content from the context, suggest promising directions for future research and model improvement; a sketch of what a length-ablation loop might look like is given below.
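As a rough illustration of a length-ablation evaluation, the sketch below truncates each example's context to several size budgets and measures exact-match accuracy at each budget. It assumes a generic `query_model` callable (prompt in, text out) and examples with `context`, `question`, and `answer` fields; budgets are in characters here to avoid a tokenizer dependency, whereas the paper's experiments operate on token counts.

```python
from typing import Callable, Dict, List

def truncate_keep_ends(text: str, max_chars: int) -> str:
    """Keep the beginning and end of the context and drop the middle --
    one common way to shorten inputs in a length-ablation study."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + "\n...\n" + text[-half:]

def length_ablation(
    examples: List[Dict[str, str]],
    query_model: Callable[[str], str],  # assumed interface: prompt in, model text out
    budgets_chars: List[int],
) -> Dict[int, float]:
    """Exact-match accuracy at each context-length budget."""
    results: Dict[int, float] = {}
    for budget in budgets_chars:
        correct = 0
        for ex in examples:
            context = truncate_keep_ends(ex["context"], budget)
            prompt = f"{context}\n\nQuestion: {ex['question']}\nAnswer:"
            prediction = query_model(prompt).strip()
            correct += int(prediction == ex["answer"].strip())
        results[budget] = correct / max(len(examples), 1)
    return results
```

Plotting accuracy against budget in a loop like this is one way to expose the degradation with growing context length that the paper reports.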

The Road Ahead

The introduction of ∞Bench marks a crucial step towards refining the assessment of LLMs in handling long contexts. The insights garnered from the benchmark's initial deployment highlight the necessity for continued innovation in model architecture and training methodologies. As LLMs evolve, so too must the benchmarks used to evaluate their capabilities, ensuring they remain relevant and challenging in the face of rapid technological advancement.

Acknowledging Constraints

While ∞Bench provides a novel approach to evaluating LLMs, its scope, like all benchmarks, is confined by the selection and design of its tasks. Future iterations may need to explore even longer contexts or diversify the task domains further to offer a more exhaustive assessment of LLM capabilities. Additionally, the precise impact of benchmark constraints and scoring criteria on model evaluation warrants careful consideration.

Ethical Considerations

In developing and deploying ∞Bench, careful attention to ethical implications is essential. Efforts to mitigate bias, misuse, and sensitive content in the tasks underscore the balance between advancing AI capabilities and ensuring responsible development practices.

∞Bench represents a significant advancement in the field of LLM evaluation, offering a rigorous, comprehensive benchmark designed specifically for the assessment of long-context processing. The insights and challenges highlighted by this benchmark pave the way for future research and development aimed at unlocking the full potential of LLMs in understanding and reasoning over extensive texts.

Authors (11)
  1. Xinrong Zhang
  2. Yingfa Chen
  3. Shengding Hu
  4. Zihang Xu
  5. Junhao Chen
  6. Moo Khai Hao
  7. Xu Han
  8. Zhen Leng Thai
  9. Shuo Wang
  10. Zhiyuan Liu
  11. Maosong Sun