Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks (2404.06480v2)
Abstract: Recently, the LLM community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets from open-source datasets, focusing mainly on QA and summarization tasks. These datasets mix together test samples of varying lengths (from 2k to 32k+ tokens), making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to support. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long-context capabilities. These benchmarks support fine-grained control over the length of test cases and can easily produce text samples of up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval.
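The abstract's key design point is length adaptability: a test case of any target length can be assembled on demand. The snippet below is a hypothetical sketch of how a TSort-style sample (shuffled text segments whose original order must be recovered) could be built at a chosen length. It is not Ada-LEval's actual implementation; the function name, character-based length control, and segment count are all assumptions for illustration (the real benchmark works in tokens and draws segments from books).

```python
import random

def build_tsort_sample(paragraphs, target_len, num_segments=4, seed=0):
    """Assemble a TSort-style test case of roughly `target_len` characters.

    Hypothetical sketch only: Ada-LEval's actual construction counts
    tokens rather than characters and uses different source material.
    """
    rng = random.Random(seed)
    # Greedily take paragraphs until the target length is reached,
    # so the same source text yields samples of any desired size.
    taken, length = [], 0
    for p in paragraphs:
        if length >= target_len:
            break
        taken.append(p)
        length += len(p)
    joined = "\n".join(taken)
    # Split the assembled text into equal-sized segments.
    step = max(1, len(joined) // num_segments)
    segments = [joined[i * step:(i + 1) * step] for i in range(num_segments)]
    # Shuffle the segments; the gold answer is the inverse permutation,
    # i.e. answer[k] = position of original segment k in the shuffled list.
    order = list(range(num_segments))
    rng.shuffle(order)
    shuffled = [segments[i] for i in order]
    answer = sorted(range(num_segments), key=lambda i: order[i])
    return shuffled, answer
```

Because the target length is a free parameter, the same pipeline can emit 2k-token and 128k-token test cases from identical source material, which is what lets the benchmark isolate length as the experimental variable.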
- L-Eval: Instituting standardized evaluation for long-context language models. arXiv preprint arXiv:2307.11088.
- LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
- InternLM2 technical report.
- Walking down the memory maze: Beyond context limit through interactive reading. arXiv preprint arXiv:2310.05029.
- Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- OpenCompass Contributors. 2023. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
- Hungry Hungry Hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052.
- A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.
- LongNet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
- LongT5: Efficient text-to-text transformer for long sequences. arXiv preprint arXiv:2112.07916.
- Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112.
- C-Eval: A multi-level multi-discipline Chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.
- BookSum: A collection of datasets for long-form narrative summarization. arXiv preprint arXiv:2105.08209.
- How long can open-source LLMs truly promise on context length?
- Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172.
- RepoBench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- OpenAI. 2023. GPT-4 technical report.
- Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
- SCROLLS: Standardized comparison over long language sequences. arXiv preprint arXiv:2201.03533.
- Jianlin Su. 2023. Rectified rotary position embeddings. https://github.com/bojone/rerope.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
- PEARL: Prompting large language models to plan and execute actions over long documents. arXiv preprint arXiv:2305.14564.
- A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554.
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33:17283–17297.
- GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.
- Chonghua Wang
- Haodong Duan
- Songyang Zhang
- Dahua Lin
- Kai Chen