
L-Eval: Instituting Standardized Evaluation for Long Context Language Models (2307.11088v3)

Published 20 Jul 2023 in cs.CL

Abstract: Recently, there has been growing interest in extending the context length of LLMs, aiming to effectively process long inputs of one turn or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve the reasoning ability in an extended context, open-source models are still progressing through the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context LLMs (LCLMs) addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k to 200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally cannot correlate well with human judgment, and thus we strongly advocate for length-instruction-enhanced (LIE) evaluation and employing LLM judges. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of more principled evaluation of these models.

Overview of L-Eval: Instituting Standardized Evaluation for Long Context LLMs

The paper "L-Eval: Instituting Standardized Evaluation for Long Context LLMs" addresses a prominent challenge in the field of LLMs: extending the context length to effectively process long inputs in conversational or single-turn scenarios. Recognizing the strides made by proprietary models such as GPT-4 and Claude in maintaining reasoning capabilities with extended contexts, this work seeks to enhance open-source models by bridging the evaluation gap. The key proposal is an advanced evaluation benchmark for long context LLMs (LCLMs), termed L-Eval, that encompasses diverse datasets and metrics tailored to this emerging area.

Contributions

The paper's primary contribution is the creation of the L-Eval benchmark, which includes two core aspects:

  1. Dataset Construction: L-Eval offers a comprehensive evaluation suite with 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. It covers varied question styles, domains, and input lengths from 3,000 to 200,000 tokens. The datasets fall into two task types: closed-ended tasks focused on reasoning and understanding, and open-ended tasks centered on document summarization.
  2. Evaluation Metrics: The authors critically examine the shortcomings of conventional n-gram matching metrics, which correlate poorly with human judgment. They advocate length-instruction-enhanced (LIE) evaluation alongside LLM judges to better align automatic scores with human evaluations; the improved metrics achieve higher Kendall-Tau correlation with human judgments, underscoring their utility. A minimal sketch of both ideas follows this list.
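
To make these two ideas concrete, here is a minimal Python sketch, assuming an illustrative prompt wording and made-up per-sample scores rather than the paper's actual data: it appends the reference answer's length to the question (the length-instruction-enhanced idea) and measures how well an automatic metric rank-correlates with human ratings via Kendall's tau.

```python
from scipy.stats import kendalltau

def lie_prompt(question: str, reference_answer: str) -> str:
    """Length-instruction-enhanced (LIE) prompt: tell the model roughly how
    long the ground-truth answer is, so that length mismatch does not dominate
    n-gram metrics such as ROUGE. The exact wording here is an assumption."""
    n_words = len(reference_answer.split())
    return f"{question}\nAnswer this question in about {n_words} words."

# Hypothetical per-sample scores from an automatic metric (e.g. ROUGE-L)
# and 1-5 ratings from human annotators on the same model outputs.
metric_scores = [0.42, 0.31, 0.57, 0.18, 0.66]
human_scores = [3, 2, 4, 1, 5]

# Kendall's tau measures rank agreement between the metric and human judgment;
# a higher tau means the automatic metric orders outputs more like humans do.
tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```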

Experimental Setup and Findings

The empirical analysis includes evaluations of four popular commercial models and twelve open-source LLMs using the L-Eval benchmark, offering several insights:

  • Performance Comparison: A substantial performance gap remains between open-source and commercial models, especially on closed-ended tasks. Although open-source models have advanced, their ability to reason over and summarize long documents in open-ended tasks also remains limited.
  • Model Shortcomings: Open-source LCLMs often falter in comprehending instructions as input length increases, especially in open-ended tasks. This results in challenges in instruction-following and coherent text generation.
  • Retrieval vs. Full-Context Models: Experiments with GPT-3.5-Turbo highlight that full-context models outperform retrieval-based systems in long-context dependent tasks, suggesting advantages in processing comprehensive input over fragmentary retrieval.
  • Scaled Positional Embeddings: The analysis of scaled positional embeddings reveals mixed outcomes: they improve retrieval performance but can impair reasoning on more intricate tasks (a sketch of the underlying technique follows this list).
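
The "scaled positional embeddings" in the last point refer to techniques such as linear position interpolation, which compress position indices so that a rotary-embedding model trained on short sequences can attend over longer inputs. The sketch below shows that general technique under assumed defaults (`train_len=2048`, RoPE base 10000); it is not the specific implementation evaluated in the paper.

```python
import torch

def interpolated_rope_angles(seq_len: int, head_dim: int,
                             train_len: int = 2048, base: float = 10000.0):
    """Linear position interpolation: scale positions by train_len/seq_len so
    an extended context maps back into the position range the model saw during
    training. train_len and base are illustrative defaults, not paper settings."""
    scale = min(1.0, train_len / seq_len)
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    # Rotation angle for each (position, frequency) pair used by RoPE.
    return torch.outer(positions, inv_freq)

# Example: a model trained on 2k tokens evaluated on an 8k-token input.
angles = interpolated_rope_angles(seq_len=8192, head_dim=128)
print(angles.shape)  # torch.Size([8192, 64])
```

Keeping every position index inside the training range in this way tends to help simple lookup-style behavior, which is consistent with the mixed retrieval-versus-reasoning outcomes noted above.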

Implications and Future Directions

The paper establishes a foundation for the systematic evaluation and development of LCLMs. By providing a robust benchmark, it sets the stage for innovations in model architectures and evaluation techniques, emphasizing holistic context comprehension and instruction adherence.

Looking ahead, the paper raises open questions about how to refine LLM architectures to reduce instruction-following errors in long-context settings, and how to incorporate a broader range of real-world applications into evaluation suites.

In conclusion, "L-Eval" significantly contributes to standardized assessments in LCLMs, offering a structured path for refining and benchmarking long context processing capabilities. The findings and proposals laid out in this work will likely inform the next wave of advancements in text generation models as they continue to evolve.

Authors (8)
  1. Chenxin An (17 papers)
  2. Shansan Gong (14 papers)
  3. Ming Zhong (88 papers)
  4. Xingjian Zhao (4 papers)
  5. Mukai Li (17 papers)
  6. Jun Zhang (1008 papers)
  7. Lingpeng Kong (134 papers)
  8. Xipeng Qiu (257 papers)
Citations (99)