
L-Eval: Instituting Standardized Evaluation for Long Context Language Models (2307.11088v3)

Published 20 Jul 2023 in cs.CL

Abstract: Recently, there has been growing interest in extending the context length of LLMs, aiming to effectively process long inputs of one turn or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve the reasoning ability in an extended context, open-source models are still progressing through the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context LLMs (LCLMs) addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k to 200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally cannot correlate well with human judgment, and thus we strongly advocate for length-instruction-enhanced (LIE) evaluation and employing LLM judges. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of more principled evaluation of these models.

Overview of L-Eval: Instituting Standardized Evaluation for Long Context LLMs

The paper "L-Eval: Instituting Standardized Evaluation for Long Context LLMs" addresses a prominent challenge in the field of LLMs: extending the context length to effectively process long inputs in conversational or single-turn scenarios. Recognizing the strides made by proprietary models such as GPT-4 and Claude in maintaining reasoning capabilities with extended contexts, this work seeks to enhance open-source models by bridging the evaluation gap. The key proposal is an advanced evaluation benchmark for long context LLMs (LCLMs), termed L-Eval, that encompasses diverse datasets and metrics tailored to this emerging area.

Contributions

The paper's primary contribution is the creation of the L-Eval benchmark, which includes two core aspects:

  1. Dataset Construction: L-Eval offers a comprehensive evaluation suite with 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. It covers varied question styles, domains, and input lengths from 3,000 to 200,000 tokens. The datasets fall into two task types: closed-ended tasks focused on reasoning and understanding, and open-ended tasks centered on document summarization.
  2. Evaluation Metrics: The authors critically examine the shortcomings of conventional n-gram matching metrics, which correlate poorly with human judgment. They advocate length-instruction-enhanced (LIE) evaluation alongside LLM judges to better align automatic scores with human evaluations; the improved metrics achieve higher Kendall-Tau correlation with human judgments, underscoring their utility. A minimal sketch of both ideas follows this list.
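
To make these two ideas concrete, here is a minimal Python sketch, assuming an illustrative prompt wording and made-up per-sample scores rather than the paper's actual data: it appends the reference answer's length to the question (the length-instruction-enhanced idea) and measures how well an automatic metric rank-correlates with human ratings via Kendall's tau.

```python
from scipy.stats import kendalltau

def lie_prompt(question: str, reference_answer: str) -> str:
    """Length-instruction-enhanced (LIE) prompt: tell the model roughly how
    long the ground-truth answer is, so that length mismatch does not dominate
    n-gram metrics such as ROUGE. The exact wording here is an assumption."""
    n_words = len(reference_answer.split())
    return f"{question}\nAnswer this question in about {n_words} words."

# Hypothetical per-sample scores from an automatic metric (e.g. ROUGE-L)
# and 1-5 ratings from human annotators on the same model outputs.
metric_scores = [0.42, 0.31, 0.57, 0.18, 0.66]
human_scores = [3, 2, 4, 1, 5]

# Kendall's tau measures rank agreement between the metric and human judgment;
# a higher tau means the automatic metric orders outputs more like humans do.
tau, p_value = kendalltau(metric_scores, human_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```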

Experimental Setup and Findings

The empirical analysis includes evaluations of four popular commercial models and twelve open-source LLMs using the L-Eval benchmark, offering several insights:

  • Performance Comparison: A substantial performance gap remains between open-source and commercial models, especially on closed-ended tasks. Although open-source models have advanced, their ability to reason over and summarize long documents in open-ended tasks also remains limited.
  • Model Shortcomings: Open-source LCLMs often falter in comprehending instructions as input length increases, especially in open-ended tasks. This results in challenges in instruction-following and coherent text generation.
  • Retrieval vs. Full-Context Models: Experiments with GPT-3.5-Turbo highlight that full-context models outperform retrieval-based systems in long-context dependent tasks, suggesting advantages in processing comprehensive input over fragmentary retrieval.
  • Scaled Positional Embeddings: The analysis of scaled positional embeddings reveals mixed outcomes: they improve retrieval performance but can impair reasoning on more intricate tasks (a sketch of the underlying technique follows this list).
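
The "scaled positional embeddings" in the last point refer to techniques such as linear position interpolation, which compress position indices so that a rotary-embedding model trained on short sequences can attend over longer inputs. The sketch below shows that general technique under assumed defaults (`train_len=2048`, RoPE base 10000); it is not the specific implementation evaluated in the paper.

```python
import torch

def interpolated_rope_angles(seq_len: int, head_dim: int,
                             train_len: int = 2048, base: float = 10000.0):
    """Linear position interpolation: scale positions by train_len/seq_len so
    an extended context maps back into the position range the model saw during
    training. train_len and base are illustrative defaults, not paper settings."""
    scale = min(1.0, train_len / seq_len)
    positions = torch.arange(seq_len, dtype=torch.float32) * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    # Rotation angle for each (position, frequency) pair used by RoPE.
    return torch.outer(positions, inv_freq)

# Example: a model trained on 2k tokens evaluated on an 8k-token input.
angles = interpolated_rope_angles(seq_len=8192, head_dim=128)
print(angles.shape)  # torch.Size([8192, 64])
```

Keeping every position index inside the training range in this way tends to help simple lookup-style behavior, which is consistent with the mixed retrieval-versus-reasoning outcomes noted above.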

Implications and Future Directions

The paper establishes a foundation for the systematic evaluation and development of LCLMs. By providing a robust benchmark, it sets the stage for innovations in model architectures and evaluation techniques, emphasizing holistic context comprehension and instruction adherence.

Looking ahead, the paper raises open questions about how to refine LLM architectures to reduce instruction-following errors in long-context settings, and how to incorporate a broader range of real-world applications into evaluation suites.

In conclusion, "L-Eval" significantly contributes to standardized assessments in LCLMs, offering a structured path for refining and benchmarking long context processing capabilities. The findings and proposals laid out in this work will likely inform the next wave of advancements in text generation models as they continue to evolve.

Authors (8)
  1. Chenxin An (17 papers)
  2. Shansan Gong (14 papers)
  3. Ming Zhong (88 papers)
  4. Xingjian Zhao (4 papers)
  5. Mukai Li (17 papers)
  6. Jun Zhang (1008 papers)
  7. Lingpeng Kong (134 papers)
  8. Xipeng Qiu (257 papers)
Citations (99)