HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly (2410.02694v2)

Published 3 Oct 2024 in cs.CL and cs.AI

Abstract: There have been many benchmarks for evaluating long-context LLMs (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrary subsets of tasks. It remains unclear whether they translate to the diverse downstream applications of LCLMs, and the inconsistency further complicates model comparison. We investigate the underlying reasons behind current practices and find that existing benchmarks often provide noisy signals due to low coverage of applications, insufficient lengths, unreliable metrics, and incompatibility with base models. In this work, we present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128k tokens, model-based evaluation for reliable metrics, and few-shot prompting for robustly evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and consistent rankings of frontier LCLMs. Through a comprehensive study of 51 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends and low correlation with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions -- the gap widens with increased lengths. Finally, we recommend using our RAG tasks for fast model development, as they are easy to run and more predictive of other downstream performance; ultimately, we advocate for a holistic evaluation across diverse tasks.

Evaluation of Long-Context LLMs with HELMET

The paper introduces HELMET, a comprehensive benchmark designed to evaluate Long-Context LLMs (LCLMs). Existing benchmarks for LCLMs often rely on tasks such as needle-in-a-haystack (NIAH) or subsets of applications, which may not effectively translate to real-world performance or provide consistent model comparisons. HELMET addresses these problems by offering a diverse, application-centric benchmark that evaluates models across a wide array of categories with controlled settings.

Key Contributions

  1. Benchmark Design:
    • HELMET includes seven diverse categories: synthetic recall, long-document question answering (QA), summarization, many-shot in-context learning (ICL), retrieval-augmented generation (RAG), passage re-ranking, and generation with citations.
    • It supports controllable input lengths up to 128k tokens to accommodate frontier model capabilities (a minimal length-construction sketch appears after this list).
  2. Issues with Current Benchmarks:
    • Existing benchmarks provide low coverage of downstream applications, lean heavily on synthetic tasks, use insufficient input lengths and unreliable metrics, and often assume instruction-tuned models.
    • HELMET improves on these by providing datasets at the necessary lengths, adding few-shot prompting so base models can be evaluated robustly, and introducing model-based evaluation metrics for consistency and reliability (a minimal scoring sketch also appears after this list).
  3. Studying Diverse LCLM Capabilities:
    • A thorough evaluation of 51 LCLMs across various architectures and sizes was conducted.
    • Notable findings include:
      • Synthetic tasks like NIAH are not strong indicators of downstream performance.
      • HELMET’s diverse categories show distinct trends and correlate only weakly with one another.
      • Open-source models significantly underperform closed-source models when tasks require full-context reasoning or following complex instructions, and the gap widens as input lengths increase.
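To make the length-control idea in the benchmark design concrete, the sketch below shows one plausible way to assemble an input at a chosen token budget by padding a gold passage with distractor passages, as in recall- or RAG-style tasks. This is a minimal illustration, not HELMET's actual data pipeline: the function names (`build_context`, `count_tokens`) and the whitespace token counter are assumptions made here to keep the example self-contained; a real setup would count tokens with the evaluated model's tokenizer.

```python
import random

def build_context(gold_passage, distractors, target_tokens, count_tokens, seed=0):
    """Pad a gold passage with shuffled distractor passages until the
    combined context approaches a target token budget (e.g. 8k ... 128k).

    `count_tokens` is any callable returning a token count; in practice this
    would be the evaluated model's tokenizer rather than a whitespace split.
    """
    rng = random.Random(seed)
    pool = distractors[:]
    rng.shuffle(pool)

    passages, used = [gold_passage], count_tokens(gold_passage)
    for d in pool:
        cost = count_tokens(d)
        if used + cost > target_tokens:
            break
        passages.append(d)
        used += cost

    rng.shuffle(passages)          # bury the gold passage at a random depth
    return "\n\n".join(passages), used

if __name__ == "__main__":
    count = lambda text: len(text.split())   # stand-in for a real tokenizer
    gold = "The secret city is Zurich."
    noise = [f"Filler sentence number {i} about nothing in particular." for i in range(10_000)]
    ctx, n = build_context(gold, noise, target_tokens=2_000, count_tokens=count)
    print(n, "tokens in constructed context")
```

Shuffling the gold passage to a random depth mirrors how recall-style tasks commonly vary both the context length and the position of the relevant information.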
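The model-based metrics mentioned in item 2 can be pictured as an LLM-as-a-judge loop over open-ended outputs such as long-document QA or summarization answers. The following is a hedged sketch assuming an OpenAI-compatible chat endpoint; the judge prompt, the `judge_model` default, and the score parsing are illustrative choices, not the prompts or judge models HELMET actually uses.

```python
from openai import OpenAI   # any OpenAI-compatible judge endpoint

client = OpenAI()

JUDGE_PROMPT = """You are grading a model answer against a reference.
Question: {question}
Reference answer: {reference}
Model answer: {answer}
Reply with a single integer from 0 (wrong) to 10 (fully correct and supported)."""

def judge_score(question: str, reference: str, answer: str,
                judge_model: str = "gpt-4o") -> float:
    """Ask a judge model for a 0-10 rating and normalize it to [0, 1]."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,
    )
    text = resp.choices[0].message.content.strip()
    digits = "".join(ch for ch in text if ch.isdigit())
    return min(int(digits or 0), 10) / 10.0
```

Pinning temperature to 0 and normalizing to [0, 1] keeps judge scores comparable across datasets; any judge model or rubric could be substituted.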

Numerical Results and Claims

  • The paper demonstrates that open-source LCLMs lag on tasks requiring full-context reasoning and instruction adherence over long inputs, with the performance gap widening as context lengths grow.
  • HELMET achieves more reliable and consistent model rankings compared to benchmarks relying heavily on synthetic evaluations.

Implications and Future Directions

The paper recommends the RAG category within HELMET for rapid model development, since it is easy to run and more predictive of other downstream performance. The holistic assessment HELMET provides across varied tasks argues for adopting such comprehensive evaluation frameworks in ongoing LCLM development.
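As a rough illustration of how one might check whether RAG scores track the other categories, the snippet below computes a Spearman rank correlation between per-model RAG scores and the mean of the remaining categories. The score table is entirely placeholder data and the category keys are shorthand invented here; real values would come from running HELMET itself.

```python
from scipy.stats import spearmanr

# Placeholder scores per model: {"model": {"rag": ..., "longqa": ..., ...}}
# These numbers are illustrative only, not results from the paper.
scores = {
    "model_a": {"rag": 0.72, "longqa": 0.61, "summ": 0.55, "icl": 0.80},
    "model_b": {"rag": 0.64, "longqa": 0.58, "summ": 0.50, "icl": 0.71},
    "model_c": {"rag": 0.41, "longqa": 0.35, "summ": 0.38, "icl": 0.52},
    "model_d": {"rag": 0.58, "longqa": 0.49, "summ": 0.47, "icl": 0.66},
}

models = sorted(scores)
rag = [scores[m]["rag"] for m in models]
others = [
    sum(v for k, v in scores[m].items() if k != "rag") / (len(scores[m]) - 1)
    for m in models
]

rho, p = spearmanr(rag, others)
print(f"Spearman rho between RAG and mean of other categories: {rho:.2f} (p={p:.2f})")
```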

The research points toward refining the evaluation process for LCLMs, emphasizing holistic and diverse evaluation beyond synthetic tasks. This approach offers a clearer picture of current models' capabilities and weaknesses, laying a foundation for future advances in both model development and evaluation methodology.

Authors (8)
  1. Howard Yen (10 papers)
  2. Tianyu Gao (35 papers)
  3. Minmin Hou (1 paper)
  4. Ke Ding (30 papers)
  5. Daniel Fleischer (9 papers)
  6. Moshe Wasserblat (22 papers)
  7. Danqi Chen (84 papers)
  8. Peter Izsak (10 papers)