
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (2308.14508v2)

Published 28 Aug 2023 in cs.CL

Abstract: Although LLMs demonstrate impressive performance for many language tasks, most of them can only handle texts a few thousand tokens long, limiting their applications on longer sequence inputs, such as books, reports, and codebases. Recent works have proposed methods to improve LLMs' long context capabilities by extending context windows and more sophisticated memory mechanisms. However, comprehensive benchmarks tailored for evaluating long context understanding are lacking. In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long context understanding. LongBench comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese). These tasks cover key long-text application areas including single-doc QA, multi-doc QA, summarization, few-shot learning, synthetic tasks, and code completion. All datasets in LongBench are standardized into a unified format, allowing for effortless automatic evaluation of LLMs. Upon comprehensive evaluation of 8 LLMs on LongBench, we find that: (1) Commercial model (GPT-3.5-Turbo-16k) outperforms other open-sourced models, but still struggles on longer contexts. (2) Scaled position embedding and fine-tuning on longer sequences lead to substantial improvement on long context understanding. (3) Context compression technique such as retrieval brings improvement for model with weak ability on long contexts, but the performance still lags behind models that have strong long context understanding capability. The code and datasets are available at https://github.com/THUDM/LongBench.

Overview

The paper "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding" introduces LongBench, an innovative benchmark specifically designed to evaluate the long context understanding capabilities of LLMs. The benchmark addresses a crucial limitation of current LLMs, which often struggle with processing and understanding texts beyond a few thousand tokens. LongBench represents a significant effort to provide a rigorous framework for assessing LLMs' abilities to handle extended sequences present in books, reports, and codebases, across diverse languages and tasks.

Benchmark Design

LongBench distinguishes itself by its breadth and structure, encompassing 21 datasets across six key task categories: Single-Document QA, Multi-Document QA, Summarization, Few-shot Learning, Synthetic Tasks, and Code Completion. It includes bilingual datasets in both English and Chinese, adding a layer of complexity and comprehensiveness to the evaluation.

  • Single-Doc QA & Multi-Doc QA: These tasks aim to evaluate how well models can extract and integrate information from single or multiple documents.
  • Summarization: This category tests the models' abilities to condense detailed documents into concise summaries, highlighting global context understanding.
  • Few-Shot Learning: Few-shot scenarios test the adaptability of LLMs to leverage minimal examples for various tasks, simulating practical constraints.
  • Synthetic Tasks: These controlled tasks focus on specific long-context dependencies, offering insights into the models' internal representations and scaling behavior.
  • Code Completion: By introducing tasks at both file and repository levels, LongBench examines models’ capacities to understand programming code over extended contexts.

Each dataset is standardized into a unified format for automatic evaluation, using task-appropriate metrics such as ROUGE-L and F1.
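As a rough illustration of what this standardization enables, the sketch below scores a QA-style prediction against a record in an assumed unified JSON-like layout; the field names (`context`, `input`, `answers`) and the token-level F1 shown here are illustrative choices, not a transcription of the official evaluation code.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between a predicted answer and one reference answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def score_qa_example(record: dict, prediction: str) -> float:
    """Best F1 over all reference answers, as is common for QA-style tasks."""
    return max(token_f1(prediction, ans) for ans in record["answers"])

# Illustrative record in an assumed unified format (field names are hypothetical):
example = {
    "context": "LongBench is a bilingual, multitask benchmark ...",
    "input": "How many datasets does LongBench contain?",
    "answers": ["21 datasets", "21"],
}
print(score_qa_example(example, "It contains 21 datasets"))
```

Other task types would plug their own scorer (e.g. a ROUGE-L implementation for summarization) into the same per-record loop.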

Experimental Evaluation

The paper reports a comprehensive assessment of eight diverse LLMs, ranging from open-source models to commercial ones such as GPT-3.5-Turbo-16k. Notable findings include:

  • GPT-3.5-Turbo-16k consistently outperforms its peers, yet still encounters challenges with longer contexts.
  • Techniques like scaled positional embeddings and fine-tuning on long sequences (as seen in models such as LongChat and ChatGLM2) significantly enhance long-context performance; a generic sketch of position scaling follows this list.
  • Context-compression techniques such as retrieval improve models with weak long-context ability, but their performance still lags behind models with inherently strong long-context understanding.
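To make the second finding above concrete, one widely used form of scaled positional embedding is linear position interpolation for rotary embeddings (RoPE), where positions are compressed by the ratio of the training window to the target window. The sketch below is a generic illustration of that idea, not the exact recipe used by LongChat or ChatGLM2.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Rotary-embedding angles; scale < 1 compresses positions (position interpolation)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # shape (dim/2,)
    scaled_pos = positions * scale                            # interpolate positions
    return np.outer(scaled_pos, inv_freq)                     # shape (seq_len, dim/2)

# Hypothetical case: a model trained with a 2k window, evaluated on 8k inputs.
train_len, target_len = 2048, 8192
scale = train_len / target_len  # 0.25 -> positions 0..8191 are squeezed into 0..2047
angles = rope_angles(np.arange(target_len), dim=128, scale=scale)
print(angles.shape)  # (8192, 64)
```

In practice this scaling is typically paired with fine-tuning on long sequences, which is the combination the paper credits for the largest gains.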

The analysis also draws on LongBench-E, a variant of the benchmark with a more uniform distribution of context lengths, which makes it possible to measure models' sensitivity to input length independently of task difficulty.
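A simple way to surface length sensitivity in this spirit is to bucket per-example scores by context length and compare bucket averages; the sketch below assumes 0-4k / 4k-8k / 8k+ word buckets (representative of LongBench-E's design, though the exact boundaries and field layout here are illustrative).

```python
from statistics import mean

def bucket_by_length(results, edges=(4000, 8000)):
    """Group (length, score) pairs into 0-4k / 4k-8k / 8k+ buckets and average each."""
    buckets = {"0-4k": [], "4k-8k": [], "8k+": []}
    for length, score in results:
        if length < edges[0]:
            buckets["0-4k"].append(score)
        elif length < edges[1]:
            buckets["4k-8k"].append(score)
        else:
            buckets["8k+"].append(score)
    return {name: (mean(scores) if scores else None) for name, scores in buckets.items()}

# Illustrative per-example results: (context length in words, metric score)
results = [(1200, 0.62), (3500, 0.58), (5200, 0.41), (9800, 0.30), (15000, 0.22)]
print(bucket_by_length(results))
# A steep drop across buckets signals length sensitivity rather than task difficulty.
```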

Implications and Future Directions

The development of LongBench opens several avenues for both theoretical and applied insights:

  • Practical Impacts: The benchmark supports developers in identifying strengths and weaknesses in model designs, particularly for applications requiring comprehensive document understanding, code analysis, and multilingual capabilities.
  • Theoretical Insights: Exploring the impact of context length on model performance could reveal deeper insights into LLMs' attention mechanisms and potential architectural improvements.
  • Innovation in Model Design: The paper’s conclusions suggest the need for architectures that handle longer contexts efficiently, potentially integrating advanced memory mechanisms or alternative positional-encoding schemes.

LongBench's balanced design in terms of task diversity and bilingual focus provides a valuable tool for future advancements in AI, contributing to more robust and adaptable LLMs capable of tackling real-world complexities involving extended textual materials.

Overall, LongBench represents a significant step forward in the ongoing development of benchmarks tailored for the evolving capacities of LLMs, marking a vital contribution to both the academic community and industry practitioners focusing on natural language processing and long-form text applications.

Authors (13)
  1. Yushi Bai
  2. Xin Lv
  3. Jiajie Zhang
  4. Hongchang Lyu
  5. Jiankai Tang
  6. Zhidian Huang
  7. Zhengxiao Du
  8. Xiao Liu
  9. Aohan Zeng
  10. Lei Hou
  11. Yuxiao Dong
  12. Jie Tang
  13. Juanzi Li
Citations (312)