
Long-context LLMs Struggle with Long In-context Learning

(2404.02060)
Published Apr 2, 2024 in cs.CL and cs.AI

Abstract

LLMs have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LongICLBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with label spaces spanning 28 to 174 classes and input (few-shot demonstration) lengths ranging from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space to make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that long-context LLMs perform relatively well on the less challenging tasks with shorter demonstration lengths by effectively utilizing the long context window. However, on the most challenging task, Discovery, with 174 labels, all of the LLMs struggle to understand the task definition and thus reach performance close to zero. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented toward the end of the sequence; their ability to reason over multiple pieces of information in the long sequence has yet to improve. Our study reveals that long-context understanding and reasoning remain challenging for existing LLMs. We believe LongICLBench can serve as a more realistic evaluation for future long-context LLMs.

Overview

  • The paper introduces LongICLBench, a benchmark for evaluating LLMs on long in-context learning tasks across six datasets with varying complexity.

  • Evaluation reveals that LLM performance declines as task complexity increases; on the most demanding Discovery dataset, with 174 labels, all models' accuracy drops close to zero.

  • Analysis shows that models tend to favor labels presented toward the end of the sequence, indicating a positional bias and difficulty in sustaining comprehensive reasoning over extended texts.

  • The findings emphasize the limitations of current LLMs in processing long, context-rich sequences and advocate for future research focused on enhancing semantic coherence and reasoning capabilities.

Long-context LLMs and Their Challenges with In-context Learning

Introduction to the Benchmark

Recent advancements in LLMs have ushered in a new era of handling extensive text sequences, some exceeding 32K tokens. Yet there remains a significant research gap in understanding these models' performance in nuanced, real-life scenarios, particularly concerning long in-context learning. This paper introduces LongICLBench, a benchmark tailored to probe long in-context learning within the domain of extreme-label classification. Spanning six datasets with varying difficulty levels, the benchmark comprehensively evaluates 13 long-context LLMs, uncovering critical insights into their performance.

Understanding the Benchmark

The benchmark encompasses datasets of increasing complexity, with label spaces varying from 28 to 174 classes and demonstration lengths extending from 2K to 50K tokens. These datasets are engineered to necessitate a deep understanding of the entire input for accurate predictions. Upon evaluation, a distinct performance degradation is noted as task complexity increases, with all models struggling significantly at the benchmark's apex, the Discovery dataset.
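
To make this setup concrete, the following is a minimal sketch of how a long in-context learning prompt for extreme-label classification might be assembled. The label names, demonstration sampling, and prompt template are illustrative assumptions, not the paper's exact pipeline; the key point is that every class appears only in the demonstrations, so the model must read the full context to know the label space.

```python
# Minimal sketch of assembling a long ICL prompt for extreme-label
# classification. Label names, sampling, and template are illustrative
# assumptions, not the paper's exact pipeline.
from typing import List, Tuple


def build_icl_prompt(
    demos: List[Tuple[str, str]],  # (text, label) demonstrations covering every class
    query: str,                    # test instance to classify
    label_space: List[str],        # all candidate labels (28-174 classes in the benchmark)
) -> str:
    lines = [
        "Classify the input into one of the following labels:",
        ", ".join(label_space),
        "",
    ]
    for text, label in demos:      # demonstrations may total 2K-50K tokens
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {query}\nLabel:")
    return "\n".join(lines)


# One demonstration per class keeps the full label space in context; adding
# more demonstrations per class is one way the input could grow toward the
# 50K-token setting described above.
labels = ["joy", "anger", "relief"]
demos = [(f"an example utterance expressing {c}", c) for c in labels]
print(build_icl_prompt(demos, "I finally finished the project!", labels))
```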

Insights from LongICLBench

The analysis delineates a stark contrast in model performances across the spectrum of datasets:

  • Models exhibit competent performance with shorter demonstrations, leveraging their long-context capabilities.
  • Accuracy declines steeply as task complexity surges, most evidently in models evaluated on the Discovery dataset.
  • An observed tendency among models to favor end-sequence labels suggests a positional bias and a lack of comprehensive reasoning over the entire input sequence; one way this bias might be quantified is sketched below.
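
The following is a minimal sketch, using assumed record fields, of how such a positional bias could be measured: group test instances by where the demonstrations of their gold label sit in the prompt, then compare accuracy across position bins.

```python
# Sketch of quantifying the end-of-sequence bias: bin test instances by the
# relative position of their gold label's demonstrations inside the prompt,
# then compare accuracy per bin. Record fields and bin count are assumptions.
from collections import defaultdict
from typing import Dict, Iterable


def positional_accuracy(records: Iterable[Dict], n_bins: int = 4) -> Dict[int, float]:
    """Each record carries 'gold', 'pred', and 'gold_demo_position', the
    relative offset (0.0 = start of prompt, 1.0 = end) of the gold label's
    demonstrations within the prompt."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        b = min(int(r["gold_demo_position"] * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(r["pred"] == r["gold"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# A monotonic increase in accuracy toward the last bin would indicate that
# models favor labels whose demonstrations sit near the end of the sequence.
```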

Theoretical and Practical Implications

This research highlights the current LLMs' limitations in processing and understanding long, context-rich texts. The findings suggest that despite the significant strides made in enhancing LLMs' context window capabilities, there remains a palpable gap in these models' ability to engage in deep semantic understanding and reasoning over lengthy inputs. From a practical standpoint, this benchmark could serve as a critical tool in refining and evaluating future LLMs designed for long-context comprehension.

Future Directions in AI

The nuanced performance assessment conducted through LongICLBench underlines the necessity for continued innovation in the development of LLMs. Future research could focus on enhancing the models' ability to maintain semantic coherence over extended sequences and mitigating the observed positional biases. Additionally, exploring architectural innovations or training methodologies that bolster long-horizon reasoning capabilities could pave the way for LLMs that are truly adept at navigating complex, real-world scenarios.

Conclusion

The introduction of LongICLBench marks a pivotal step towards a more nuanced understanding of LLMs' capabilities in long in-context learning tasks. The benchmark's comprehensive evaluation uncovers critical insights, driving home the necessity for focused efforts to address the highlighted limitations. As the field continues to advance, LongICLBench will undoubtedly play a crucial role in shaping the trajectory of long-context model development, guiding researchers towards creating models that are not only technically sophisticated but also capable of nuanced understanding and reasoning across extensive texts.
