
Long-context LLMs Struggle with Long In-context Learning

(2404.02060)
Published Apr 2, 2024 in cs.CL and cs.AI

Abstract

LLMs have made significant strides in handling long sequences exceeding 32K tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their abilities in more nuanced, real-world scenarios. This study introduces a specialized benchmark (LongICLBench) focusing on long in-context learning within the realm of extreme-label classification. We meticulously selected six datasets with label spaces spanning 28 to 174 classes and input (few-shot demonstration) lengths ranging from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input and recognize the massive label space to make correct predictions. We evaluate 13 long-context LLMs on our benchmark. We find that long-context LLMs perform relatively well on the less challenging tasks with shorter demonstration lengths by effectively utilizing the long context window. However, on the most challenging task, Discovery, with 174 labels, all of the LLMs struggle to understand the task definition and thus reach performance close to zero. This suggests a notable gap in current LLM capabilities for processing and understanding long, context-rich sequences. Further analysis revealed a tendency among models to favor predictions for labels presented toward the end of the sequence; their ability to reason over multiple pieces of information in the long sequence has yet to improve. Our study reveals that long-context understanding and reasoning remain challenging for existing LLMs. We believe LongICLBench can serve as a more realistic evaluation for future long-context LLMs.

Overview

  • The paper introduces LongICLBench, a benchmark for evaluating LLMs on long in-context learning tasks across six datasets with varying complexity.

  • Evaluation reveals that LLM performance declines as task complexity increases; on the most demanding Discovery dataset, with 174 labels, all models' accuracy drops close to zero.

  • Analysis shows that models tend to favor labels presented toward the end of the sequence, indicating a positional bias and difficulty in sustaining comprehensive reasoning over extended texts.

  • The findings emphasize the limitations of current LLMs in processing long, context-rich sequences and advocate for future research focused on enhancing semantic coherence and reasoning capabilities.

Long-context LLMs and Their Challenges with In-context Learning

Introduction to the Benchmark

Recent advancements in LLMs have ushered in a new era of handling extensive text sequences, some exceeding 32K tokens. Yet there remains a significant research gap in understanding these models' performance in nuanced, real-life scenarios, particularly concerning long in-context learning. This paper introduces LongICLBench, a benchmark tailored to probe long in-context learning within the domain of extreme-label classification. Spanning six datasets with varying difficulty levels, the benchmark comprehensively evaluates 13 long-context LLMs, uncovering critical insights into their performance.

Understanding the Benchmark

The benchmark encompasses datasets of increasing complexity, with label spaces varying from 28 to 174 classes and demonstration lengths extending from 2K to 50K tokens. These datasets are engineered to necessitate a deep understanding of the entire input for accurate predictions. Upon evaluation, a distinct performance degradation is noted as task complexity increases, with all models struggling significantly at the benchmark's apex, the Discovery dataset.
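
To make this setup concrete, the following is a minimal sketch of how a long in-context learning prompt for extreme-label classification might be assembled. The label names, demonstration sampling, and prompt template are illustrative assumptions, not the paper's exact pipeline; the key point is that every class appears only in the demonstrations, so the model must read the full context to know the label space.

```python
# Minimal sketch of assembling a long ICL prompt for extreme-label
# classification. Label names, sampling, and template are illustrative
# assumptions, not the paper's exact pipeline.
from typing import List, Tuple


def build_icl_prompt(
    demos: List[Tuple[str, str]],  # (text, label) demonstrations covering every class
    query: str,                    # test instance to classify
    label_space: List[str],        # all candidate labels (28-174 classes in the benchmark)
) -> str:
    lines = [
        "Classify the input into one of the following labels:",
        ", ".join(label_space),
        "",
    ]
    for text, label in demos:      # demonstrations may total 2K-50K tokens
        lines.append(f"Input: {text}\nLabel: {label}\n")
    lines.append(f"Input: {query}\nLabel:")
    return "\n".join(lines)


# One demonstration per class keeps the full label space in context; adding
# more demonstrations per class is one way the input could grow toward the
# 50K-token setting described above.
labels = ["joy", "anger", "relief"]
demos = [(f"an example utterance expressing {c}", c) for c in labels]
print(build_icl_prompt(demos, "I finally finished the project!", labels))
```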

Insights from LongICLBench

The analysis delineates a stark contrast in model performances across the spectrum of datasets:

  • Models exhibit competent performance with shorter demonstrations, leveraging their long-context capabilities.
  • Accuracy declines steeply as task complexity surges, most evidently in models evaluated on the Discovery dataset.
  • An observed tendency among models to favor end-sequence labels suggests a positional bias and a lack of comprehensive reasoning over the entire input sequence; one way this bias might be quantified is sketched below.
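
The following is a minimal sketch, using assumed record fields, of how such a positional bias could be measured: group test instances by where the demonstrations of their gold label sit in the prompt, then compare accuracy across position bins.

```python
# Sketch of quantifying the end-of-sequence bias: bin test instances by the
# relative position of their gold label's demonstrations inside the prompt,
# then compare accuracy per bin. Record fields and bin count are assumptions.
from collections import defaultdict
from typing import Dict, Iterable


def positional_accuracy(records: Iterable[Dict], n_bins: int = 4) -> Dict[int, float]:
    """Each record carries 'gold', 'pred', and 'gold_demo_position', the
    relative offset (0.0 = start of prompt, 1.0 = end) of the gold label's
    demonstrations within the prompt."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        b = min(int(r["gold_demo_position"] * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(r["pred"] == r["gold"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}


# A monotonic increase in accuracy toward the last bin would indicate that
# models favor labels whose demonstrations sit near the end of the sequence.
```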

Theoretical and Practical Implications

This research highlights the current LLMs' limitations in processing and understanding long, context-rich texts. The findings suggest that despite the significant strides made in enhancing LLMs' context window capabilities, there remains a palpable gap in these models' ability to engage in deep semantic understanding and reasoning over lengthy inputs. From a practical standpoint, this benchmark could serve as a critical tool in refining and evaluating future LLMs designed for long-context comprehension.

Future Directions in AI

The nuanced performance assessment conducted through LongICLBench underlines the necessity for continued innovation in the development of LLMs. Future research could focus on enhancing the models' ability to maintain semantic coherence over extended sequences and mitigating the observed positional biases. Additionally, exploring architectural innovations or training methodologies that bolster long-horizon reasoning capabilities could pave the way for LLMs that are truly adept at navigating complex, real-world scenarios.

Conclusion

The introduction of LongICLBench marks a pivotal step towards a more nuanced understanding of LLMs' capabilities in long in-context learning tasks. The benchmark's comprehensive evaluation uncovers critical insights, driving home the necessity for focused efforts to address the highlighted limitations. As the field continues to advance, LongICLBench will undoubtedly play a crucial role in shaping the trajectory of long-context model development, guiding researchers towards creating models that are not only technically sophisticated but also capable of nuanced understanding and reasoning across extensive texts.
