
Long-context LLMs Struggle with Long In-context Learning (2404.02060v3)

Published 2 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have made significant strides in handling long sequences. Some models like Gemini could even be capable of dealing with millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and synthetic tasks, which may not fully capture their true abilities in more challenging, real-world scenarios. We introduce a benchmark (LongICLBench) for long in-context learning in extreme-label classification using six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. Our benchmark requires LLMs to comprehend the entire input to recognize the massive label space and make correct predictions. We evaluate 15 long-context LLMs and find that they perform well on less challenging classification tasks with smaller label spaces and shorter demonstrations. However, they struggle with more challenging tasks like Discovery with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias towards labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study reveals that long-context understanding and reasoning is still a challenging task for existing LLMs. We believe LongICLBench could serve as a more realistic evaluation for future long-context LLMs.

Long-context LLMs and Their Challenges with In-context Learning

Introduction to the Benchmark

Recent advancements in LLMs have ushered in a new era of handling extensive text sequences, some exceeding 32K tokens. Yet, a significant research gap remains in understanding how these models perform in nuanced real-life scenarios, particularly long in-context learning. This paper introduces LongICLBench, a benchmark tailored to probe long in-context learning within the domain of extreme-label classification. Spanning six datasets of varying difficulty, the benchmark comprehensively evaluates 15 long-context LLMs, uncovering critical insights into their performance landscape.

Understanding the Benchmark

The benchmark comprises six datasets of increasing complexity, with label spaces ranging from 28 to 174 classes and input lengths extending from 2K to 50K tokens. The datasets are constructed so that accurate prediction requires understanding the entire input. Upon evaluation, a distinct performance degradation is observed as task complexity increases, with all models struggling significantly on the benchmark's most demanding task, the Discovery dataset.
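
To make the setup concrete, the sketch below shows one plausible way such a prompt could be assembled: demonstrations are concatenated up to a length budget, the full label space is listed, and the model is asked to label a held-out query. This is a hypothetical illustration, not the paper's actual prompt template or code; the function name, format strings, and character-based budget are assumptions.

```python
# Hypothetical sketch of a long in-context-learning prompt for extreme-label
# classification (not LongICLBench's actual template or code).

from typing import List, Tuple

def build_icl_prompt(
    demos: List[Tuple[str, str]],   # (text, label) demonstration pairs
    query: str,                     # test instance to classify
    label_set: List[str],           # full label space, e.g. 174 Discovery labels
    max_chars: int = 200_000,       # rough stand-in for a ~50K-token budget
) -> str:
    """Concatenate demonstrations until the budget is reached, then append the query."""
    header = (
        "Classify the final input into exactly one of these labels:\n"
        + ", ".join(label_set) + "\n\n"
    )
    parts, used = [header], len(header)
    for text, label in demos:
        block = f"Input: {text}\nLabel: {label}\n\n"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    parts.append(f"Input: {query}\nLabel:")
    return "".join(parts)

# Toy usage:
demos = [("I feel great today.", "joy"), ("This is so unfair.", "anger")]
print(build_icl_prompt(demos, "I can't stop smiling.", ["joy", "anger", "fear"])[:300])
```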

Insights from LongICLBench

The analysis delineates a stark contrast in model performances across the spectrum of datasets:

  • Models exhibit competent performance with shorter demonstrations, leveraging their long-context capabilities.
  • A steep decline in accuracy occurs as task complexity surges, particularly evident in models evaluated against the Discovery dataset.
  • An observed tendency to favor labels presented near the end of the sequence points to a positional bias and a lack of comprehensive reasoning over the entire input; a simple way to probe this is sketched below.
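
One way to quantify such a positional bias is to record where each test instance's gold label appears among the in-context demonstrations and compare accuracy across position buckets. The following is an illustrative analysis in the spirit of the paper's findings, not its actual evaluation code; the bucketing scheme and function name are assumptions.

```python
# Hypothetical probe of positional bias: bucket test instances by where their
# gold label's demonstrations sit in the prompt (0 = start, 1 = end) and
# compare per-bucket accuracy. Not the paper's evaluation code.

from collections import defaultdict
from typing import Dict, List

def accuracy_by_label_position(
    gold: List[str],              # gold labels of test instances
    pred: List[str],              # model predictions
    demo_position: List[float],   # relative position of each gold label's demos
    n_buckets: int = 4,
) -> Dict[int, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for g, p, pos in zip(gold, pred, demo_position):
        bucket = min(int(pos * n_buckets), n_buckets - 1)
        total[bucket] += 1
        correct[bucket] += int(g == p)
    # Accuracy rising toward the last bucket indicates end-of-sequence bias.
    return {b: correct[b] / total[b] for b in sorted(total)}

# Toy example: the model is only right when the label appeared late in the prompt.
print(accuracy_by_label_position(
    gold=["a", "b", "c", "d"],
    pred=["x", "x", "c", "d"],
    demo_position=[0.1, 0.4, 0.7, 0.95],
))
```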

Theoretical and Practical Implications

This research highlights the current LLMs' limitations in processing and understanding long, context-rich texts. The findings suggest that despite the significant strides made in enhancing LLMs' context window capabilities, there remains a palpable gap in these models' ability to engage in deep semantic understanding and reasoning over lengthy inputs. From a practical standpoint, this benchmark could serve as a critical tool in refining and evaluating future LLMs designed for long-context comprehension.

Future Directions in AI

The nuanced performance assessment conducted through LongICLBench underlines the necessity for continued innovation in the development of LLMs. Future research could focus on enhancing the models' ability to maintain semantic coherence over extended sequences and mitigating the observed positional biases. Additionally, exploring architectural innovations or training methodologies that bolster long-horizon reasoning capabilities could pave the way for LLMs that are truly adept at navigating complex, real-world scenarios.

Conclusion

The introduction of LongICLBench marks a pivotal step towards a more nuanced understanding of LLMs' capabilities in long in-context learning tasks. The benchmark's comprehensive evaluation uncovers critical insights, driving home the necessity for focused efforts to address the highlighted limitations. As the field continues to advance, LongICLBench will undoubtedly play a crucial role in shaping the trajectory of long-context model development, guiding researchers towards creating models that are not only technically sophisticated but also capable of nuanced understanding and reasoning across extensive texts.

Authors (5)
  1. Tianle Li
  2. Ge Zhang
  3. Quy Duc Do
  4. Xiang Yue
  5. Wenhu Chen