Long-context LLMs Struggle with Long In-context Learning (2404.02060v3)
Abstract: LLMs have made significant strides in handling long sequences; some models, such as Gemini, are even reported to handle millions of tokens. However, their performance evaluation has largely been confined to metrics like perplexity and to synthetic tasks, which may not fully capture their abilities in more challenging, real-world scenarios. We introduce LongICLBench, a benchmark for long in-context learning in extreme-label classification, built from six datasets with 28 to 174 classes and input lengths from 2K to 50K tokens. The benchmark requires LLMs to comprehend the entire input and recognize the massive label space in order to make correct predictions. We evaluate 15 long-context LLMs and find that they perform well on the less challenging classification tasks with smaller label spaces and shorter demonstrations, but struggle with harder tasks such as Discovery, which has 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias toward labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study shows that long-context understanding and reasoning remain challenging for existing LLMs, and we believe LongICLBench can serve as a more realistic evaluation of future long-context models.
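To make the evaluation setup concrete, the sketch below shows one way a long in-context learning prompt of this kind could be assembled: many labeled demonstrations covering the full label space are concatenated into a single long context, followed by a query whose label the model must predict. This is a minimal illustration under assumed data formats and helper names, not the benchmark's released code.

```python
# Minimal sketch (assumed format, not the authors' implementation) of a
# LongICLBench-style prompt: many labeled demonstrations followed by a query.
from typing import List, Tuple


def build_long_icl_prompt(demos: List[Tuple[str, str]], query_text: str) -> str:
    """Concatenate (text, label) demonstrations into one long context, then append the query."""
    parts = []
    for text, label in demos:
        parts.append(f"Input: {text}\nLabel: {label}\n")
    # The model only sees the label space through the demonstrations,
    # so it must read the entire context to answer correctly.
    parts.append(f"Input: {query_text}\nLabel:")
    return "\n".join(parts)


# Toy usage; real tasks such as Discovery involve up to 174 labels and
# prompts ranging from roughly 2K to 50K tokens.
demos = [
    ("The service was slow but the food was great.", "mixed"),
    ("Absolutely loved the experience!", "positive"),
    ("I will never come back here again.", "negative"),
]
print(build_long_icl_prompt(demos, "The staff were friendly and helpful."))
```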