In-Context Learning with Long-Context Models: An In-Depth Exploration (2405.00200v1)

Published 30 Apr 2024 in cs.CL

Abstract: As model context lengths continue to increase, the number of demonstrations that can be provided in-context approaches the size of entire training datasets. We study the behavior of in-context learning (ICL) at this extreme scale on multiple datasets and models. We show that, for many datasets with large label spaces, performance continues to increase with hundreds or thousands of demonstrations. We contrast this with example retrieval and finetuning: example retrieval shows excellent performance at low context lengths but has diminished gains with more demonstrations; finetuning is more data hungry than ICL but can sometimes exceed long-context ICL performance with additional data. We use this ICL setting as a testbed to study several properties of both in-context learning and long-context models. We show that long-context ICL is less sensitive to random input shuffling than short-context ICL, that grouping of same-label examples can negatively impact performance, and that the performance boosts we see do not arise from cumulative gain from encoding many examples together. We conclude that although long-context ICL can be surprisingly effective, most of this gain comes from attending back to similar examples rather than task learning.

Exploring the Depths of In-Context Learning with Long-Context Models

Introduction to In-Context Learning with Long Contexts

As LLM context windows grow, in-context learning (ICL) is no longer limited to a handful of demonstrations: the number of examples that fit in context now approaches the size of entire training datasets. This paper studies ICL at that extreme scale across multiple datasets and models, and contrasts it with two natural alternatives, example retrieval and finetuning.

Key Findings: Performance in Long-context ICL

The authors find that, for many datasets with large label spaces, ICL performance continues to improve as the number of demonstrations grows into the hundreds or thousands: more in-context data keeps helping. The main observations are:

  1. Performance Scaling: Accuracy keeps improving as the number of in-context examples grows to extreme values, with gains observed up to roughly 2,000 demonstrations. Long-context ICL is also less sensitive to random shuffling of the input demonstrations than short-context ICL.
  2. Retrieval vs. Random Sampling: At short context lengths, retrieving relevant examples substantially outperforms using a random subset of demonstrations; as the number of demonstrations grows, this advantage shrinks, so retrieval yields diminishing benefits once the context is long enough.
  3. Label Grouping Hurts: Placing all demonstrations with the same label next to each other degrades performance compared to randomly interleaving examples across labels. (The sketch after this list illustrates these prompt-construction choices.)
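
A minimal sketch of what these prompt-construction strategies could look like for a classification dataset. The build_prompt helper, the Input:/Label: template, and the use of a BM25 retriever are illustrative assumptions, not the paper's exact setup.

```python
import random
from rank_bm25 import BM25Okapi  # lexical retriever; an assumption, not necessarily the paper's choice

def build_prompt(train_set, test_input, k, strategy="random"):
    """Assemble a many-shot ICL prompt from (text, label) pairs.

    strategy: "random"   - sample k demonstrations and shuffle them
              "retrieve" - pick the k nearest BM25 neighbors of the test input
              "grouped"  - block same-label demonstrations together (hurts performance)
    """
    if strategy == "retrieve":
        corpus = [text for text, _ in train_set]
        bm25 = BM25Okapi([text.split() for text in corpus])
        top = set(bm25.get_top_n(test_input.split(), corpus, n=k))
        demos = [(t, l) for t, l in train_set if t in top][:k]
    else:
        demos = random.sample(train_set, k)
        if strategy == "grouped":
            demos.sort(key=lambda pair: pair[1])  # group demonstrations by label
        else:
            random.shuffle(demos)

    blocks = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Input: {test_input}\nLabel:")
    return "\n\n".join(blocks)

# Usage: sweep the number of demonstrations to observe the scaling trend, e.g.
# for k in (10, 100, 1000, 2000): prompt = build_prompt(train_set, query, k)
```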

These results point to a practical and comparatively inexpensive deployment pattern: encode a single large set of demonstrations once, cache it, and reuse the cached encoding across many inference examples, as in the sketch below.
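
One way to realize this caching, sketched here with the Hugging Face transformers API: the demonstration block is encoded once, its key/value states are kept, and the cache is reused for every test query. The model name, toy demonstrations, and decoding settings are illustrative assumptions, and cache handling details vary across library versions.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # placeholder; any long-context causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# In the long-context setting this prefix would hold hundreds or thousands of demonstrations.
demo_prompt = (
    "Input: great movie\nLabel: positive\n\n"
    "Input: dull plot\nLabel: negative\n\n"
)

# Encode the demonstration block once and keep its key/value states.
prefix = tok(demo_prompt, return_tensors="pt")
with torch.no_grad():
    demo_cache = model(**prefix, use_cache=True).past_key_values

# Reuse the cached prefix for each test example instead of re-encoding it.
for query in ["loved the acting", "a waste of time"]:
    full = tok(demo_prompt + f"Input: {query}\nLabel:", return_tensors="pt")
    out = model.generate(
        **full,
        past_key_values=copy.deepcopy(demo_cache),  # generate mutates the cache, so copy it
        max_new_tokens=3,
    )
    print(tok.decode(out[0, full.input_ids.shape[1]:], skip_special_tokens=True))
```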

Comparison with Finetuning

The paper also compares long-context ICL against finetuning on the same data:

  1. Data Efficiency: Finetuning is more data-hungry than ICL, but given enough additional data it can sometimes exceed long-context ICL performance.
  2. Performance Gains: On datasets with larger label spaces, finetuning does not consistently outperform ICL, pointing to an interaction between task complexity, label diversity, and the chosen adaptation method.

These findings identify regimes where finetuning is less clearly the default choice than commonly assumed, particularly once the amount of available data scales up; a sketch of a parameter-efficient finetuning baseline follows.
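
As a rough illustration of such a finetuning baseline, the sketch below sets up parameter-efficient LoRA finetuning with the Hugging Face transformers and peft libraries. The base model name, target modules, and hyperparameters are placeholder assumptions rather than the paper's configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of weights are trainable

# The wrapped model can then be trained on the demonstration set with a standard
# training loop and compared against ICL that places the same examples in-context.
```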

Future Implications and Theoretical Insights

The paper outlines several future directions and theoretical implications for AI and machine learning:

  • Efficiency vs. Effectiveness: Since adding more in-context examples continues to help, the trade-off between inference-time computational cost and accuracy becomes a central consideration in systems design.
  • The Role of Memory and Recall in LLMs: Careful example selection matters less as context length grows, suggesting that much of the gain comes from the model attending back to similar examples in its context rather than from explicit task learning.
  • Potential for Less Supervised Learning: If LLMs can learn effectively from large, lightly curated demonstration sets placed in context, robust task adaptation may require less careful supervision than it does today.

Speculating on What Lies Ahead

Looking forward, long-context ICL is likely to intersect with new model architectures and with learning paradigms that rely on less supervision and heavier use of available data. The work enriches our understanding of current model capabilities and points toward models that adapt to new tasks from large, loosely structured demonstration sets without heavy human oversight or costly retraining cycles.

Authors (6)
  1. Amanda Bertsch
  2. Maor Ivgi
  3. Uri Alon
  4. Jonathan Berant
  5. Matthew R. Gormley
  6. Graham Neubig
Citations (42)