Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions (2407.02028v1)

Published 2 Jul 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions. For that purpose, we created a novel benchmark consisting of hard scientific questions, each paired with a context of varying relevance. We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context. This effect is especially visible for open questions and questions of high difficulty or novelty. This result reveals a fundamental difference between the treatment of closed-form and open-form questions by large language models (LLMs) and shows a need for a more robust evaluation of in-context learning on a variety of question types. It also poses a new question of how to optimally select a context for LLMs, especially in the context of Retrieval Augmented Generation (RAG) systems. Our results suggest that the answer to this question can be highly application-dependent and might be contingent on factors including the format of the question, the perceived difficulty level of the questions, and the novelty or popularity of the information we seek.


Summary

  • The paper introduces a novel benchmark of 160 scientific questions to systematically assess in-context learning in large language models.
  • It employs a rigorous evaluation framework with multi-criteria scoring, revealing that irrelevant context can surprisingly boost performance on open questions.
  • Findings suggest that context selection strategies must be tailored to question type to optimize GPT-4’s performance across varying difficulty levels.

Evaluating In-Context Learning on Open and Closed Questions

The paper "Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions" by Xiang Li et al. meticulously examines the efficacy of in-context learning (ICL) within LLMs like GPT-4 when faced with varied question types. The authors create a novel benchmark comprising difficult scientific questions, ranging from open to closed formats, to explore how different types of context influence the performance of LLMs. Notably, the results reveal that context relevancy does not straightforwardly correlate with improved performance, particularly for open questions and those of high difficulty or novelty.

Key Contributions

The paper offers several noteworthy contributions:

  1. Novel Dataset Creation:
    • The authors developed a new benchmark of 160 unique scientific questions from the domains of physics and computer science, spanning different levels of difficulty and originality. Each question was paired with varying context types: highly relevant, vague, irrelevant, and no context.
  2. Comprehensive Evaluation Framework:
    • The responses generated by GPT-4 were evaluated using a detailed scoring rubric encompassing Completeness and Relevancy (Correctness), Logic and Reasoning, and Truthfulness (absence of hallucinations). Each question was assessed by six independent graders, who also provided qualitative feedback; a minimal sketch of this setup appears after this list.
  3. Comparison with Existing Benchmarks:
    • The paper juxtaposes its findings with in-context learning performance on well-established datasets, such as MetaICL and NephSAP, which consist exclusively of closed-form questions. In contrast to the open questions, the closed questions showed a positive correlation between context relevancy and model performance.
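
As a concrete illustration of the setup described above, the following is a minimal Python sketch of one benchmark item with its context variants and the multi-grader rubric aggregation. All field names and the toy data are assumptions for illustration only; the paper's actual data schema is not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical labels; the paper pairs each question with contexts of varying relevance.
CONTEXT_TYPES = ["relevant", "vague", "irrelevant", "none"]
CRITERIA = ["correctness", "logic", "truthfulness"]

@dataclass
class BenchmarkItem:
    question: str
    domain: str           # e.g. "physics" or "computer science"
    question_format: str  # "open" or "closed"
    difficulty: int       # assigned difficulty level
    contexts: dict        # context type -> context text ("" when no context is given)

def build_prompt(item: BenchmarkItem, context_type: str) -> str:
    """Assemble the prompt for one context condition."""
    context = item.contexts.get(context_type, "")
    if context:
        return f"Context:\n{context}\n\nQuestion: {item.question}"
    return f"Question: {item.question}"

def aggregate_scores(grades: list[dict]) -> dict:
    """Average each rubric criterion over the independent graders."""
    return {c: mean(g[c] for g in grades) for c in CRITERIA}

# Usage with toy data
item = BenchmarkItem(
    question="Why does a superconductor expel magnetic fields?",
    domain="physics",
    question_format="open",
    difficulty=3,
    contexts={"relevant": "The Meissner effect ...", "irrelevant": "Sorting algorithms ...", "none": ""},
)
print(build_prompt(item, "irrelevant"))
print(aggregate_scores([
    {"correctness": 4, "logic": 5, "truthfulness": 5},
    {"correctness": 3, "logic": 4, "truthfulness": 5},
]))
```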

Numerical Results

The numerical findings of the paper highlight significant discrepancies:

  • For open questions, the overall performance of the model improved when provided with irrelevant or no context, contrary to the expectation that more relevant context would enhance accuracy.
  • This counter-intuitive trend was especially pronounced for questions of higher difficulty and novelty, suggesting complex interactions within the model when handling more challenging inquiries.
  • In closed-form questions, as demonstrated on the MetaICL and NephSAP datasets, more relevant context led to improved performance, corroborating earlier studies that emphasized the utility of relevant context in in-context learning.

Theoretical and Practical Implications

The results have profound implications for both the theoretical understanding of in-context learning and its practical application:

  • Theoretical Implications:
    • The paper underscores the inherent differences in how LLMs process open versus closed questions. In contexts where the model needs to generate open-form responses, simpler or less relevant contexts might reduce cognitive load, aiding the model in solving complex problems without being overly constrained by the context.
    • This insight prompts further investigation into the cognitive mechanisms of LLMs, particularly how they balance context utilization between understanding a problem and generating a response.
  • Practical Implications:
    • For practical applications, the findings indicate that designing retrieval-augmented generation (RAG) systems should be highly task-specific. Depending on whether the task involves closed-form or open-form questions, the strategy for context selection must be tailored to prevent performance degradation.
    • The suggestion of sampling context from regions beyond the immediate vicinity of the query point in embedding space could lead to novel approaches to enhancing LLM performance across practical scenarios; a sketch of such sampling follows this list.
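
Below is a minimal sketch of what such "shell" sampling could look like: instead of taking the nearest neighbors of the query embedding, it keeps passages whose cosine similarity falls inside a band [low, high). The function names, thresholds, and toy embeddings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_similarities(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

def retrieve_from_shell(query_vec, doc_matrix, low: float, high: float, k: int = 3):
    """Return indices of up to k documents whose similarity to the query lies in
    the band [low, high) -- a 'shell' around the query rather than its nearest neighborhood."""
    sims = cosine_similarities(query_vec, doc_matrix)
    in_shell = np.where((sims >= low) & (sims < high))[0]
    # Within the shell, prefer the most similar documents.
    order = in_shell[np.argsort(sims[in_shell])[::-1]]
    return order[:k], sims[order[:k]]

# Usage with random toy embeddings; the band boundaries would need tuning per application.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
query = rng.normal(size=384)
idx, scores = retrieve_from_shell(query, docs, low=0.05, high=0.25)
print(idx, scores)
```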

Future Research Directions

This research opens several avenues for future exploration:

  • Refinement of Context Selection Strategies:
    • Further studies could refine the proposed context selection strategies by experimenting with different "shells" of context relevancy. Determining the optimal thickness and range of these shells may yield more precise guidance for RAG systems.
  • Extended Benchmarking Across Domains:
    • Expanding the benchmark to include diverse scientific and non-scientific domains can generalize the findings. Understanding whether these trends hold across disciplines would be vital for developing universally robust LLMs.
  • Exploration of Model Internals:
    • Investigations into the internal processes of LLMs when dealing with varied context types might illuminate how different regions of the model's architecture contribute to its context handling capabilities.

In summary, Xiang Li et al. provide a comprehensive and insightful investigation into the nuances of in-context learning, revealing significant differences between open and closed question processing within LLMs. These findings challenge traditional notions of context relevancy and offer a valuable framework for advancing both the theoretical understanding and practical deployment of AI systems.