Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions (2407.02028v1)
Abstract: We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions. For this purpose, we created a novel benchmark of hard scientific questions, each paired with contexts of varying relevance. We show that, counter-intuitively, a context more aligned with the question's topic does not always help more than a less relevant one. This effect is especially visible for open questions and for questions of high difficulty or novelty. The result reveals a fundamental difference in how large language models (LLMs) treat closed-form versus open-form questions, and it shows the need for more robust evaluation of in-context learning across a variety of question types. It also raises a new question: how to optimally select a context for an LLM, especially in Retrieval-Augmented Generation (RAG) systems. Our results suggest that the answer can be highly application-dependent and may be contingent on factors including the format of the question, the perceived difficulty of the question, and the novelty or popularity of the information sought.
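To make the evaluation protocol concrete, below is a minimal Python sketch of the setup the abstract describes: each benchmark question is answered under contexts of differing relevance, and scores are aggregated per (question format, context relevance) cell. The `query_llm` stub, the `grade_answer` heuristic, and the benchmark rows are illustrative assumptions, not the authors' actual harness or data.

```python
# Minimal sketch of the evaluation described in the abstract: answer each
# question under contexts of varying relevance, then aggregate scores by
# question format ("open"/"closed") and context relevance. All names here
# are hypothetical placeholders, not the paper's actual code.

from collections import defaultdict

# Hypothetical benchmark rows: a question, its gold answer, its format,
# and candidate contexts labeled by topical relevance to the question.
benchmark = [
    {
        "question": "Why is there an upper mass limit for neutron stars?",
        "gold": "Neutron degeneracy pressure cannot support arbitrary mass.",
        "format": "open",  # "open" or "closed"
        "contexts": {
            "high": "A passage closely aligned with the question topic.",
            "low": "A loosely related passage on stellar physics.",
            "none": "",
        },
    },
    # ... more questions ...
]


def query_llm(prompt: str) -> str:
    # Placeholder: swap in a real model call (e.g., an API client).
    # Returning a fixed string keeps the sketch runnable end to end.
    return "placeholder answer"


def grade_answer(answer: str, gold: str, fmt: str) -> float:
    """Exact match for closed questions; for open questions, a crude
    token-overlap proxy stands in for the rubric or human judgment a
    real evaluation would use."""
    if fmt == "closed":
        return float(answer.strip().lower() == gold.strip().lower())
    a, g = set(answer.lower().split()), set(gold.lower().split())
    return len(a & g) / max(len(g), 1)


scores = defaultdict(list)
for item in benchmark:
    for relevance, context in item["contexts"].items():
        prompt = (f"Context: {context}\n\n" if context else "")
        prompt += f"Question: {item['question']}\nAnswer:"
        answer = query_llm(prompt)
        scores[(item["format"], relevance)].append(
            grade_answer(answer, item["gold"], item["format"])
        )

# Average score per (format, relevance) cell; the paper's finding is that
# the high-relevance cell does not uniformly dominate the others.
for key, vals in sorted(scores.items()):
    print(key, sum(vals) / len(vals))
```

A harness like this makes the paper's comparison explicit: if more relevant contexts always helped, the high-relevance cell would dominate every row, which is exactly what the reported results contradict for open and high-novelty questions.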