Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions (2407.02028v1)

Published 2 Jul 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions. For that purpose, we created a novel benchmark consisting of hard scientific questions, each paired with a context of varying relevance. We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context. This effect is especially visible for open questions and questions of high difficulty or novelty. This result reveals a fundamental difference between the treatment of closed-form and open-form questions by large language models (LLMs) and shows a need for a more robust evaluation of in-context learning on a variety of question types. It also poses a new question of how to optimally select a context for LLMs, especially in the context of Retrieval Augmented Generation (RAG) systems. Our results suggest that the answer to this question can be highly application-dependent and might be contingent on factors including the format of the question, the perceived difficulty level of the questions, and the novelty or popularity of the information we seek.


Summary

  • The paper introduces a novel benchmark of 160 scientific questions to systematically assess in-context learning in large language models.
  • It employs a rigorous evaluation framework with multi-criteria scoring, revealing that irrelevant context can surprisingly boost performance on open questions.
  • Findings suggest that context selection strategies must be tailored to question type to optimize GPT-4’s performance across varying difficulty levels.

Evaluating In-Context Learning on Open and Closed Questions

The paper "Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions" by Xiang Li et al. meticulously examines the efficacy of in-context learning (ICL) within LLMs like GPT-4 when faced with varied question types. The authors create a novel benchmark comprising difficult scientific questions, ranging from open to closed formats, to explore how different types of context influence the performance of LLMs. Notably, the results reveal that context relevancy does not straightforwardly correlate with improved performance, particularly for open questions and those of high difficulty or novelty.

Key Contributions

The paper offers several noteworthy contributions:

  1. Novel Dataset Creation:
    • The authors developed a new benchmark of 160 unique scientific questions from the domains of physics and computer science, spanning different levels of difficulty and originality. Each question was paired with varying context types: highly relevant, vague, irrelevant, and no context.
  2. Comprehensive Evaluation Framework:
    • The responses generated by GPT-4 were evaluated using a detailed scoring rubric encompassing Completeness and Relevancy (Correctness), Logic and Reasoning, and Truthfulness (absence of hallucinations). Each question was assessed by six independent graders, who also provided qualitative feedback; a minimal sketch of this setup appears after this list.
  3. Comparison with Existing Benchmarks:
    • The paper juxtaposes its findings with in-context learning performance on well-established datasets, such as MetaICL and NephSAP, which consist exclusively of closed-form questions. In contrast to the open questions, the closed questions showed a positive correlation between context relevancy and model performance.
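
As a concrete illustration of the setup described above, the following is a minimal Python sketch of one benchmark item with its context variants and the multi-grader rubric aggregation. All field names and the toy data are assumptions for illustration only; the paper's actual data schema is not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical labels; the paper pairs each question with contexts of varying relevance.
CONTEXT_TYPES = ["relevant", "vague", "irrelevant", "none"]
CRITERIA = ["correctness", "logic", "truthfulness"]

@dataclass
class BenchmarkItem:
    question: str
    domain: str           # e.g. "physics" or "computer science"
    question_format: str  # "open" or "closed"
    difficulty: int       # assigned difficulty level
    contexts: dict        # context type -> context text ("" when no context is given)

def build_prompt(item: BenchmarkItem, context_type: str) -> str:
    """Assemble the prompt for one context condition."""
    context = item.contexts.get(context_type, "")
    if context:
        return f"Context:\n{context}\n\nQuestion: {item.question}"
    return f"Question: {item.question}"

def aggregate_scores(grades: list[dict]) -> dict:
    """Average each rubric criterion over the independent graders."""
    return {c: mean(g[c] for g in grades) for c in CRITERIA}

# Usage with toy data
item = BenchmarkItem(
    question="Why does a superconductor expel magnetic fields?",
    domain="physics",
    question_format="open",
    difficulty=3,
    contexts={"relevant": "The Meissner effect ...", "irrelevant": "Sorting algorithms ...", "none": ""},
)
print(build_prompt(item, "irrelevant"))
print(aggregate_scores([
    {"correctness": 4, "logic": 5, "truthfulness": 5},
    {"correctness": 3, "logic": 4, "truthfulness": 5},
]))
```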

Numerical Results

The numerical findings of the paper highlight significant discrepancies:

  • For open questions, the overall performance of the model improved when provided with irrelevant or no context, contrary to the expectation that more relevant context would enhance accuracy.
  • This counter-intuitive trend was especially pronounced for questions of higher difficulty and novelty, suggesting complex interactions within the model when handling more challenging inquiries.
  • In closed-form questions, as demonstrated on the MetaICL and NephSAP datasets, more relevant context led to improved performance, corroborating earlier studies that emphasized the utility of relevant context in in-context learning.

Theoretical and Practical Implications

The results have profound implications for both the theoretical understanding of in-context learning and its practical application:

  • Theoretical Implications:
    • The paper underscores the inherent differences in how LLMs process open versus closed questions. In contexts where the model needs to generate open-form responses, simpler or less relevant contexts might reduce cognitive load, aiding the model in solving complex problems without being overly constrained by the context.
    • This insight prompts further investigation into the cognitive mechanisms of LLMs, particularly how they balance context utilization between understanding a problem and generating a response.
  • Practical Implications:
    • For practical applications, the findings indicate that designing retrieval-augmented generation (RAG) systems should be highly task-specific. Depending on whether the task involves closed-form or open-form questions, the strategy for context selection must be tailored to prevent performance degradation.
    • The suggestion of sampling context from regions beyond the immediate vicinity of the query point in embedding space could lead to novel approaches to enhancing LLM performance across practical scenarios; a sketch of such sampling follows this list.
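
Below is a minimal sketch of what such "shell" sampling could look like: instead of taking the nearest neighbors of the query embedding, it keeps passages whose cosine similarity falls inside a band [low, high). The function names, thresholds, and toy embeddings are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def cosine_similarities(query_vec: np.ndarray, doc_matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query embedding and each row of doc_matrix."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q

def retrieve_from_shell(query_vec, doc_matrix, low: float, high: float, k: int = 3):
    """Return indices of up to k documents whose similarity to the query lies in
    the band [low, high) -- a 'shell' around the query rather than its nearest neighborhood."""
    sims = cosine_similarities(query_vec, doc_matrix)
    in_shell = np.where((sims >= low) & (sims < high))[0]
    # Within the shell, prefer the most similar documents.
    order = in_shell[np.argsort(sims[in_shell])[::-1]]
    return order[:k], sims[order[:k]]

# Usage with random toy embeddings; the band boundaries would need tuning per application.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
query = rng.normal(size=384)
idx, scores = retrieve_from_shell(query, docs, low=0.05, high=0.25)
print(idx, scores)
```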

Future Research Directions

This research opens several avenues for future exploration:

  • Refinement of Context Selection Strategies:
    • Further studies could refine the proposed context selection strategies by experimenting with different "shells" of context relevancy. Determining the optimal thickness and range of these shells may yield more precise guidance for RAG systems.
  • Extended Benchmarking Across Domains:
    • Expanding the benchmark to include diverse scientific and non-scientific domains can generalize the findings. Understanding whether these trends hold across disciplines would be vital for developing universally robust LLMs.
  • Exploration of Model Internals:
    • Investigations into the internal processes of LLMs when dealing with varied context types might illuminate how different regions of the model's architecture contribute to its context handling capabilities.

In summary, Xiang Li et al. provide a comprehensive and insightful investigation into the nuances of in-context learning, revealing significant differences between open and closed question processing within LLMs. These findings challenge traditional notions of context relevancy and offer a valuable framework for advancing both the theoretical understanding and practical deployment of AI systems.