- The paper introduces In-Context Transferability (ICT), a metric that uses exemplar contexts to estimate how well one task transfers to another across benchmarks.
- It reduces the number of tasks in benchmarks such as MMLU and FLAN by up to 95% while keeping evaluation results within a 4% margin of the full benchmark.
- The method combines spectral clustering with facility location optimization to group related tasks and select a representative subset, enabling cost-effective LLM evaluation.
In-Context Transferability for Benchmark Task Reduction
The paper presents a method for efficiently reducing the number of tasks in LLM benchmarks without compromising evaluation quality, using in-context learning to estimate task transferability. Evaluating LLMs on extensive benchmarks is costly; the proposed task reduction methodology addresses this by selecting a small subset of tasks informed by their transferability and relevance.
Overview of the Proposed Method
The paper introduces In-Context Transferability (ICT), a metric that estimates transferability between tasks using exemplar contexts. By assessing pairwise transferability, the authors show that the task sets of modern LLM benchmarks such as MMLU and FLAN can be reduced to 5% of their original size with less than a 4% difference in evaluation outcomes relative to the full benchmark. The method is training-free, requiring only in-context learning (ICL).
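To make the metric concrete, here is a minimal sketch of how such a pairwise transferability matrix could be assembled. The evaluation hooks `icl_score` and `zero_shot_score`, and the gain-over-zero-shot definition, are assumptions for illustration rather than the paper's exact formulation:

```python
import numpy as np

def ict_matrix(tasks, icl_score, zero_shot_score):
    """Assemble a pairwise In-Context Transferability matrix.

    `icl_score(src, tgt)` and `zero_shot_score(tgt)` are hypothetical
    evaluation hooks: the first returns the model's accuracy on task
    `tgt`'s queries when prompted with task `src`'s exemplars, the
    second its zero-shot accuracy on `tgt`.
    """
    n = len(tasks)
    T = np.zeros((n, n))
    # Zero-shot accuracy per target task serves as the baseline.
    base = np.array([zero_shot_score(t) for t in tasks])
    for i, src in enumerate(tasks):
        for j, tgt in enumerate(tasks):
            # Entry (i, j): gain on task j from using task i's exemplars.
            T[i, j] = icl_score(src, tgt) - base[j]
    return T
```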
Task Transferability Analysis
The paper analyzes task transferability, showing that exemplars from one task can improve a model's performance on a related task. Transferability is quantified in a matrix whose entries record the performance change when one task's exemplars are used as context for another task's queries. Visualizing this matrix reveals inherent clustering patterns among tasks, and spectral clustering is then applied to identify clusters of thematically related tasks.
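A minimal sketch of the clustering step, using scikit-learn's `SpectralClustering` on a precomputed affinity. The symmetrization and nonnegativity shift applied to the (generally asymmetric, possibly negative) transferability matrix are assumptions made so it qualifies as an affinity:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_tasks(T, n_clusters=8):
    """Group tasks by spectral clustering on the transferability matrix.

    T is the n x n ICT matrix; n_clusters is an assumed hyperparameter.
    """
    # Symmetrize and shift: a precomputed affinity for scikit-learn
    # must be symmetric and nonnegative, while T is generally neither.
    A = (T + T.T) / 2.0
    A = A - A.min()
    labels = SpectralClustering(
        n_clusters=n_clusters, affinity="precomputed", random_state=0
    ).fit_predict(A)
    return labels
```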
Benchmark Task Reduction
Task reduction is framed as a facility location (FL) problem: choose a subset of tasks that maximizes similarity to the original benchmark, where the similarity is derived from the task transferability matrix and refined with Laplacian Eigenmaps for robustness. Because the FL objective is monotone and submodular, a simple greedy algorithm yields a solution with a (1 - 1/e) approximation guarantee, enabling a large reduction in tasks while maintaining high evaluation accuracy.
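The sketch below illustrates both pieces under stated assumptions: a Laplacian Eigenmaps refinement of the similarity matrix via scikit-learn's `SpectralEmbedding` (the embedding dimension and cosine similarity are illustrative choices), followed by the standard greedy algorithm for the monotone submodular facility-location objective f(A) = sum_j max_{i in A} S[i, j]:

```python
import numpy as np
from sklearn.manifold import SpectralEmbedding

def refined_similarity(T, dim=8):
    """Refine the transferability matrix into a task-similarity matrix
    via a Laplacian Eigenmaps embedding (dim and the cosine similarity
    are illustrative choices, not the paper's exact settings)."""
    A = (T + T.T) / 2.0
    A = A - A.min()  # precomputed affinities must be nonnegative
    X = SpectralEmbedding(n_components=dim, affinity="precomputed").fit_transform(A)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return (X @ X.T + 1.0) / 2.0  # cosine similarity rescaled to [0, 1]

def greedy_facility_location(S, k):
    """Greedily select k tasks maximizing f(A) = sum_j max_{i in A} S[i, j]."""
    n = S.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity of each task to the chosen set
    for _ in range(k):
        # Marginal gain of adding each candidate as a new "facility".
        gains = np.maximum(S, coverage).sum(axis=1) - coverage.sum()
        gains[selected] = -np.inf  # do not re-select
        best = int(np.argmax(gains))
        selected.append(best)
        coverage = np.maximum(coverage, S[best])
    return selected
```

Under these assumptions, running `greedy_facility_location(refined_similarity(T), k)` with k set to roughly 5% of the task count would yield the reduced benchmark.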
Experimental Results
The paper provides experimental evidence that the approach consistently outperforms baseline methods, including random selection and text-similarity-based selection using BM25. On benchmarks such as MMLU and FLAN, the proposed method achieves large task reductions with only minor losses in evaluation accuracy across various LLMs. The analysis further shows that ICT-based task clustering and reduction outperforms human-judgment-style selection, such as task groupings produced by GPT-4.
Implications and Future Directions
The work carries both practical and theoretical implications for LLM evaluation. Practically, the reduction method offers a cost-effective means of rapidly assessing LLM capabilities, which matters for fast model development cycles. Theoretically, the insights into task transferability and clustering contribute to the understanding of task dependencies in multi-task learning.
Looking ahead, applying ICT across different model types and broader task sets could strengthen its robustness and applicability in diverse real-world scenarios. Exploring strategies that set hyperparameters adaptively based on dataset characteristics could further refine the method.
In conclusion, the paper proposes a well-substantiated approach to efficiently reduce LLM benchmark tasks using in-context learning and task transferability insights, setting a foundation for scalable and cost-efficient model evaluation in AI research.