
AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models (2407.08351v1)

Published 11 Jul 2024 in cs.CL and cs.LG

Abstract: Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for LLMs: (i) salience (e.g., knowledge about World War II is more salient than a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem, that of finding benchmarks that satisfy all three desiderata. To tackle this search problem, we present AutoBencher, which uses an LLM to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g., relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize for the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in LLM knowledge that are not captured by existing benchmarks, such as Gemini Pro performing much worse on question answering about the Permian Extinction and Fordism, while OpenAGI-7B performs surprisingly well on QA about COVID-19.

AutoBencher: Creating Salient, Novel, Difficult Datasets for LLMs

The paper "AutoBencher: Creating Salient, Novel, Difficult Datasets for LLMs" by Xiang Lisa Li et al. introduces AutoBencher, an automated system designed to generate evaluation benchmarks for LLMs that satisfy three critical desiderata: salience, novelty, and difficulty. Benchmarking in LLMs is crucial for evaluating model performance, discerning trends, and guiding the development of future models. Traditional benchmarks often fail to account for emerging model weaknesses, revealing a pressing need for adaptive and rigorous benchmarking strategies. This paper addresses such gaps through a metric-driven search algorithm employing LLMs to propose and create datasets meeting the specified desiderata.

Salience, Novelty, and Difficulty

The paper elaborates on the following key metrics:

  1. Salience: A benchmark should test practically important capabilities, such as performance on widely recognized historical events like World War II.
  2. Novelty: A benchmark should reveal new trends in model performance, distinguishing models in ways existing benchmarks do not.
  3. Difficulty: The benchmark should pose significant challenges to current models, leaving room for future improvements.

By formalizing these properties, the authors transform benchmark creation into an optimization problem, resolved through a search for datasets that balance all three requirements.
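
To make this optimization concrete, the sketch below shows one way the three desiderata could be scalarized into a single search objective. The `Candidate` container, the salience threshold, and the equal weighting of difficulty and novelty are illustrative assumptions, not the paper's exact formulation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    topic: str          # e.g. "Permian Extinction"
    salience: float     # in [0, 1]: how practically important the topic is
    difficulty: float   # in [0, 1]: mean error rate of current models on the topic
    novelty: float      # in [0, 1]: how much the topic reshuffles model rankings

def objective(c: Candidate, min_salience: float = 0.5) -> float:
    """Illustrative scalarization: treat salience as a hard constraint,
    then trade off difficulty against novelty with equal weights.
    Both the threshold and the weights are assumptions."""
    if c.salience < min_salience:
        return float("-inf")
    return 0.5 * c.difficulty + 0.5 * c.novelty

def select_best(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate topic that maximizes the combined objective."""
    return max(candidates, key=objective)
```

Framing the three desiderata this way makes clear why a search procedure is needed: the scores for a topic are only known after a dataset for it has been constructed and evaluated, so candidate topics must be proposed, built, and scored before the objective can be compared.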

AutoBencher Framework

AutoBencher leverages LLMs to automate dataset creation, drawing on privileged information to keep answers accurate while still allowing difficult questions. The process involves the following steps (a minimal sketch of the resulting search loop follows the list):

  1. Dataset Construction: The system generates (question, answer) pairs using privileged information like Wikipedia articles for knowledge-intensive questions, translation systems for multilingual questions, and mathematical libraries for math questions. This ensures answers are accurate and provides grounding in reliable sources.
  2. Adaptive Search: AutoBencher performs iterative searches, using a history of proposed topics and their difficulties to guide subsequent topic proposals. This adaptive mechanism aims to iteratively enhance the difficulty and novelty of proposed evaluation topics.
  3. Re-Ranking for Final Selection: After generating datasets, topics are re-ranked based on salience, difficulty, and novelty. This ensures that the final chosen benchmark maximizes the overall objective function.
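
The sketch below illustrates how such a propose-construct-evaluate-rerank loop might be wired together. The callables `propose_topics`, `construct_dataset`, and `evaluate_models` stand in for the LM-driven components described above and are assumed interfaces, not the authors' released code; the salience threshold and the weighting used in the final re-ranking are likewise illustrative.

```python
from typing import Callable

def autobencher_search(
    propose_topics: Callable[[list[dict], int], list[str]],     # LM-based topic proposer (assumed)
    construct_dataset: Callable[[str], list[tuple[str, str]]],  # builds grounded (question, answer) pairs (assumed)
    evaluate_models: Callable[[list[tuple[str, str]]], dict],   # returns salience/difficulty/novelty scores (assumed)
    n_iters: int = 5,
    k_per_iter: int = 10,
) -> dict:
    """Illustrative propose -> construct -> evaluate -> rerank loop."""
    history: list[dict] = []
    for _ in range(n_iters):
        # Propose new topics conditioned on previously explored topics and
        # how difficult/novel they turned out to be (the adaptive step).
        for topic in propose_topics(history, k_per_iter):
            qa_pairs = construct_dataset(topic)   # grounded in privileged information
            scores = evaluate_models(qa_pairs)    # per-topic desiderata scores
            history.append({"topic": topic, "dataset": qa_pairs, **scores})
    # Final re-ranking: keep the explored topic that best balances the desiderata.
    return max(
        (h for h in history if h["salience"] >= 0.5),              # salience threshold (assumed)
        key=lambda h: 0.5 * h["difficulty"] + 0.5 * h["novelty"],  # assumed weighting
    )
```

In this framing, adaptivity comes from feeding the accumulated history back into the proposer, so that later proposals can steer toward the kinds of topics that proved difficult or novel in earlier iterations.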

Experimental Results

AutoBencher was evaluated against existing, human-constructed benchmarks across several domains—history, economics, science, mathematics, and multilingual question answering. The system demonstrated significant enhancements in both novelty and difficulty:

  • Novelty Increase: Datasets produced by AutoBencher were on average 27% more novel, revealing model-performance trends not captured by human-constructed datasets.
  • Difficulty Increase: AutoBencher's datasets were on average 22% more difficult, challenging even state-of-the-art LLMs (a rough sketch of how these two metrics might be quantified follows the list).
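
As a rough illustration (not necessarily the paper's exact metric definitions), difficulty can be read off as the evaluated models' mean error rate on a dataset, and novelty as how much the new dataset reshuffles the model ranking relative to an existing benchmark, e.g. via rank correlation:

```python
from scipy.stats import spearmanr

def difficulty(accuracies: list[float]) -> float:
    """Mean error rate of the evaluated models on the dataset."""
    return 1.0 - sum(accuracies) / len(accuracies)

def novelty(new_accs: list[float], existing_accs: list[float]) -> float:
    """Proxy for novelty: 1 means the new dataset ranks models completely
    differently from the existing benchmark, 0 means the same ranking.
    This rank-correlation proxy is an assumption, not the paper's metric."""
    rho, _ = spearmanr(new_accs, existing_accs)  # per-model accuracy vectors, aligned
    return (1.0 - rho) / 2.0                     # map correlation in [-1, 1] to [0, 1]
```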

Specific examples highlight AutoBencher's ability to uncover unique model weaknesses. For instance, while Gemini Pro performed robustly on existing economics datasets, it struggled with questions about Fordism, a topic on which OpenChat-3.5 showed unexpected strength.

Discussion and Implications

The automated, scalable nature of AutoBencher could have profound implications for the future of LLM benchmarking. Key takeaways include:

  • Enhanced Model Evaluation: AutoBencher can continually generate challenging, novel datasets, thereby providing a sustainable methodology for tracking LLM advancements.
  • Identification of Specific Model Weaknesses: By identifying domains where specific models underperform, AutoBencher highlights areas that need targeted improvement.
  • Scalability and Efficiency: The automation reduces the manual effort involved in benchmark creation, accelerating the feedback loop in model development.

Future Directions

Potential future developments of AutoBencher could explore broader domains, such as model safety or efficiency, extending beyond the capabilities discussed (e.g., knowledge-intensive QA, mathematics). Additionally, relaxing the constraints of domain-specific proposals could enable more creative and comprehensive benchmarking strategies.

Conclusion

AutoBencher represents a significant step forward in the field of LLM evaluation, providing an automated, metric-driven approach to creating salient, novel, and difficult benchmarks. This work not only enhances the current landscape of model evaluation but also introduces a versatile tool that can adapt to the evolving challenges in AI research. Future explorations expanding its utility across diverse domains will likely further consolidate its role in the AI community.

Authors (4)
  1. Xiang Lisa Li (18 papers)
  2. Evan Zheran Liu (13 papers)
  3. Percy Liang (239 papers)
  4. Tatsunori Hashimoto (80 papers)