AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark (2412.13102v3)

Published 17 Dec 2024 in cs.IR and cs.CL

Abstract: Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by LLMs without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.

Summary

  • The paper introduces AIR-Bench’s core innovation: an automated, LLM-driven pipeline that generates heterogeneous IR benchmark datasets without manual labeling.
  • It covers 2 tasks, 9 domains, and 13 languages across 69 datasets, demonstrating broad, practical coverage for comprehensive IR model evaluation.
  • The paper validates dataset quality through stringent control measures and strong correlations with human-labeled data, underscoring its reliability for model assessment.

Automated Evaluation with AIR-Bench for Information Retrieval Models

The development and evaluation of Information Retrieval (IR) models depend heavily on the availability of comprehensive, sophisticated benchmarks. Traditional benchmarks often face limitations due to their reliance on predefined domains and manually labeled data, which can hinder their effectiveness in evaluating models on newly emerging domains. To address these limitations, the authors introduce AIR-Bench, an Automated Heterogeneous Information Retrieval Benchmark that brings several innovations to IR model evaluation.

The AIR-Bench framework is built around three core features that distinguish it from existing benchmarks: automation, heterogeneity, and dynamism. The automated aspect leverages LLMs to generate test data without manual intervention, providing a cost-efficient and fast way to support evaluation in new domains. The heterogeneous nature of AIR-Bench allows it to span diverse tasks, domains, and languages, currently covering 2 tasks, 9 domains, and 13 languages across 69 datasets. This extensive coverage facilitates robust model evaluation in varied scenarios, potentially accelerating progress in both established and nascent domains. Its dynamic character ensures that the covered domains and languages are regularly augmented, keeping the benchmark adaptable and increasingly comprehensive.
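
To make that coverage concrete, the following is a minimal, hypothetical sketch of how a single retriever could be scored across heterogeneous (task, domain, language) splits. The `load_split` callable and the retriever's `index`/`search` methods are illustrative assumptions rather than the official AIR-Bench API; only the nDCG@10 computation is spelled out.

```python
import math


def ndcg_at_10(ranked_doc_ids, relevant_ids):
    """Binary-relevance nDCG@10 for a single query (rank is 0-indexed)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:10])
        if doc_id in relevant_ids
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(10, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0


def evaluate(retriever, splits, load_split):
    """Average nDCG@10 per (task, domain, language) split.

    `load_split` is any callable returning (corpus, queries, qrels) for a split;
    the retriever's `index`/`search` methods are assumed interfaces used here
    purely for illustration, not part of any released package.
    """
    results = {}
    for task, domain, lang in splits:
        corpus, queries, qrels = load_split(task, domain, lang)
        retriever.index(corpus)
        per_query = [
            ndcg_at_10(retriever.search(query, top_k=10), qrels[query_id])
            for query_id, query in queries.items()
        ]
        results[(task, domain, lang)] = sum(per_query) / len(per_query)
    return results
```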

The authors develop a reliable and robust data generation pipeline that produces evaluation data grounded in real-world corpora, ensuring the LLM-generated data aligns with human-labeled data. The pipeline, which encompasses preparation, candidate generation, and quality control, underpins AIR-Bench's credibility as a benchmark. A consistency analysis between LLM-generated and human-labeled datasets shows a strong correlation (as measured by the Spearman rank correlation coefficient between model rankings on the two test sets), indicating that AIR-Bench-generated datasets are valid for practical evaluation. In addition, the pipeline applies comprehensive quality control using a mix of embedding and reranking models, which significantly improves dataset quality.
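
As a concrete illustration of that consistency analysis, the sketch below ranks a handful of retrievers by their scores on an LLM-generated test set and on a human-labeled counterpart, then compares the two rankings with Spearman's rank correlation via scipy. The model names and scores are placeholders, not results reported in the paper.

```python
from scipy.stats import spearmanr

# Hypothetical nDCG@10 scores for the same retrievers on both test sets.
generated_scores = {"model_a": 0.52, "model_b": 0.47, "model_c": 0.61, "model_d": 0.39}
human_labeled_scores = {"model_a": 0.55, "model_b": 0.44, "model_c": 0.63, "model_d": 0.41}

models = sorted(generated_scores)
rho, p_value = spearmanr(
    [generated_scores[m] for m in models],
    [human_labeled_scores[m] for m in models],
)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")

# A rho close to 1 means the generated test set ranks models in nearly the same
# order as the human-labeled one, which is the alignment AIR-Bench reports.
```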

In practical terms, AIR-Bench not only broadens the spectrum of task-specific datasets beyond MTEB/BEIR but also proves adept at distinguishing the capabilities of different IR models. Notably, AIR-Bench reflects the performance improvements of models fine-tuned for specific domains, suggesting its utility for domain-specific evaluation. This differentiation highlights the benchmark's robustness and usefulness for comparing models of varied architectures, sizes, and amounts of training data across domains and languages, an aspect less thoroughly explored by MTEB/BEIR.

While AIR-Bench’s contributions are substantial, some challenges remain, including the dependence on large, real-world corpora for dataset generation and potential biases inherited from the LLMs and quality-control models used in the pipeline. Future development will need to address these aspects to further strengthen AIR-Bench’s versatility and neutrality.

AIR-Bench stands as a significant advancement in providing a comprehensive, versatile IR evaluation framework, empowering community developers by enabling vast and scalable evaluations across multiple dimensions. Its ongoing expansion promises even greater contributions to the field, inviting collaborative enhancements and continuous adaptation to the ever-evolving landscape of information retrieval and artificial intelligence.
