
SynDL: A Large-Scale Synthetic Test Collection for Passage Retrieval (2408.16312v3)

Published 29 Aug 2024 in cs.IR

Abstract: Large-scale test collections play a crucial role in Information Retrieval (IR) research. However, following the Cranfield paradigm, existing IR studies are commonly developed on small-scale datasets that rely on human assessors for relevance judgments - a time-intensive and expensive process. Recent studies have shown the strong capability of LLMs in producing reliable relevance judgments with human accuracy but at a greatly reduced cost. In this paper, to address the absence of a large-scale ad-hoc document retrieval dataset, we extend the TREC Deep Learning Track (DL) test collection via additional LLM synthetic labels to enable researchers to test and evaluate their search systems at a large scale. Specifically, such a test collection includes more than 1,900 test queries from previous years of the track. We compare system evaluation with past human labels and find that our synthetically created large-scale test collection can lead to highly correlated system rankings.

Summary

  • The paper extends the TREC Deep Learning test collections using GPT-4-generated synthetic relevance judgments for over 1,900 queries.
  • It demonstrates strong correlation between synthetic and human judgments with Kendall’s tau values of up to 0.8571.
  • The work offers a scalable, cost-effective methodology for rigorous evaluation of search systems in passage retrieval tasks.

SynDL: A Large-Scale Synthetic Test Collection

The paper "SynDL: A Large-Scale Synthetic Test Collection" presents an extensive synthetic dataset aimed at addressing fundamental challenges within the Information Retrieval (IR) community, specifically within the context of ad-hoc document and passage retrieval. Developed by Rahmani et al., SynDL leverages the TREC Deep Learning (DL) Track test collections enhanced with synthetic labels generated by LLMs, particularly GPT-4.

Key Contributions

The core contribution of this paper is threefold:

  1. Extension of Existing Test Collections: The paper extends the TREC Deep Learning Track test collections by incorporating synthetic relevance judgments. The result is a comprehensive dataset of over 1,900 test queries - a significantly larger and more diverse query set than previous collections.
  2. Use of LLMs for Judgments: Synthetic relevance labels are generated using GPT-4, providing a cost-effective and scalable alternative to traditional human relevance judgments. The paper claims that these synthetic labels strongly correlate with human labels, offering an efficient solution for large-scale evaluation.
  3. Robust System Evaluation: The synthetic dataset, SynDL, supports rigorous evaluation of search systems on a large scale. With a highly diversified set of queries and deep relevance labels, the dataset facilitates robust system performance assessment and comparison.

Methodology

The development of SynDL follows a structured methodology:

  1. Initial Query Assembly: The initial queries are aggregated from the TREC Deep Learning Tracks (2019-2023), including both human-generated and synthetic queries, resulting in a pool of 1,988 queries.
  2. Assessment Pool Generation: Utilizing the extensive run submissions to the TREC DL Tracks, a depth-10 pool is generated with rich coverage of passages, yielding 637,063 query-passage pairs for relevance assessment (a pooling sketch follows this list).
  3. Automatic Judgment with LLM: GPT-4 is used to assign graded relevance judgments on the four-point TREC scale (irrelevant, related, highly relevant, perfectly relevant), ensuring a nuanced assessment of passage relevance to queries (see the judgment sketch below).
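
To make the pooling step concrete, here is a minimal sketch of depth-k pooling over standard TREC-format run files (qid Q0 passage_id rank score run_tag). The file names below are illustrative placeholders, not artifacts from the paper.

```python
from collections import defaultdict

def build_depth_k_pool(run_files, k=10):
    """Union of the top-k passages from every submitted run, per query.

    Each run file line follows the standard TREC run format:
        qid Q0 passage_id rank score run_tag
    Returns a dict mapping qid -> set of pooled passage ids.
    """
    pool = defaultdict(set)
    for path in run_files:
        with open(path) as f:
            for line in f:
                qid, _, pid, rank, _, _ = line.split()
                if int(rank) <= k:
                    pool[qid].add(pid)
    return pool

# Hypothetical usage: pool two runs at depth 10 and count judgment pairs.
# pool = build_depth_k_pool(["runA.trec", "runB.trec"], k=10)
# n_pairs = sum(len(pids) for pids in pool.values())
```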
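The automatic judgment step can be approximated as below. This is a sketch only: the prompt wording, model name, and label parsing are illustrative assumptions, not the paper's actual prompt. It uses the openai Python client and expects OPENAI_API_KEY in the environment.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative prompt; the paper's actual prompt wording is not reproduced here.
PROMPT = """Judge the relevance of the passage to the query on a 0-3 scale:
0 = irrelevant, 1 = related, 2 = highly relevant, 3 = perfectly relevant.
Answer with a single digit.

Query: {query}
Passage: {passage}"""

def judge(query: str, passage: str, model: str = "gpt-4") -> int:
    """Ask the LLM for a graded relevance label for one query-passage pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic labels aid reproducibility
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
    )
    text = response.choices[0].message.content.strip()
    return int(text[0]) if text[:1].isdigit() else 0  # fall back to 0 on parse failure
```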

Results and Evaluation

The paper reports high correlation between system rankings obtained from SynDL and those derived from human assessments in the TREC DL test collections. Specifically:

  • Correlation Metrics: Kendall’s tau values of 0.8571 for NDCG@10 and 0.8286 for NDCG@100 indicate strong agreement (see the sketch after this list).
  • Top-Performing Systems Agreement: There is consistent identification of top-performing systems across different test collections, evidenced by comparable evaluation metrics (NDCG, AP).
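
For reference, NDCG@k, the headline metric above, discounts graded relevance gains by rank position and normalizes by the ideal ordering. A standard formulation, with $rel_i$ the graded label of the passage at rank $i$:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$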
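To illustrate the correlation check itself, the sketch below computes Kendall's tau between the system rankings induced by per-system NDCG@10 scores under human and synthetic judgments. The score lists are made-up placeholders, not values from the paper; scipy.stats.kendalltau does the work.

```python
from scipy.stats import kendalltau

# Hypothetical per-system NDCG@10 scores, aligned by system, under
# human (TREC DL) and synthetic (SynDL) judgments. Placeholder values.
human_ndcg = [0.72, 0.68, 0.65, 0.61, 0.55, 0.49]
syndl_ndcg = [0.74, 0.66, 0.67, 0.60, 0.52, 0.50]

# Kendall's tau compares the orderings the two score lists induce:
# tau = 1 means identical system rankings, tau = -1 means fully reversed.
tau, p_value = kendalltau(human_ndcg, syndl_ndcg)
print(f"Kendall's tau = {tau:.4f} (p = {p_value:.4f})")
```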

Furthermore, the paper addresses potential bias by comparing the performance of systems built on the same LLMs as those used to generate the synthetic queries. The analysis reveals no significant bias, supporting the fairness and robustness of SynDL.

Implications and Future Developments

Practical Implications

SynDL has several practical implications for the IR community:

  • It provides a scalable, cost-effective alternative to human relevance judgments.
  • It allows for comprehensive and rigorous evaluation of search systems, facilitating the development and benchmarking of advanced retrieval models.
  • The inclusion of both real and synthetic queries enhances the versatility of the dataset, supporting a wide range of IR research.

Theoretical Implications

From a theoretical perspective, SynDL:

  • Supports the validation of LLMs in generating high-quality relevance judgments.
  • Encourages further exploration into synthetic data generation techniques and their applications in IR.
  • Opens avenues for research into the comparative efficacy of human vs synthetic datasets in system evaluation.

Speculations for Future AI Developments

Looking ahead, the continued evolution of LLMs may further improve the quality and granularity of synthetic relevance judgments, potentially leading to:

  • More nuanced synthetic test collections with deeper contextual understanding.
  • Integration of multimodal data (text, images, etc.) in IR evaluation.
  • Development of generalized models capable of few-shot or zero-shot learning for diverse IR tasks.

Conclusion

The paper "SynDL: A Large-Scale Synthetic Test Collection" presents a significant step forward in the field of Information Retrieval. By leveraging LLMs to generate synthetic relevance judgments, it addresses existing challenges in scale, diversity, and depth of test collections. The SynDL dataset stands to be an invaluable resource for researchers, enabling rigorous, scalable, and cost-effective evaluation of search systems while paving the way for future advancements in synthetic data generation and application in IR.