Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts (2305.02320v1)

Published 3 May 2023 in cs.IR

Abstract: We investigate the usefulness of generative LLMs in generating training data for cross-encoder re-rankers in a novel direction: generating synthetic documents instead of synthetic queries. We introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of models fine-tuned on LLM-generated and human-generated data. Data generated with generative LLMs can be used to augment training data, especially in domains with smaller amounts of labeled data. We build ChatGPT-RetrievalQA based on an existing dataset, the Human ChatGPT Comparison Corpus (HC3), consisting of public question collections with human responses and answers from ChatGPT. We fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Our evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 demonstrates that cross-encoder re-ranking models trained on ChatGPT responses are statistically significantly more effective zero-shot re-rankers than those trained on human responses. In a supervised setting, the human-trained re-rankers outperform the LLM-trained re-rankers. Our novel findings suggest that generative LLMs have high potential in generating training data for neural retrieval models. Further work is needed to determine the effect of factually wrong information in the generated responses and to test our findings' generalizability with open-source LLMs. We release our data, code, and cross-encoder checkpoints for future work.

Exploiting Generative LLMs for Cross-Encoder Re-Rankers: An Analytical Perspective

The paper under review explores the potential of generative LLMs, notably ChatGPT, for improving the training of cross-encoder re-rankers in information retrieval. Central to this investigation is a novel direction: generating synthetic documents rather than the widely explored synthetic queries to augment the training data of retrieval models. The work examines how LLM-generated content can rival or even surpass human-generated data in certain training scenarios, offering fresh insights into data augmentation strategies for neural retrieval models.
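To make the document-side generation concrete, here is a minimal sketch of how a synthetic answer document could be produced for a question. The paper itself builds on ChatGPT answers already collected in HC3, so the API call, model name, and prompt below are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch: given a user question, an LLM writes a synthetic answer
# document that can later serve as a positive training passage for a
# cross-encoder re-ranker. Assumes the OpenAI Python client is installed
# and OPENAI_API_KEY is set; model name and prompt are illustrative only.
from openai import OpenAI

client = OpenAI()

def generate_synthetic_document(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for ChatGPT; any instruction-tuned LLM works
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Each (question, generated answer) pair becomes a positive training example;
# negatives can be sampled from answers to other questions.
print(generate_synthetic_document("What causes ocean tides?"))
```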

Methodological Overview

The authors introduce "ChatGPT-RetrievalQA," a newly constructed dataset derived from the pre-existing HC3 dataset. It pairs ChatGPT-generated answers with human-written responses to the same public question collections, organized for retrieval tasks in both full-ranking and re-ranking setups. The methodology involves fine-tuning cross-encoder re-rankers on either the human-generated or the ChatGPT-generated responses. These re-rankers are then evaluated on benchmark datasets such as MS MARCO DEV, TREC DL'19, and TREC DL'20, in both supervised and zero-shot settings.
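A rough sketch of this fine-tuning setup, using the sentence-transformers CrossEncoder API, follows; the base model, hyperparameters, and toy training pairs are assumptions for illustration rather than the paper's exact configuration.

```python
# Sketch: fine-tune a cross-encoder re-ranker on (question, response, label)
# pairs, where responses come from either humans or ChatGPT.
# Assumptions: base model and hyperparameters are illustrative defaults.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# label 1.0 = relevant response to the question, 0.0 = sampled negative
train_samples = [
    InputExample(texts=["What causes ocean tides?", "Tides are caused by ..."], label=1.0),
    InputExample(texts=["What causes ocean tides?", "The French Revolution ..."], label=0.0),
]

model = CrossEncoder("microsoft/MiniLM-L12-H384-uncased", num_labels=1)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=100,
)
model.save("ce-chatgpt-trained")  # or "ce-human-trained" for the human split
```

Training one model per response source (human vs. ChatGPT) on otherwise identical data is what allows the paper's head-to-head comparison.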

Key Findings

  • Zero-Shot vs Supervised Performance: In zero-shot settings, cross-encoder models trained on ChatGPT-generated responses significantly outperformed their human-trained counterparts across several evaluation metrics (a re-ranking sketch follows this list). This underscores the potential of LLM-generated data in scenarios where supervised training data is scarce or unavailable.
  • Domain-Specific Efficacy: The analysis extends to domain-specific tasks, showing that while human-trained models slightly outperform LLM-trained ones in domain-specific contexts (e.g., medical tasks), the margin is small. LLM-generated content remains competitive, hinting at its applicability across diverse domains.
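In zero-shot use, a fine-tuned cross-encoder simply scores each candidate passage against the query and re-ranks by score. Below is a minimal sketch assuming the checkpoint saved above and a toy candidate list; real evaluation would re-rank first-stage (e.g., BM25) candidates on MS MARCO and score against the TREC DL qrels.

```python
# Sketch: zero-shot re-ranking with a trained cross-encoder, plus a simple
# reciprocal-rank check. The checkpoint path, candidate passages, and
# relevance labels are assumptions for this toy example.
from sentence_transformers.cross_encoder import CrossEncoder

model = CrossEncoder("ce-chatgpt-trained")

query = "what causes ocean tides"
candidates = [
    "Tides result mainly from the gravitational pull of the moon ...",
    "A recipe for sourdough bread ...",
    "The sun also contributes to tidal forces ...",
]

# Score each (query, passage) pair and sort by descending score.
scores = model.predict([(query, passage) for passage in candidates])
ranking = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

# Reciprocal rank of the first relevant passage (relevance assumed known
# here; real evaluation uses qrels and metrics such as MRR@10).
relevant = {candidates[0], candidates[2]}
rr = next((1.0 / (i + 1) for i, (p, _) in enumerate(ranking) if p in relevant), 0.0)
print(f"Reciprocal rank: {rr:.2f}")
```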

Implications and Future Directions

The implications of these findings are multifaceted. Practically, the research affirms the utility of LLMs like ChatGPT for generating training datasets, particularly for improving neural retrieval models under data-sparse conditions. Theoretically, it challenges conventional perspectives on data generation, proposing a pivot toward using LLMs as data synthesizers beyond query generation.

The paper also identifies areas ripe for exploration, such as the impact of factually incorrect or misleading information within LLM-generated content and the generalizability of these findings to open-source LLMs. Additionally, the research signals opportunities to adapt cross-encoder architectures to better exploit LLM-generated data, setting the stage for more robust and efficient retrieval systems.

Conclusion

This paper presents an insightful evaluation of ChatGPT's role in augmenting training data for cross-encoder re-rankers, elucidating both the promise and pitfalls of relying on generative models for synthetic document production. The robust empirical analysis substantiates the capacity of LLMs to enhance retrieval systems, particularly in zero-shot settings, thus enriching the field's understanding of data augmentation strategies. Prospective research could explore the nuanced interactions between LLM-generated content and cross-encoder algorithms, ultimately refining retrieval processes in AI-driven contexts.

Authors (4)
  1. Arian Askari (19 papers)
  2. Mohammad Aliannejadi (85 papers)
  3. Evangelos Kanoulas (79 papers)
  4. Suzan Verberne (57 papers)
Citations (9)