PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval (2010.10137v3)

Published 20 Oct 2020 in cs.IR

Abstract: Recently, pre-trained language representation models such as BERT have shown great success when fine-tuned on downstream tasks including information retrieval (IR). However, pre-training objectives tailored for ad-hoc retrieval have not been well explored. In this paper, we propose Pre-training with Representative wOrds Prediction (PROP) for ad-hoc retrieval. PROP is inspired by the classical statistical language model for IR, specifically the query likelihood model, which assumes that the query is generated as the piece of text representative of the "ideal" document. Based on this idea, we construct the representative words prediction (ROP) task for pre-training. Given an input document, we sample a pair of word sets according to the document language model, where the set with higher likelihood is deemed as more representative of the document. We then pre-train the Transformer model to predict the pairwise preference between the two word sets, jointly with the Masked Language Model (MLM) objective. By further fine-tuning on a variety of representative downstream ad-hoc retrieval tasks, PROP achieves significant improvements over baselines without pre-training or with other pre-training methods. We also show that PROP can achieve exciting performance under both the zero- and low-resource IR settings. The code and pre-trained models are available at https://github.com/Albert-Ma/PROP.

Authors (6)
  1. Xinyu Ma (49 papers)
  2. Jiafeng Guo (161 papers)
  3. Ruqing Zhang (60 papers)
  4. Yixing Fan (55 papers)
  5. Xiang Ji (71 papers)
  6. Xueqi Cheng (274 papers)
Citations (94)

Summary

Overview of PROP: Pre-training with Representative Words Prediction for Ad-hoc Retrieval

The paper introduces PROP, a novel pre-training approach specifically designed to enhance performance on ad-hoc retrieval tasks. Traditional pre-trained language models such as BERT, while successful in many NLP applications, are not specifically tailored for information retrieval (IR). PROP seeks to bridge this gap with a pre-training objective that directly reflects the relevance relationships inherent in ad-hoc retrieval.

Methodology

PROP's central innovation is the Representative Words Prediction (ROP) task, rooted in classical statistical language modeling for IR, particularly the query likelihood model. The query likelihood model assumes that a user's query is a representative piece of text generated from an "ideal" document. Building on this idea, PROP's approach involves two main components:

  1. Representative Word Sets Sampling: From each document in the pre-training corpus, a pair of word sets is sampled according to the document's language model, estimated as a multinomial unigram model with Dirichlet-prior smoothing. The likelihood of each word set under the document model is computed, and the higher-likelihood set is deemed more representative, mirroring how an "ideal" query would pick out representative words from the document (a sketch of this sampling step follows the list).
  2. Transformers for Pairwise Preference: A Transformer model is pre-trained to predict which of the two word sets is more representative of the original document. This ROP task is trained jointly with the Masked Language Model (MLM) objective, enhancing the model's ability to recognize and use contextual cues in retrieval scenarios (a sketch of the joint objective appears below).
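
The following is a minimal sketch of the word-set sampling step, assuming pre-tokenized documents and collection-level statistics; the helper names, the fixed set size, and the Dirichlet prior value mu are illustrative choices rather than the paper's exact implementation.

```python
import math
import random
from collections import Counter

def dirichlet_lm(doc_tokens, corpus_freq, corpus_len, mu=2000.0):
    """Dirichlet-smoothed unigram model: P(w|D) = (c(w,D) + mu*P(w|C)) / (|D| + mu)."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    def prob(word):
        p_corpus = corpus_freq.get(word, 0) / corpus_len   # background collection probability
        return (counts.get(word, 0) + mu * p_corpus) / (doc_len + mu)
    return prob

def make_rop_pair(doc_tokens, corpus_freq, corpus_len, set_size=5):
    """Sample two word sets from the document language model and order them by likelihood."""
    prob = dirichlet_lm(doc_tokens, corpus_freq, corpus_len)
    vocab = sorted(set(doc_tokens) | set(corpus_freq))
    weights = [prob(w) for w in vocab]
    s1 = random.choices(vocab, weights=weights, k=set_size)   # sampling with replacement
    s2 = random.choices(vocab, weights=weights, k=set_size)
    ll1 = sum(math.log(prob(w)) for w in s1)
    ll2 = sum(math.log(prob(w)) for w in s2)
    # The set with the higher likelihood under the document model is treated as more representative.
    return (s1, s2) if ll1 >= ll2 else (s2, s1)
```

Each (more representative, less representative) pair produced this way, together with its source document, becomes one ROP pre-training instance.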

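Below is a hedged sketch of the joint ROP + MLM objective, assuming a BERT-style encoder from the Hugging Face transformers library; the linear scoring head, the hinge margin of 1, and the unweighted sum of the two losses are assumptions made for illustration, not the paper's exact training code.

```python
import torch
import torch.nn as nn
from transformers import BertForMaskedLM, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
rop_head = nn.Linear(model.config.hidden_size, 1)   # scores how representative a word set is of a document

def rop_score(word_set, document):
    """Encode the (word set, document) pair and score it from the [CLS] representation."""
    enc = tokenizer(" ".join(word_set), document, return_tensors="pt",
                    truncation=True, max_length=512)
    hidden = model.bert(**enc).last_hidden_state    # underlying encoder of the MLM model
    return rop_head(hidden[:, 0]).squeeze(-1)       # [CLS] vector -> scalar score

def joint_loss(pos_set, neg_set, document, mlm_input_ids, mlm_labels):
    """Pairwise hinge loss over the two word sets plus the standard MLM loss."""
    s_pos = rop_score(pos_set, document)
    s_neg = rop_score(neg_set, document)
    rop_loss = torch.clamp(1.0 - s_pos + s_neg, min=0.0).mean()
    mlm_loss = model(input_ids=mlm_input_ids, labels=mlm_labels).loss
    return rop_loss + mlm_loss
```

A margin-based pairwise loss is one standard way to realize the "predict which word set is more representative" objective; the released code at https://github.com/Albert-Ma/PROP is the authoritative reference for the exact formulation.
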
Experimental Results

Evaluation across several benchmark ad-hoc retrieval datasets—Robust04, ClueWeb09-B, Gov2, MQ2007, and MQ2008—shows significant performance improvements when using PROP versus other pre-training methods or baselines like BM25 and BERT. Notable findings include:

  • Performance Across Domains: PROP delivers substantial improvements across datasets with different document types, such as news articles and web pages, indicating that its pre-training approach generalizes well because it does not depend on document-specific structures like hyperlinks.
  • Zero- and Low-Resource Settings: One of PROP's key advantages is its proficiency in zero- and low-resource environments, where labeled training data is scarce. PROP maintains competitive performance even with limited fine-tuning data, showcasing its potential for practical applications where data acquisition is challenging.

Implications and Future Research

This research has several implications:

  • Theoretical Integration with IR Practices: By leveraging the theoretical underpinnings of classical IR models, PROP offers a systematic method to pre-train models that directly enhance ad-hoc retrieval tasks, indicating a fruitful direction that harmonizes deep learning with established IR concepts.
  • Expansion to Other IR Tasks: While focused on document retrieval, the methodology could be extended to other IR-related tasks such as passage retrieval and dialogue systems, offering a unified pre-training paradigm.
  • Impact on Supervision Efficiency: PROP demonstrates the possibility of achieving high retrieval effectiveness with minimal supervision, which might push the boundaries of semi-supervised or unsupervised learning in IR contexts.

In conclusion, PROP provides a nuanced and theoretically grounded method for pre-training language models for IR. It aligns deep learning techniques with traditional IR theory to produce models that deliver improved retrieval performance without relying heavily on data-specific structures. Future work could investigate further optimization of the ROP objective and explore its application across a wider variety of retrieval tasks.
