
Pre-training for Information Retrieval: Are Hyperlinks Fully Explored? (2209.06583v1)

Published 14 Sep 2022 in cs.IR, cs.AI, and cs.CL

Abstract: Recent years have witnessed great progress in applying pre-trained language models, e.g., BERT, to information retrieval (IR) tasks. Hyperlinks, which are commonly used in Web pages, have been leveraged for designing pre-training objectives. For example, anchor texts of hyperlinks have been used to simulate queries, thus constructing a tremendous number of query-document pairs for pre-training. However, as a bridge between two web pages, the potential of hyperlinks has not been fully explored. In this work, we focus on modeling the relationship between two documents that are connected by hyperlinks and designing a new pre-training objective for ad-hoc retrieval. Specifically, we categorize the relationships between documents into four groups: no link, unidirectional link, symmetric link, and the most relevant symmetric link. By comparing two documents sampled from adjacent groups, the model can gradually improve its capability of capturing matching signals. We propose a progressive hyperlink prediction (PHP) framework to explore the utilization of hyperlinks in pre-training. Experimental results on two large-scale ad-hoc retrieval datasets and six question-answering datasets demonstrate its superiority over existing pre-training methods.
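
The abstract's core idea, comparing documents drawn from adjacent link-relation groups so the model faces progressively harder matching distinctions, can be illustrated with a small sketch. The snippet below is an assumption-laden illustration, not the authors' PHP implementation: ToyMatcher, progressive_pair_loss, the 32-dimensional random features, and the use of a margin ranking loss are hypothetical stand-ins for the paper's BERT-based encoder and its actual pre-training objective.

```python
# Minimal sketch (not the paper's code) of progressive pairwise comparison:
# candidate documents for an anchor are bucketed into four link groups
# (0 = no link, 1 = unidirectional, 2 = symmetric, 3 = most relevant symmetric),
# and a margin ranking loss pushes the score of a document from group g+1
# above that of a document from the adjacent, less related group g.
import torch
import torch.nn as nn

class ToyMatcher(nn.Module):
    """Stand-in relevance scorer; PHP itself builds on a BERT-style encoder."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, anchor, doc):
        # Dot-product matching score between anchor and candidate document vectors.
        return (self.proj(anchor) * doc).sum(dim=-1)

def progressive_pair_loss(model, anchor, docs_by_group, margin=1.0):
    """docs_by_group: list of 4 tensors, one batch of document vectors per link
    group, ordered from least related (no link) to most related (top symmetric)."""
    rank_loss = nn.MarginRankingLoss(margin=margin)
    total = 0.0
    for g in range(len(docs_by_group) - 1):
        neg = model(anchor, docs_by_group[g])      # adjacent "easier" group
        pos = model(anchor, docs_by_group[g + 1])  # adjacent "harder" group
        target = torch.ones_like(pos)              # require pos > neg + margin
        total = total + rank_loss(pos, neg, target)
    return total

# Toy usage with random features standing in for encoded documents.
model = ToyMatcher()
anchor = torch.randn(8, 32)
groups = [torch.randn(8, 32) for _ in range(4)]
loss = progressive_pair_loss(model, anchor, groups)
loss.backward()
```

Comparing only adjacent groups, rather than contrasting every group against "no link", is what gives the curriculum its progressive character: each pair differs by a single step in link strength, so the matching signal the model must learn becomes finer as the groups become more relevant.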

Authors (10)
  1. Jiawen Wu (9 papers)
  2. Xinyu Zhang (296 papers)
  3. Yutao Zhu (63 papers)
  4. Zheng Liu (312 papers)
  5. Zikai Guo (3 papers)
  6. Zhaoye Fei (15 papers)
  7. Ruofei Lai (13 papers)
  8. Yongkang Wu (12 papers)
  9. Zhao Cao (36 papers)
  10. Zhicheng Dou (113 papers)
Citations (5)
