
LinkBERT: Pretraining Language Models with Document Links (2203.15827v1)

Published 29 Mar 2022 in cs.CL and cs.LG

Abstract: Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7% on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data at https://github.com/michiyasunaga/LinkBERT.

LinkBERT: Pretraining Language Models with Document Links

In the paper "LinkBERT: Pretraining Language Models with Document Links," the authors introduce a pretraining method for language models (LMs) that leverages inter-document links to improve performance on downstream tasks. The goal is to extend LM capabilities beyond isolated documents by capturing knowledge that spans documents, through links such as hyperlinks in the general domain or citation links in biomedical literature.

Core Methodology

The authors propose LinkBERT, a strategy that treats a text corpus as a graph whose nodes are documents and whose edges are the links between them. During pretraining, linked documents are placed together in the same LM input, in contrast to conventional methods that consider only single-document contexts.
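To make the input construction concrete, here is a minimal sketch assuming a simple in-memory document graph; the Document structure, its field names, and the segment-packing format are illustrative assumptions rather than the authors' released code.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Document:
        # Illustrative document node; the field names are assumptions, not the paper's schema.
        doc_id: str
        segments: list                              # text segments (e.g., passages) in the document
        links: list = field(default_factory=list)   # doc_ids this document hyperlinks to

    def build_linked_input(anchor: Document, corpus: dict, rng: random.Random) -> str:
        # Pair a segment from `anchor` with a segment from a document it links to,
        # mimicking how LinkBERT places linked documents in the same context.
        seg_a = rng.choice(anchor.segments)
        if anchor.links:
            linked_doc = corpus[rng.choice(anchor.links)]
            seg_b = rng.choice(linked_doc.segments)
        else:
            # Fall back to another segment of the same document if no links exist.
            seg_b = rng.choice(anchor.segments)
        # Standard BERT-style packing of two segments into one input sequence.
        return f"[CLS] {seg_a} [SEP] {seg_b} [SEP]"

In practice, such segment pairs would then be tokenized and masked before being fed to the objectives described next.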

Two primary self-supervised objectives are utilized in this framework:

  1. Masked Language Modeling (MLM): As in BERT, MLM trains the model to predict masked tokens within the input sequence, which here can draw on context from linked documents.
  2. Document Relation Prediction (DRP): This new objective trains the model to classify the relationship between the two segments of an input (contiguous, linked, or random), encouraging a deeper understanding of document relevance and relations; a sketch of the joint objective follows this list.
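A minimal sketch of how the two objectives might be trained jointly, assuming a standard BERT-style encoder in PyTorch; the encoder interface, head shapes, and equal weighting of the two losses are illustrative assumptions, not the authors' released implementation.

    import torch.nn as nn
    import torch.nn.functional as F

    class LinkBERTStylePretrainingHeads(nn.Module):
        # Joint MLM + Document Relation Prediction (DRP) heads over a BERT-style encoder.

        def __init__(self, encoder: nn.Module, hidden_size: int, vocab_size: int):
            super().__init__()
            self.encoder = encoder                      # any module returning [batch, seq, hidden]
            self.mlm_head = nn.Linear(hidden_size, vocab_size)
            self.drp_head = nn.Linear(hidden_size, 3)   # contiguous / linked / random

        def forward(self, input_ids, attention_mask, mlm_labels, drp_labels):
            hidden = self.encoder(input_ids, attention_mask)   # [batch, seq, hidden]
            # MLM: predict the original tokens at masked positions (label -100 means "ignore").
            mlm_logits = self.mlm_head(hidden)
            mlm_loss = F.cross_entropy(
                mlm_logits.view(-1, mlm_logits.size(-1)),
                mlm_labels.view(-1),
                ignore_index=-100,
            )
            # DRP: classify the relation of the segment pair from the [CLS] representation.
            drp_logits = self.drp_head(hidden[:, 0])
            drp_loss = F.cross_entropy(drp_logits, drp_labels)
            # Equal weighting of the two losses is an assumption made for this sketch.
            return mlm_loss + drp_loss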

Together, these objectives let the model learn knowledge that spans documents, improving its reasoning and comprehension capabilities.

Empirical Results

The effectiveness of LinkBERT is demonstrated through substantial performance improvements on several NLP tasks, particularly in contexts requiring multi-hop reasoning and comprehension across multiple documents. Key findings include:

  • General Domain: Pretrained on Wikipedia with hyperlinks, LinkBERT shows marked improvements over BERT on the MRQA benchmark and GLUE tasks, particularly excelling in datasets like HotpotQA and TriviaQA, which require reasoning with multiple sources.
  • Biomedical Domain: Pretrained on PubMed with citation links, the biomedical variant, BioLinkBERT, sets new performance standards on the BLURB benchmark and MedQA-USMLE, underscoring its capacity for domain-specific, knowledge-intensive tasks.

Implications and Future Directions

The approach put forth in LinkBERT has significant implications for the development of language models:

  • Enhanced Comprehension Across Documents: By utilizing inter-document links, models become proficient in grasping extended knowledge networks, crucial for domains relying heavily on interconnected information, such as scientific literature and web-based corpora.
  • Versatile Pretraining Framework: LinkBERT provides a versatile structure that can be adapted to various linkage types beyond hyperlinks and citations, potentially extending to other domains where document relations are prevalent.
  • Applications in Retrieval-Augmented Systems: The document relation understanding fostered by DRP can benefit retrieval-augmented systems, enhancing tasks like open-domain question answering, where discerning relevant context from a mix of documents is vital.

Conclusion

LinkBERT introduces an innovative direction in LM pretraining by incorporating document links, demonstrating significant performance gains across multiple domains and tasks. The method not only enriches the knowledge captured by LMs but also offers a robust pathway for future work on graph-augmented language models. This paves the way for deeper inter-document comprehension, laying a solid foundation for advances in information-rich environments and domain-specific applications.

Authors (3)
  1. Michihiro Yasunaga (48 papers)
  2. Jure Leskovec (233 papers)
  3. Percy Liang (239 papers)
Citations (317)