
Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model (1912.09637v1)

Published 20 Dec 2019 in cs.CL

Abstract: Recent breakthroughs of pretrained language models have shown the effectiveness of self-supervised learning for a wide range of NLP tasks. In addition to standard syntactic and semantic NLP tasks, pretrained models achieve strong improvements on tasks that involve real-world knowledge, suggesting that large-scale language modeling could be an implicit method to capture knowledge. In this work, we further investigate the extent to which pretrained models such as BERT capture knowledge using a zero-shot fact completion task. Moreover, we propose a simple yet effective weakly supervised pretraining objective, which explicitly forces the model to incorporate knowledge about real-world entities. Models trained with our new objective yield significant improvements on the fact completion task. When applied to downstream tasks, our model consistently outperforms BERT on four entity-related question answering datasets (i.e., WebQuestions, TriviaQA, SearchQA and Quasar-T) with an average 2.7 F1 improvement and a standard fine-grained entity typing dataset (i.e., FIGER) with a 5.7 accuracy gain.

Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model

The research article titled "Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model" presents a novel approach to enhancing the ability of pretrained language models to encode and retrieve real-world encyclopedic knowledge. The work is motivated by the observation that self-supervised pretrained models such as BERT perform impressively on tasks requiring real-world knowledge, in addition to typical syntactic and semantic language understanding.

Weakly Supervised Pretraining Methodology

The central proposition of this work is a weakly supervised pretraining objective called entity replacement training. Unlike approaches that depend heavily on structured knowledge bases, this method works directly on unstructured text such as Wikipedia. Entity mentions in the text are recognized and, for negative samples, replaced with other entities of the same type, forcing the model to distinguish factually correct mentions from corrupted ones. This entity-centric objective encourages the model to acquire entity-level knowledge during pretraining.
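To make the mechanics concrete, the Python sketch below shows how entity replacement training examples could be constructed from entity-linked text. The type-indexed entity dictionary, the mention format, and the replacement probability are illustrative assumptions, not the authors' exact pipeline.

```python
import random

# Hypothetical type-indexed entity dictionary; in practice such a dictionary
# could be built from Wikipedia anchor links and entity type annotations.
ENTITIES_BY_TYPE = {
    "person": ["Marie Curie", "Alan Turing", "Ada Lovelace"],
    "city": ["Paris", "Warsaw", "London"],
}

def make_replacement_examples(text, mentions, replace_prob=0.5):
    """Return (text, mention, label) triples.

    label = 1: the original mention was kept (factually consistent context).
    label = 0: the mention was swapped for another entity of the same type,
               producing a corrupted statement the model must learn to reject.
    """
    examples = []
    for surface, ent_type in mentions:
        if random.random() < replace_prob:
            candidates = [e for e in ENTITIES_BY_TYPE[ent_type] if e != surface]
            replacement = random.choice(candidates)
            corrupted = text.replace(surface, replacement, 1)
            examples.append((corrupted, replacement, 0))
        else:
            examples.append((text, surface, 1))
    return examples

sent = "Marie Curie was born in Warsaw."
mentions = [("Marie Curie", "person"), ("Warsaw", "city")]
for ex in make_replacement_examples(sent, mentions):
    print(ex)
```

Restricting replacements to entities of the same type keeps the corrupted sentences grammatically and topically plausible, so the model cannot rely on surface cues alone and must use entity knowledge to detect the swap.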

Evaluation and Results

The efficacy of the proposed model (referred to as WKLM) is evaluated with a zero-shot fact completion task covering ten common Wikidata relations, on which WKLM clearly outperforms BERT and GPT-2. Fact completion serves as a proxy for how well the model can fill in missing entity-based knowledge, akin to knowledge base completion.
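As an illustration of how such a probe can work, the sketch below frames fact completion as a cloze query scored by an off-the-shelf masked language model (bert-base-uncased via Hugging Face Transformers). The query template and candidate set are illustrative, not the paper's exact Wikidata prompts, and only single-token candidates are handled.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def rank_candidates(query, candidates):
    """Rank single-token candidate entities for the [MASK] slot in `query`."""
    inputs = tokenizer(query, return_tensors="pt")
    # Position of the [MASK] token in the input sequence.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    scored = []
    for cand in candidates:
        cand_id = tokenizer.convert_tokens_to_ids(cand.lower())
        scored.append((cand, log_probs[cand_id].item()))
    return sorted(scored, key=lambda x: x[1], reverse=True)

print(rank_candidates("Marie Curie was born in [MASK].", ["Warsaw", "Paris", "London"]))
```

Multi-token entity names would need a different scoring scheme (e.g., summing log-probabilities over several masked positions), which is one reason such probes are usually restricted to a controlled candidate set.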

WKLM is further tested on real-world applications requiring entity knowledge: open-domain question answering and fine-grained entity typing. Across multiple question answering datasets such as WebQuestions, TriviaQA, Quasar-T, and SearchQA, WKLM yields notable improvements over BERT models, particularly when incorporating paragraph ranking scores. Furthermore, WKLM sets a new standard on the FIGER dataset for entity typing, demonstrating the practical advantages of the proposed pretraining strategy in capturing and utilizing entity-centric knowledge.
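The sketch below illustrates one common way reader span scores can be combined with paragraph ranking scores in open-domain QA. The linear interpolation and the weight alpha are assumptions for illustration, not necessarily the aggregation used in the WKLM experiments.

```python
def rescore_answers(span_predictions, paragraph_scores, alpha=0.7):
    """Combine per-span reader scores with the ranking score of the
    paragraph each span was extracted from.

    span_predictions: list of (answer_text, paragraph_id, reader_score)
    paragraph_scores: dict mapping paragraph_id -> ranker score
    """
    rescored = [
        (answer, alpha * reader_score + (1 - alpha) * paragraph_scores[pid])
        for answer, pid, reader_score in span_predictions
    ]
    return max(rescored, key=lambda x: x[1])

spans = [("Warsaw", "p1", 0.62), ("Paris", "p2", 0.70)]
para_scores = {"p1": 0.95, "p2": 0.20}
print(rescore_answers(spans, para_scores))  # strong paragraph evidence flips the decision
```

The point of such rescoring is that a slightly weaker span from a highly relevant paragraph can beat a stronger span from an off-topic one, which matches the summary's observation that paragraph ranking scores amplify WKLM's gains.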

Implications and Future Work

This paper offers an important advance in training language models to internalize and leverage entity-based knowledge without structured databases or additional memory requirements at downstream time. The implicit encoding of real-world knowledge suggests future directions in which pretrained models could be further enriched, improving performance across a broader spectrum of NLP tasks, particularly those involving complex entity reasoning.

Future work could explore larger corpora beyond Wikipedia, employ stronger entity linking for improved precision, and examine how to balance entity-focused objectives with standard masked language modeling. As the field moves toward more knowledgeable AI systems, methods such as WKLM offer a promising path to bridging language comprehension and encyclopedic understanding.

Authors (4)
  1. Wenhan Xiong (47 papers)
  2. Jingfei Du (16 papers)
  3. William Yang Wang (254 papers)
  4. Veselin Stoyanov (21 papers)
Citations (196)