This paper investigates the benefits of incorporating knowledge into the fine-tuning stage of Bidirectional Encoder Representations from Transformers (BERT) for NLP tasks. The authors adapt the Knowledge-BERT (K-BERT) model, which enriches sentences with triplets from a Knowledge Graph, to the English language and extend it to inject contextually relevant information into sentences. The core idea is to address the limitations of language models (LMs), such as high computational requirements and a lack of global context or domain knowledge necessary for complete language understanding.
The key contributions and findings include:
- Adapting K-BERT to the English domain and other word-based languages.
- Modifying the mechanism to inject semantically related information.
- Demonstrating that the inclusion of external knowledge can introduce noise.
- Showing that, in the absence of noise, external knowledge injection benefits knowledge-driven tasks.
The paper extends the original K-BERT model [liu2019kbert] by modifying the mechanism to consider semantically important information and assesses its performance on both open-domain and domain-specific tasks. It utilizes Wikidata as the Knowledge Graph in a loosely coupled manner to ensure interchangeability with other knowledge sources and examines the type of knowledge most beneficial to the fine-tuning process using ablation studies.
The methodology involves several key steps:
- Knowledge Graph: The paper employs Wikidata as the primary knowledge source, stored in a triplet format (subject, predicate, object). The data is preprocessed to reduce its size by considering only English data items in domains such as business, sports, humans, cities, and countries. The properties used are restricted to {label, alias, description, subclass of, instance of} to maximize descriptive detail while minimizing storage requirements (a filtering sketch is given after this list).
- Term-based Sentence Tree: The model inputs, namely the sentence tree and the visible matrix, are adapted to accommodate the alphabetic English language at the word level, so knowledge is injected per group of tokens instead of per single token. Given an input sentence $s = \{w_0, w_1, \dots, w_n\}$, contiguous related tokens $w_i, \dots, w_j$ are grouped together, and knowledge is injected per group of related tokens to produce an output sentence tree $t$:
$$t = \{w_0, w_1, \dots, [w_i, \dots, w_j]\{(r_{i0}, w_{i0}), \dots, (r_{ik}, w_{ik})\}, \dots, w_n\},$$
where $w_{i0}, \dots, w_{ik} \in \mathcal{E}$, the set of entities in the Knowledge Graph, $r_{i0}, \dots, r_{ik} \in \mathcal{R}$, the set of relations/properties in the Knowledge Graph, and $k$ represents the number of triplets inserted (a construction sketch is given after this list).
- Contextualized Knowledge Injection: Named entity recognition is performed to extract the entities from the input sentence $s$. The extracted entities are queried against the Knowledge Graph to retrieve a list of candidate triplets, and additional processing is done to find the most relevant triplets before injection occurs. A pre-trained Transformer model generates a contextualized embedding $\mathbf{e}_x$ for each sequence $x$ (the input sentence and each candidate). Similarity between embeddings is computed using the cosine similarity metric
$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert},$$
and the entity corresponding to the most similar embedding is selected for injection:
$$e^{*} = \arg\max_{e \in E} \, \text{sim}(\mathbf{e}_s, \mathbf{e}_e),$$
where $E$ is the set of distinct entities. A threshold parameter $\theta$ is introduced such that information is only injected if $\text{sim}(\mathbf{e}_s, \mathbf{e}_{e^{*}}) \geq \theta$ (a selection sketch is given after this list).
- Corrections & Optimizations: Sequence truncation is modified to be more "equal" when sentence pairs are fed into K-BERT, so that the number of tokens kept from $S_1$ and $S_2$ is limited to the same size rather than truncating one sentence disproportionately. Memory usage is optimized by intermediately storing the non-duplicate entries of the visible matrix in a compact vector (a truncation sketch is given after this list).
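As an illustration of the Wikidata preprocessing step, the sketch below filters raw (subject, predicate, object) triplets down to the retained properties and domains; the data format, variable names, and domain labels are assumptions for illustration, not the authors' pipeline.

```python
# Sketch of the Wikidata preprocessing: keep only triplets whose predicate is
# one of the retained properties and whose subject belongs to a kept domain.
# The in-memory representation and the domain labels are illustrative only.

KEPT_PROPERTIES = {"label", "alias", "description", "subclass of", "instance of"}
KEPT_DOMAINS = {"business", "sports", "human", "city", "country"}

def filter_triplets(triplets, entity_domains):
    """triplets: iterable of (subject, predicate, object) strings;
    entity_domains: maps a subject to its domain label (if known)."""
    kept = []
    for subject, predicate, obj in triplets:
        if predicate not in KEPT_PROPERTIES:
            continue
        if entity_domains.get(subject) not in KEPT_DOMAINS:
            continue
        kept.append((subject, predicate, obj))
    return kept
```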
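The term-based sentence tree construction could look roughly like the sketch below, where contiguous tokens of one recognised entity are grouped and that entity's triplets are attached to the whole group; the data structures, function names, and the `max_triplets` cap are illustrative assumptions rather than the paper's implementation.

```python
# Sketch of building a term-based sentence tree: tokens that form one entity
# are grouped, and the entity's (relation, object) branches hang off the group.
# Data structures and the max_triplets cap are illustrative assumptions.

def build_sentence_tree(tokens, entity_spans, kg_lookup, max_triplets=2):
    """tokens: list of words; entity_spans: list of (start, end, entity_id);
    kg_lookup: maps an entity_id to a list of (relation, object) pairs."""
    span_at = {start: (end, entity_id) for start, end, entity_id in entity_spans}
    tree, i = [], 0
    while i < len(tokens):
        if i in span_at:
            end, entity_id = span_at[i]
            group = tokens[i:end]                                    # [w_i, ..., w_j]
            branches = kg_lookup.get(entity_id, [])[:max_triplets]   # [(r_i0, w_i0), ...]
            tree.append((group, branches))
            i = end
        else:
            tree.append(([tokens[i]], []))                           # plain token, no branch
            i += 1
    return tree
```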
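A minimal sketch of the similarity-based selection step is shown below, assuming the sentence-transformers library as the embedding backbone; the model name, threshold value, and triplet-to-text conversion are assumptions, since the paper does not prescribe them here.

```python
# Sketch of contextualized knowledge selection: embed the sentence and the
# candidate triplets, pick the most cosine-similar candidate, and inject it
# only if the similarity clears the threshold theta.  Model choice and the
# triplet-to-text conversion are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained Transformer encoder

def select_triplet(sentence, candidate_triplets, theta=0.5):
    """Return the most similar (subject, predicate, object) triplet, or None."""
    if not candidate_triplets:
        return None
    texts = [" ".join(t) for t in candidate_triplets]
    sent_emb = model.encode(sentence, convert_to_tensor=True)
    cand_embs = model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, cand_embs)[0]      # cosine similarity per candidate
    best = int(sims.argmax())
    return candidate_triplets[best] if float(sims[best]) >= theta else None
```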
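The more balanced pair truncation could be realised as in the sketch below, which caps each segment at half of the token budget and lets a short segment's unused budget be reclaimed; the exact splitting rule used in the paper may differ.

```python
# Sketch of "equal" truncation for a sentence pair: each segment gets half the
# budget, and leftover budget from a short segment is reclaimed by the other.
# The half-and-half split is an assumption for illustration.
def truncate_pair(tokens_s1, tokens_s2, max_len=256):
    budget_each = max_len // 2
    s1, s2 = tokens_s1[:budget_each], tokens_s2[:budget_each]
    spare = max_len - len(s1) - len(s2)          # unused budget, if any
    if len(tokens_s1) > len(s1):
        s1 = tokens_s1[:len(s1) + spare]
    elif len(tokens_s2) > len(s2):
        s2 = tokens_s2[:len(s2) + spare]
    return s1, s2
```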
The evaluation focuses on the Semantic Textual Similarity Benchmark (STS-B) and the ag_news_subset text-classification dataset. The experimental setup involves fine-tuning BERT and K-BERT for ten epochs, using batch sizes of $16$ and $32$ for STS-B and ag_news_subset, respectively, with the best-performing learning rate selected separately for each dataset. The threshold parameter $\theta$ was set to $0.5$ for STS-B and $0.6$ for ag_news_subset, and a maximum sequence length of $256$ was used for STS-B and $128$ for ag_news_subset.
Experiments include:
- Knowledge Ablation: Each type of injected knowledge (alias, categorical, and descriptive information) is in turn excluded from K-BERT and the experiments are rerun.
- Knowledge-Gating: Information is only injected into a sentence when the resulting sequence length stays below the maximum sequence length, so that injection never causes truncation (a gating sketch is given after this list).
- Manual Knowledge Injection: A manual selection of knowledge is performed to identify deficiencies in the automated similarity-based approach for relevant knowledge fusion.
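The knowledge-gating rule can be sketched as below, under the assumption that injection is simply skipped whenever the enriched sequence would exceed the maximum length; the token-level representation is illustrative.

```python
# Sketch of knowledge-gating: append the triplet tokens only if the enriched
# sequence still fits within the maximum sequence length, so that knowledge
# injection never causes truncation of the original sentence.
def gated_inject(tokens, triplet_tokens, max_seq_len=128):
    if len(tokens) + len(triplet_tokens) <= max_seq_len:
        return tokens + triplet_tokens
    return tokens
```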
The results indicate that for STS-B, adding knowledge to the BERT model leads to an overall reduction in performance, implying that the knowledge introduces noise. After performing Knowledge Ablation, overall performance improves marginally compared to the standard K-BERT. For ag_news_subset, K-BERT achieves a higher average test accuracy than BERT, with descriptive and categorical information appearing to be the most beneficial. $\text{K-BERT}_{\text{MANUAL}}$ produces the best results of all K-BERT variations as well as the original BERT model, yielding an improvement over BERT.
Statistical significance is assessed using one-tailed Student's t-tests. The results show that the differences on ag_news_subset are statistically insignificant overall, but $\text{K-BERT}_{\text{MANUAL}}$ shows a statistically significant benefit from the inclusion of knowledge.
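For reference, such a one-tailed comparison of per-run scores could be carried out as sketched below with SciPy; the score arrays are placeholders, not results from the paper.

```python
# Sketch of a one-tailed Student's t-test comparing per-run test accuracies;
# the score arrays are placeholders, not the paper's actual results.
from scipy import stats

bert_scores = [0.900, 0.905, 0.898, 0.902, 0.901]    # hypothetical BERT runs
kbert_scores = [0.908, 0.911, 0.906, 0.910, 0.909]   # hypothetical K-BERT_MANUAL runs

# H1: K-BERT_MANUAL's mean accuracy is greater than BERT's (one-tailed).
t_stat, p_value = stats.ttest_ind(kbert_scores, bert_scores, alternative="greater")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```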
The paper concludes that the fusion of knowledge from the Wikidata Knowledge Graph has potential benefits, but that all autonomous approaches and knowledge types introduce some noise, which causes a decline in performance. Even with noise minimized, knowledge fusion shows no benefit on the STS-B dataset, whereas ag_news_subset yields an improvement over BERT. The authors suggest that, given an appropriate problem, injecting knowledge sparingly with relevant, high-quality information is preferable. Future work should explore more advanced contextual mechanisms that consider factors beyond similarity, as well as pre-disambiguation steps.