
Knowledge Graph Fusion for Language Model Fine-tuning (2206.14574v1)

Published 21 Jun 2022 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs such as BERT have grown in popularity due to their ability to be pre-trained and perform robustly on a wide range of Natural Language Processing tasks. Often seen as an evolution over traditional word embedding techniques, they can produce semantic representations of text, useful for tasks such as semantic similarity. However, state-of-the-art models often have high computational requirements and lack global context or domain knowledge which is required for complete language understanding. To address these limitations, we investigate the benefits of knowledge incorporation into the fine-tuning stages of BERT. An existing K-BERT model, which enriches sentences with triplets from a Knowledge Graph, is adapted for the English language and extended to inject contextually relevant information into sentences. As a side-effect, changes made to K-BERT for accommodating the English language also extend to other word-based languages. Experiments conducted indicate that injected knowledge introduces noise. We see statistically significant improvements for knowledge-driven tasks when this noise is minimised. We show evidence that, given the appropriate task, modest injection with relevant, high-quality knowledge is most performant.

This paper investigates the benefits of knowledge incorporation into the fine-tuning stages of Bidirectional Encoder Representations from Transformers (BERT) for NLP tasks. The authors adapt the Knowledge-BERT (K-BERT) model, which enriches sentences with triplets from a Knowledge Graph, for the English language and extend it to inject contextually relevant information into sentences. The core idea revolves around addressing the limitations of Language Models (LMs), such as high computational requirements and a lack of global context or domain knowledge necessary for complete language understanding.

The key contributions and findings include:

  • Adapting K-BERT to the English domain and other word-based languages.
  • Modifying the $K_\text{Query}$ mechanism to inject semantically related information.
  • Demonstrating that inclusion of external knowledge can introduce noise.
  • Showing that, in the absence of noise, external knowledge injection benefits knowledge-driven tasks.

The paper extends the original K-BERT model [liu2019kbert] by modifying the $K_\text{Query}$ mechanism to consider semantically important information and assesses its performance on both open-domain and domain-specific tasks. It utilizes Wikidata as the Knowledge Graph in a loosely coupled manner to ensure interchangeability with other knowledge sources and examines the type of knowledge most beneficial to the fine-tuning process using ablation studies.

The methodology involves several key steps:

  1. Knowledge Graph: The paper employs Wikidata as the primary knowledge source, stored in a triplet format (subject, predicate, object). The data is preprocessed to reduce its size by considering only English data items in domains like business, sports, humans, cities, and countries. The properties used are restricted to {label, alias, description, subclass of, instance of} to maximize descriptive detail while minimizing storage requirements.
  2. Term-based Sentence Tree: The model inputs, such as the sentence tree and visible matrix, are adapted to accommodate the alphabetic English language at the word level. Knowledge is injected per group of tokens rather than per single token (a sketch of this structure follows this list). Given an input sentence $s = [w_0, w_1, \ldots, w_n]$, contiguous related tokens $w_o$ to $w_q$ are grouped together, and knowledge is injected per group to produce an output sentence tree $t$:

    $t = \Big[w_0, w_1, \ldots, {w_o, \ldots, w_q}\big[(r_{op0}, w_{op0}), \ldots, (r_{opk}, w_{opk})\big], \ldots, w_n\Big]$

    where $w \in V$ (the set of entities in the Knowledge Graph $\mathrm{K}$), $r \in V$ (the set of relations/properties in the Knowledge Graph), and $k$ is the number of triplets inserted.

  3. Contextualized Knowledge Injection: Named entity recognition is performed to extract entities from the input sentence $s$. The extracted entities are queried against the Knowledge Graph to retrieve a list of triplets $E = K_\text{Query}(s, \mathrm{K})$. Additional processing to find the most relevant triplets is done before injection occurs. A pre-trained Transformer model $T$ generates contextualized embeddings $\operatorname{emb}_i$ for each sequence $\operatorname{seq}_i$ using:

    $\operatorname{emb}_i = T(\operatorname{seq}_i)$

    Similarity between embeddings is computed using the cosine similarity metric $\left\|\cdot, \cdot\right\|_\text{cos}$. The entity corresponding to the most similar embedding is selected for injection using:

    $\operatorname{max\_pos} = \argmax_{i \in I}\left(\left\|\operatorname{emb}_i, \operatorname{emb}_s\right\|_\text{cos}\right)$

    where $I$ is the set of distinct entities. A threshold parameter is introduced such that information is only injected if $\left\|\operatorname{emb}_i, \operatorname{emb}_s\right\|_\text{cos} > \text{threshold}$ (a sketch of this selection step follows this list).

  4. Corrections & Optimizations: Sequence truncation is modified to be more "equal" when sentence pairs are fed into K-BERT: the number of tokens in each of S1 and S2 is limited to $\operatorname{max\_length}/2$. Memory usage is optimized by intermediately storing the non-duplicate entries of the visible matrix in a vector of size $N(N+1)/2$ (a sketch follows this list).
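A minimal sketch of the term-based sentence tree from step 2, assuming the token grouping and per-group triplet lookup have already been performed; `build_sentence_tree`, the dictionary-based `knowledge` lookup, and the example data are illustrative and not taken from the paper.

```python
from typing import Dict, List, Tuple

# A triplet branch is a (relation, object) pair taken from the Knowledge Graph.
Branch = Tuple[str, str]

def build_sentence_tree(groups: List[List[str]],
                        knowledge: Dict[str, List[Branch]],
                        k: int = 2) -> List[Tuple[List[str], List[Branch]]]:
    """Attach up to k triplet branches to each group of contiguous related
    tokens, mirroring t = [w_0, ..., {w_o..w_q}[(r, w), ...], ..., w_n]."""
    tree = []
    for group in groups:
        surface_form = " ".join(group)                 # e.g. "New York"
        branches = knowledge.get(surface_form, [])[:k]  # [] when nothing matches
        tree.append((group, branches))
    return tree

# Illustrative usage with hypothetical data.
groups = [["Tim", "Cook"], ["is"], ["visiting"], ["New", "York"]]
knowledge = {
    "Tim Cook": [("instance of", "human"), ("description", "business executive")],
    "New York": [("instance of", "city")],
}
print(build_sentence_tree(groups, knowledge))
```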
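A minimal sketch of the similarity-gated selection in step 3, assuming the contextualized embeddings $\operatorname{emb}_i = T(\operatorname{seq}_i)$ have already been produced by a pre-trained Transformer encoder; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity ||a, b||_cos between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_entity_to_inject(sentence_emb: np.ndarray,
                            entity_embs: List[np.ndarray],
                            entities: List[str],
                            threshold: float = 0.5):
    """Pick the entity whose embedding is most similar to the sentence
    embedding (max_pos = argmax_i ||emb_i, emb_s||_cos) and return it only
    if the similarity clears the threshold; otherwise inject nothing."""
    if not entity_embs:
        return None
    sims = [cosine_similarity(e, sentence_emb) for e in entity_embs]
    max_pos = int(np.argmax(sims))
    return entities[max_pos] if sims[max_pos] > threshold else None
```

The threshold corresponds to the values reported later in the evaluation ($0.5$ for STS-B, $0.6$ for ag_news_subset).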
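A minimal sketch of the memory optimization in step 4, assuming the visible matrix is symmetric (which the $N(N+1)/2$ figure implies); `pack_visible_matrix` and `unpack_visible_matrix` are illustrative names.

```python
import numpy as np

def pack_visible_matrix(vm: np.ndarray) -> np.ndarray:
    """Keep only the upper triangle (including the diagonal) of the symmetric
    N x N visible matrix: N(N+1)/2 entries instead of N^2."""
    return vm[np.triu_indices(vm.shape[0])]

def unpack_visible_matrix(packed: np.ndarray, n: int) -> np.ndarray:
    """Rebuild the full symmetric visible matrix from its packed form."""
    vm = np.zeros((n, n), dtype=packed.dtype)
    iu = np.triu_indices(n)
    vm[iu] = packed
    return vm + vm.T - np.diag(np.diag(vm))
```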

The evaluation focuses on semantic similarity tasks using the Semantic Textual Similarity Benchmark (STS-B) and the ag_news_subset dataset. The experimental setup involves fine-tuning BERT and K-BERT for ten epochs, using batch sizes of $16$ and $32$ for STS-B and ag_news_subset, respectively. The best-performing learning rates were $4e^{-5}$ and $5e^{-5}$ for STS-B and ag_news_subset, respectively. The threshold parameter was set to $0.5$ for STS-B and $0.6$ for ag_news_subset. A maximum sequence length of $128$ was used for ag_news_subset and $256$ for STS-B.
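For reference, these reported settings can be collected into a single configuration; the dictionary below only restates the values given above, and the key names are illustrative.

```python
# Fine-tuning settings as reported in the paper; key names are illustrative.
FINETUNE_CONFIG = {
    "STS-B": {
        "epochs": 10,
        "batch_size": 16,
        "learning_rate": 4e-5,
        "threshold": 0.5,
        "max_seq_length": 256,
    },
    "ag_news_subset": {
        "epochs": 10,
        "batch_size": 32,
        "learning_rate": 5e-5,
        "threshold": 0.6,
        "max_seq_length": 128,
    },
}
```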

Experiments include:

  • Knowledge Ablation: The type of knowledge injected (aliases, categorical, and descriptive information) is excluded from K-BERT and experiments are rerun.
  • Knowledge-Gating: Information is only injected into sentences when sequence lengths are below the maximum sequence length, to avoid truncation (see the sketch after this list).
  • Manual Knowledge Injection: A manual selection of knowledge is performed to identify deficiencies in the automated similarity-based approach for relevant knowledge fusion.
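A minimal sketch of one plausible reading of the Knowledge-Gating condition, assuming token-level sequence lengths; `gate_injection` is an illustrative name, not from the paper.

```python
def gate_injection(sentence_tokens, triplet_tokens, max_seq_length=128):
    """Knowledge-Gating: only append the triplet tokens if the enriched
    sequence still fits within the maximum sequence length, so injected
    knowledge is never lost to truncation."""
    if len(sentence_tokens) + len(triplet_tokens) <= max_seq_length:
        return sentence_tokens + triplet_tokens
    return sentence_tokens  # would be truncated: keep the sentence unchanged
```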

The results indicate that for STS-B, the addition of knowledge to the BERT model leads to an overall reduction in performance, implying that the knowledge introduces noise. After Knowledge Ablation, overall performance improves marginally compared to the standard K-BERT. For ag_news_subset, K-BERT has an improved average test accuracy compared to BERT, with descriptive and categorical information appearing to be the most beneficial. $\text{K-BERT}_{\text{MANUAL}}$ produces the best results of every K-BERT variation as well as the original BERT model, with a $0.7\%$ improvement over BERT.

Statistical significance is assessed using one-tailed Student's t-tests. The t-test results show that the difference is statistically insignificant for ag_news_subset, but with a $p$-value of $0.0$, $\text{K-BERT}_{\text{MANUAL}}$ shows a statistically significant benefit from the inclusion of knowledge.

The paper concludes that the fusion of knowledge from the Wikidata Knowledge Graph has potential benefits, but that all autonomous approaches and knowledge types introduce some noise, which causes a decline in performance. Even with noise minimised, the STS-B dataset shows no benefit from knowledge fusion, while the ag_news_subset dataset yields a $0.7\%$ improvement over BERT. The authors suggest that, given the appropriate problem, injecting knowledge sparingly with relevant, high-quality information is preferable. Future work should explore advanced contextual mechanisms that consider factors beyond similarity, as well as pre-disambiguation steps.

Authors (2)
  1. Nimesh Bhana (1 paper)
  2. Terence L. van Zyl (22 papers)