Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability (2408.07852v1)

Published 14 Aug 2024 in cs.CL, cs.AI, and cs.LG

Abstract: While many capabilities of language models (LMs) improve with increased training budget, the influence of scale on hallucinations is not yet fully understood. Hallucinations come in many forms, and there is no universally accepted definition. We thus focus on studying only those hallucinations where a correct answer appears verbatim in the training set. To fully control the training data content, we construct a knowledge graph (KG)-based dataset, and use it to train a set of increasingly large LMs. We find that for a fixed dataset, larger and longer-trained LMs hallucinate less. However, hallucinating on $\leq5$% of the training data requires an order of magnitude larger model, and thus an order of magnitude more compute, than Hoffmann et al. (2022) reported was optimal. Given this costliness, we study how hallucination detectors depend on scale. While we see detector size improves performance on fixed LM's outputs, we find an inverse relationship between the scale of the LM and the detectability of its hallucinations.

Authors (31)
  1. Jiri Hron (19 papers)
  2. Laura Culp (8 papers)
  3. Gamaleldin Elsayed (6 papers)
  4. Rosanne Liu (25 papers)
  5. Ben Adlam (25 papers)
  6. Maxwell Bileschi (2 papers)
  7. Bernd Bohnet (21 papers)
  8. JD Co-Reyes (3 papers)
  9. Noah Fiedel (22 papers)
  10. C. Daniel Freeman (22 papers)
  11. Izzeddin Gur (23 papers)
  12. Kathleen Kenealy (11 papers)
  13. Jaehoon Lee (62 papers)
  14. Peter J. Liu (30 papers)
  15. Gaurav Mishra (14 papers)
  16. Igor Mordatch (66 papers)
  17. Azade Nova (13 papers)
  18. Roman Novak (22 papers)
  19. Aaron Parisi (8 papers)
  20. Jeffrey Pennington (45 papers)

Summary

Analysis of "Training LLMs on the Knowledge Graph: Insights on Hallucinations and Their Detectability"

In the paper “Training Language Models on the Knowledge Graph: Insights on Hallucinations and Their Detectability,” the authors explore how model scale influences hallucinations in language models (LMs) trained on structured knowledge graph (KG) data. The paper is particularly valuable because it studies hallucinations in a controlled environment where the information content of the training data is fully known.

Hallucination and Model Scale

The authors focus on a specific kind of hallucination: model-produced responses that do not match any fact appearing verbatim in the training data. To study the impact of model scale and training epochs on these hallucinations, LMs are trained from scratch on data derived from a KG. The knowledge graph encodes knowledge as [subject, predicate, object] triples, which are serialized, concatenated, and tokenized into training sequences.
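
To make the data construction concrete, the following minimal sketch serializes KG triples into a single token stream. The delimiter handling and the toy whitespace tokenizer are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): serializing knowledge-graph
# triples into a token stream for training an LM from scratch.
import random

def triple_to_text(subject: str, predicate: str, obj: str) -> str:
    # One textual statement per [subject, predicate, object] triple.
    return f"{subject} {predicate} {obj}"

def build_training_stream(triples, tokenize, eos_id=0, shuffle=True):
    """Concatenate serialized triples into one token stream, separated by EOS."""
    if shuffle:
        triples = random.sample(triples, k=len(triples))
    stream = []
    for s, p, o in triples:
        stream.extend(tokenize(triple_to_text(s, p, o)))
        stream.append(eos_id)  # mark the end of each fact
    return stream

# Toy whitespace "tokenizer" mapping each word to an integer id.
vocab = {}
def toy_tokenize(text):
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]

kg_triples = [("Paris", "capital_of", "France"), ("Mars", "orbits", "Sun")]
print(build_training_stream(kg_triples, toy_tokenize))
```

Under this construction, a completion can be counted as a hallucination when the model, prompted with a subject and predicate, produces an object that never appears with them in any training triple.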

Key Findings

  1. Scaling Dynamics: A key finding is that larger models and those trained for more epochs hallucinate less on facts seen during training. However, bringing the hallucination rate down to 5% or less of the training data requires a model, and thus a compute budget, roughly an order of magnitude larger than the compute-optimal allocation reported by Hoffmann et al. (2022).
  2. Dataset Size Impact: In contrast to typical scaling-law intuitions, increasing dataset size worsens hallucination rates for a fixed LM size: a larger KG contains more facts that must be learned and potentially memorized, so a model of fixed capacity retains a smaller fraction of them.
  3. Epoch Requirements: Multi-epoch training is crucial for reducing hallucination rates. In the explored setup, hallucinations on training data reach their minimum only after training the LM for 20 or more epochs, highlighting the tension between extended training on a fixed dataset and the model's generalization capabilities.
  4. Temperature Trade-off: The paper also illustrates the effect of sampling temperature on hallucination rates. Lowering the temperature reduces hallucinations but at the expense of recall, raising concerns about overly conservative model behavior (a minimal illustration of temperature scaling follows this list).
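
The sketch below is a toy illustration of this trade-off, not the paper's evaluation code: dividing the logits by the temperature before the softmax concentrates probability mass on the top-scoring token as the temperature drops, which makes sampling more conservative.

```python
# Toy illustration of temperature scaling at sampling time (assumed logits).
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.3]  # scores for three candidate next tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
# As T decreases, sampling approaches greedy decoding: fewer spurious
# completions, but correct-but-lower-ranked facts are recalled less often.
```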

Detectability of Hallucinations

The detectability of hallucinations is another focus of the paper. The authors train detectors that probe for hallucinations at both the sentence level and the token level:

  1. Detector Types: Two types of detectors are used: simple heads appended to the pretrained LM, and full models in which the entire LM is fine-tuned to detect hallucinations after initial training (a sketch of a head-style detector follows this list).
  2. Detection and Model Size: A notable and somewhat counterintuitive finding is that while larger LMs tend to hallucinate less, the hallucinations they do produce become harder to detect. This inverse relationship between model scale and detectability challenges assumptions about the benefits of further scaling.
  3. Precision-Recall Analysis: Precision-recall evaluation shows that sentence-level detectors outperform token-level ones in terms of AUC-PR, although the margin varies with the specific LM and detector configuration.
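
The following sketch shows the head-style variant under stated assumptions: a small binary classifier trained on hidden states from a frozen pretrained LM, scored with AUC-PR. The random features, labels, and dimensions are placeholders; the full-model variant would instead fine-tune all LM parameters.

```python
# Minimal sketch of a sentence-level hallucination detector head (assumptions:
# a frozen LM exposing per-sentence hidden states, and binary labels where
# 1 marks a hallucinated completion).
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

class SentenceHallucinationHead(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: [batch, hidden_dim], e.g. the final token's hidden state.
        return self.classifier(last_hidden).squeeze(-1)  # hallucination logits

# Random tensors stand in for frozen-LM representations and labels.
hidden_dim, n = 64, 256
features = torch.randn(n, hidden_dim)
labels = torch.randint(0, 2, (n,)).float()

head = SentenceHallucinationHead(hidden_dim)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(head(features), labels)
    loss.backward()
    optimizer.step()

# AUC-PR (average precision) is the metric used to compare detector variants.
with torch.no_grad():
    scores = torch.sigmoid(head(features)).numpy()
print("AUC-PR:", average_precision_score(labels.numpy(), scores))
```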

Implications and Future Directions

The paper’s findings have significant implications:

  • Optimal Scaling: The findings challenge existing beliefs about optimal resource allocation in LM training, suggesting that alternative strategies such as retrieval augmentation may be required to achieve low hallucination rates.
  • Trade-offs in Training: The outlined trade-offs between hallucination rates and model capabilities point towards a nuanced approach to model training that balances between memorization and generalization.
  • Detector Development: The scaling challenges in hallucination detection highlight the need for innovative solutions that can efficiently manage the complexity involved in detection as models grow in size.

Conclusion

The paper pushes forward the discussion on LM hallucinations by investigating the limits and dynamics of hallucination in a thoroughly controlled knowledge graph setting. While offering valuable insights, it also opens up new questions on how we define, measure, and ultimately mitigate hallucinations in modern LLMs. This nuanced understanding is a crucial stepping stone in advancing the trustworthiness and applicability of LMs in real-world applications.
