Introduction
The capabilities of LLMs in text understanding and generation have attracted significant attention in both commercial and academic settings. These models owe much of their performance to their large neural architectures, which allow them to memorize substantial amounts of data encountered during training. A key question is to what extent such models retain and can recall information from established ontologies, a critical issue given the role ontologies play in structuring domain-specific knowledge.
Memorization Evaluation
Recent examinations of LLMs such as GPT and PYTHIA-12B seek to determine to what extent these models have internalized ontological concepts without any additional training. Through a series of experiments on the Gene Ontology (GO) and Uberon, it has been demonstrated that LLMs hold a partial repository of ontological knowledge. However, memorization of concepts is inconsistent and appears to be influenced by how frequently those concepts occur on the Web: popular concepts are more likely to be memorized, suggesting that LLMs learn them from textual material that mentions the concepts rather than from the ontological resources themselves.
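A minimal sketch of such a memorization probe is shown below. The `query_llm` helper is hypothetical (it stands in for whatever chat or completions client is used), and the concept IDs are illustrative examples from GO and Uberon rather than the paper's actual test set.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g. an API client); returns the model's raw text answer."""
    raise NotImplementedError

def probe_concept(label: str, expected_id: str) -> bool:
    """Ask the model for the ontology ID of a concept label and check for an exact match."""
    prompt = f"What is the ontology identifier of the concept labelled '{label}'? Answer with the ID only."
    answer = query_llm(prompt).strip()
    return answer == expected_id

# Example concepts from the Gene Ontology and Uberon (illustrative, not the paper's benchmark).
concepts = [
    ("mitochondrion", "GO:0005739"),
    ("forelimb", "UBERON:0002102"),
]

if __name__ == "__main__":
    hits = sum(probe_concept(label, cid) for label, cid in concepts)
    print(f"Memorized {hits}/{len(concepts)} concept IDs")
```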
Analysis of Error Patterns
An analysis of error patterns showed that the LLMs' wrong predictions still exhibit some syntactic similarity to the correct ontology entries: incorrectly predicted concept IDs tend to remain syntactically close to the correct IDs or to their associated labels. This effect becomes more pronounced when the incorrect predictions concern concepts with a higher web presence.
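One simple way to quantify this kind of syntactic proximity is a normalized string-similarity ratio, sketched below with Python's standard-library difflib; the paper's own distance measure may differ, and the example strings are illustrative.

```python
from difflib import SequenceMatcher

def similarity(predicted: str, gold: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, predicted.lower(), gold.lower()).ratio()

# A wrong ID can still be close to the correct one...
print(similarity("GO:0005737", "GO:0005739"))        # near miss on the last digit
# ...or closer to the concept's label than to its ID.
print(similarity("mitochondrial matrix", "mitochondrion"))
```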
Prediction Invariance as a Memorization Metric
The paper introduces novel metrics for assessing an LLM's retention of ontological data, with particular emphasis on how uniform the model's outputs remain across varied prompts. By altering prompt repetitions, query languages, and temperature settings, the authors find a clear correlation between prediction invariance and a concept's frequency on the Web. This suggests that consistent model outputs, irrespective of prompt perturbations, can serve as strong indicators that a concept has been memorized.
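The sketch below illustrates one plausible form such an invariance score could take: the same question is asked under several perturbations (repetition, query language, temperature), and the score is the share of answers agreeing with the most frequent one. The function name and the example answers are assumptions for illustration, not the paper's exact protocol.

```python
from collections import Counter

def invariance_score(answers: list[str]) -> float:
    """Share of answers that agree with the modal (most frequent) answer."""
    if not answers:
        return 0.0
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Answers collected for one concept under different prompt variants.
answers = ["GO:0005739", "GO:0005739", "go:0005739", "GO:0005737"]
print(invariance_score(answers))  # 0.75 -> fairly stable, hinting at memorization
```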
Conclusion
LLMs show an emerging grasp of established ontologies, with their memorization directly influenced by how prevalent the concepts are in web-based material. While these models have internalized a subset of ontological concepts, complete and uniform memorization remains unattained. The methodologies proposed in this work for measuring ontological memorization may pave the way for deeper investigations of how LLMs interact with structured domain knowledge.