Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph (2404.03623v2)
Abstract: LLMs demonstrate an impressive capacity to recall a vast range of factual knowledge. However, understanding their underlying reasoning and the internal mechanisms by which they exploit this knowledge remains an active research area. This work unveils the factual information an LLM represents internally when verifying sentence-level claims. We propose an end-to-end framework that decodes the factual knowledge embedded in token representations from a vector space into a set of ground predicates, and shows its layer-wise evolution using a dynamic knowledge graph. Our framework relies on activation patching, a vector-level technique that alters a token representation during inference, to extract the encoded knowledge; accordingly, it requires neither training nor external models. Using factual and common-sense claims from two claim-verification datasets, we showcase interpretability analyses at both the local and the global level. The local analysis highlights entity centrality in LLM reasoning, from claim-related information and multi-hop reasoning to representation errors that cause erroneous evaluations. The global analysis, in turn, reveals trends in the underlying evolution, such as word-level knowledge developing into claim-related facts. By interpreting the semantics of LLM latent representations and enabling graph-based analyses, this work deepens the understanding of the factual-knowledge resolution process.
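As a rough illustration of the activation-patching step the abstract describes, the sketch below caches a token's hidden state at one layer while the model reads a claim, then injects that vector into an identity-style inspection prompt and reads off the token the model decodes from it. This is a minimal sketch, not the authors' pipeline: it assumes a HuggingFace decoder-only model with a Llama-style module layout (`model.model.layers`), and the model name, layer, position, and prompt are illustrative choices.

```python
# Minimal activation-patching sketch (illustrative; not the paper's exact code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumption: any HF decoder-only model works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

LAYER, POS = 12, -1  # layer and token position to inspect (illustrative choices)

# 1) Cache the latent representation of the claim's last token at LAYER.
claim = "The Eiffel Tower is located in Paris."
with torch.no_grad():
    out = model(**tok(claim, return_tensors="pt"), output_hidden_states=True)
cached = out.hidden_states[LAYER][0, POS]  # hidden_states[k] = output of block k

# 2) Patch the cached vector into an inspection prompt at the same position,
#    then read which token the model decodes from it.
def patch_hook(module, inputs, output):
    hidden = output[0]             # decoder blocks return a tuple; [0] is hidden states
    hidden[0, POS] = cached        # overwrite the target position with the cached vector
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER - 1].register_forward_hook(patch_hook)
prompt = "Syria: country, Einstein: physicist, x"  # identity-style inspection prompt
with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits
handle.remove()
print(tok.decode([logits[0, -1].argmax().item()]))  # token decoded from patched state
```

Tokens decoded this way can then be assembled into ground predicates and accumulated into a layer-indexed graph, e.g. `kg.add_edge("Eiffel Tower", "Paris", key="located_in", layer=LAYER)` on a `networkx.MultiDiGraph`, so the knowledge attributed to the claim can be tracked as it evolves across layers.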
Authors: Marco Bronzini, Carlo Nicolini, Bruno Lepri, Jacopo Staiano, Andrea Passerini