Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs (2404.00942v1)

Published 1 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The advent of LLMs has significantly transformed the AI landscape, enhancing machine learning and AI capabilities. Factuality is a critical concern for LLMs, as they may generate factually incorrect responses. In this paper, we propose GraphEval, which evaluates an LLM's factual performance on a very large test dataset. Specifically, the test dataset is retrieved from a large knowledge graph with more than 10 million facts, without expensive human effort. Unlike conventional methods that evaluate LLMs based on generated responses, GraphEval streamlines the evaluation process by creating a judge model to estimate the correctness of the answers given by the LLM. Our experiments demonstrate that the judge model's factuality assessment aligns closely with the correctness of the LLM's generated outputs, while also substantially reducing evaluation costs. Our findings further offer valuable insights into LLM performance across different metrics and highlight the potential for future improvements in ensuring the factual integrity of LLM outputs. The code is publicly available at https://github.com/xz-liu/GraphEval.
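
To make the two-step pipeline concrete, the sketch below shows one plausible way to wire it together: knowledge-graph triples are turned into true/false factuality probes (negatives by corrupting the tail entity), and a cheap judge model scores the LLM's answers instead of exact-matching free-form text. This is a minimal illustration under stated assumptions, not the authors' implementation; `Triple`, `make_probe`, `query_llm`, and `judge_model` are hypothetical names, and the linked repository is the authoritative source for the actual method.

```python
# Illustrative GraphEval-style sketch (NOT the paper's exact code).
# Assumptions: KG facts are (head, relation, tail) triples, and the
# caller supplies `query_llm` (expensive generation) and `judge_model`
# (a cheap correctness classifier) as hypothetical callables.

import random
from dataclasses import dataclass

@dataclass
class Triple:
    head: str      # e.g. "Paris"
    relation: str  # e.g. "is the capital of"
    tail: str      # e.g. "France"

def make_probe(triple, all_tails):
    """Turn a triple into a true/false question; half are corrupted."""
    if random.random() < 0.5:
        return f"True or false: {triple.head} {triple.relation} {triple.tail}.", True
    # Negative sample: swap in a wrong tail entity from the graph.
    wrong = random.choice([t for t in all_tails if t != triple.tail])
    return f"True or false: {triple.head} {triple.relation} {wrong}.", False

def evaluate(triples, query_llm, judge_model):
    """Estimate factual accuracy: the judge model classifies whether the
    LLM's answer is correct, avoiding brittle string matching."""
    all_tails = [t.tail for t in triples]
    num_correct = 0
    for triple in triples:
        question, label = make_probe(triple, all_tails)
        answer = query_llm(question)                 # expensive LLM call
        judged_true = judge_model(question, answer)  # cheap classifier
        num_correct += int(judged_true == label)
    return num_correct / len(triples)
```

The design point the abstract emphasizes is the last step: because the judge is a small model rather than a human annotator or an exact-match harness, the same probe set can be scaled to millions of facts at low cost.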

Authors (7)
  1. Xiaoze Liu (22 papers)
  2. Feijie Wu (14 papers)
  3. Tianyang Xu (53 papers)
  4. Zhuo Chen (319 papers)
  5. Yichi Zhang (184 papers)
  6. Xiaoqian Wang (34 papers)
  7. Jing Gao (98 papers)
Citations (3)