CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge (2407.20564v1)

Published 30 Jul 2024 in cs.CL

Abstract: While LLMs have demonstrated impressive capabilities across various natural language processing tasks by acquiring rich factual knowledge from their broad training data, their ability to synthesize and logically reason with this knowledge in complex ways remains underexplored. In this work, we present a systematic evaluation of state-of-the-art LLMs' complex logical reasoning abilities through a novel benchmark of automatically generated complex reasoning questions over general domain and biomedical knowledge graphs. Our extensive experiments, employing diverse in-context learning techniques, reveal that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized domain-specific knowledge. We find that prompting with explicit Chain-of-Thought demonstrations can substantially improve LLM performance on complex logical reasoning tasks with diverse logical operations. Interestingly, our controlled evaluations uncover an asymmetry where LLMs display proficiency at set union operations, but struggle considerably with set intersections, a key building block of logical reasoning. To foster further work, we will publicly release our evaluation benchmark and code.
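
The union/intersection asymmetry the abstract highlights can be made concrete with a small sketch. The Python snippet below is purely illustrative and is not the authors' released benchmark code: the toy triples, the `answers` helper, and the question templates are all assumptions made for demonstration. It shows how a set-union question and a set-intersection question, each paired with its gold answer set, could be generated from a knowledge graph, which is the style of controlled evaluation the abstract describes.

```python
# Minimal sketch (not the authors' released code): generating set-union and
# set-intersection questions with gold answers from a toy knowledge graph.
# Entity and relation names are invented for illustration only.

# Toy KG stored as (head, relation, tail) triples.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "treats", "fever"),
    ("ibuprofen", "treats", "headache"),
    ("ibuprofen", "treats", "inflammation"),
    ("paracetamol", "treats", "fever"),
]

def answers(head, relation):
    """Set of tail entities reachable from `head` via `relation`."""
    return {t for h, r, t in TRIPLES if h == head and r == relation}

def union_question(h1, h2, relation, verb):
    """Union query: entities related to h1 OR h2."""
    gold = answers(h1, relation) | answers(h2, relation)
    return f"Which entities does {h1} or {h2} {verb}?", gold

def intersection_question(h1, h2, relation, verb):
    """Intersection query: entities related to BOTH h1 and h2.
    The abstract reports that LLMs handle unions well but often fail here."""
    gold = answers(h1, relation) & answers(h2, relation)
    return f"Which entities do both {h1} and {h2} {verb}?", gold

if __name__ == "__main__":
    q, gold = union_question("aspirin", "ibuprofen", "treats", "treat")
    print(q, gold)   # gold: {'headache', 'fever', 'inflammation'}
    q, gold = intersection_question("aspirin", "ibuprofen", "treats", "treat")
    print(q, gold)   # gold: {'headache'}
```

Comparing model answers against these gold sets is one way such a controlled evaluation could score union versus intersection performance separately; the actual benchmark construction is described in the paper itself.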

Authors (7)
  1. Tianshi Zheng (19 papers)
  2. Jiaxin Bai (30 papers)
  3. Yicheng Wang (41 papers)
  4. Tianqing Fang (43 papers)
  5. Yue Guo (29 papers)
  6. Yauwai Yim (8 papers)
  7. Yangqiu Song (196 papers)
Citations (1)