Evaluating Multi-Hop Reasoning in LLMs: A Chemistry-Centric Case Study
The paper "Evaluating Multi-Hop Reasoning in LLMs: A Chemistry-Centric Case Study" offers a comprehensive examination of LLMs in executing complex reasoning tasks within the domain of chemistry. Despite the broad performance excellence of LLMs on various linguistic tasks, their proficiency in multi-step reasoning, particularly in scientific fields requiring compositional reasoning such as chemistry, remains an area of active research and development.
Benchmarking Approach
The authors introduce a new benchmark that combines a curated dataset with a formal evaluation protocol designed to scrutinize the compositional reasoning capabilities of LLMs in chemistry. This framework employs an automated pipeline, reinforced with validation by subject matter experts, which integrates OpenAI models with named entity recognition (NER) systems. Through these systems, chemical entities are extracted and augmented with data from external knowledge bases to create a robust knowledge graph. Multi-hop questions are generated from this graph, enabling the assessment of LLM performance both in scenarios where context is supplied and where it is absent.
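To make the pipeline concrete, the sketch below illustrates the general idea of assembling extracted entity relations into a graph and phrasing a multi-hop question as a chain of relations whose intermediate entities are never named, which is what forces compositional reasoning. The `Triple` type, the relation names, and the question template are hypothetical placeholders; the paper's actual pipeline relies on OpenAI models, NER systems, and external chemical knowledge bases rather than hand-written triples.

```python
# Minimal sketch, assuming hand-written triples stand in for NER-extracted,
# knowledge-base-augmented chemical entities. Not the paper's exact pipeline.
from dataclasses import dataclass
import networkx as nx


@dataclass
class Triple:
    subject: str
    relation: str
    obj: str


def build_knowledge_graph(triples: list[Triple]) -> nx.DiGraph:
    """Assemble (entity, relation, entity) triples into a directed graph."""
    graph = nx.DiGraph()
    for t in triples:
        graph.add_edge(t.subject, t.obj, relation=t.relation)
    return graph


def generate_multi_hop_question(graph: nx.DiGraph, start: str, hops: int = 2) -> tuple[str, str]:
    """Walk `hops` edges from a seed entity and phrase the path as one question.

    The answer is the final entity on the path; the intermediate entities are
    left implicit, so the model must resolve each hop itself.
    """
    node, relations = start, []
    for _ in range(hops):
        successors = list(graph.successors(node))
        if not successors:
            raise ValueError(f"No {hops}-hop path starting from {start}")
        nxt = successors[0]
        relations.append(graph.edges[node, nxt]["relation"])
        node = nxt
    # Nest the relations into a single question, e.g.
    # "What is the <r2> of the <r1> of aspirin?"
    phrase = start
    for rel in relations:
        phrase = f"the {rel} of {phrase}"
    return f"What is {phrase}?", node


# Toy example:
triples = [
    Triple("aspirin", "primary metabolite", "salicylic acid"),
    Triple("salicylic acid", "molecular formula", "C7H6O3"),
]
graph = build_knowledge_graph(triples)
question, answer = generate_multi_hop_question(graph, "aspirin", hops=2)
print(question)  # What is the molecular formula of the primary metabolite of aspirin?
print(answer)    # C7H6O3
```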
Key Findings
One of the salient findings of the paper is the substantial difficulty that even advanced models face with multi-hop compositional reasoning. Augmenting LLMs with document retrieval is shown to improve performance substantially. However, the experiments demonstrate that even with perfect retrieval accuracy and full context, reasoning errors persist, indicating that the difficulty is intrinsic to compositional reasoning itself rather than to retrieval alone.
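The two evaluation conditions can be sketched as a single helper that either withholds or prepends context. The model name, prompt wording, and retrieval step below are assumptions for illustration, not the paper's exact protocol.

```python
# Sketch of the closed-book vs. context-provided settings, assuming the
# OpenAI chat completions API; prompt wording and model name are placeholders.
from openai import OpenAI

client = OpenAI()


def answer(question: str, context: str | None = None, model: str = "gpt-4o") -> str:
    """Query the model once, optionally prepending retrieved passages."""
    if context is None:
        # Closed-book setting: the model must rely on parametric memory.
        prompt = question
    else:
        # Context-provided setting: retrieved passages are supplied verbatim.
        prompt = (
            "Use only the context below to answer.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```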
Model Performance and Comparisons
Extensive evaluations chart the performance of various models, including OpenAI's o3-mini and gpt-4o, when furnished with contextual information versus when they rely solely on internal memory. In the context-provided setting, models such as Claude Sonnet 3.7 with extended thinking excel in correctness, but at the cost of higher latency and computational expense. Conversely, models such as Llama 3.3 70B offer lower cost and latency but fall short of high correctness rates.
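A simple harness makes this correctness-versus-latency trade-off measurable per model. The scoring rule below (naive substring matching) and the benchmark format are illustrative assumptions, not the paper's evaluation protocol; `answer_fn` could be a helper like the `answer` function sketched earlier.

```python
# Hedged sketch of tabulating accuracy and latency per model; the scoring
# function and benchmark format are placeholders, not figures from the paper.
import time


def evaluate_model(model: str, benchmark: list[tuple[str, str]], answer_fn) -> dict:
    """Run every (question, gold_answer) pair once; record accuracy and latency."""
    correct, latencies = 0, []
    for question, gold in benchmark:
        start = time.perf_counter()
        prediction = answer_fn(question, model=model)
        latencies.append(time.perf_counter() - start)
        correct += int(gold.lower() in prediction.lower())  # naive string match
    return {
        "model": model,
        "accuracy": correct / len(benchmark),
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```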
Implications and Speculation
The implications of this research are twofold. Practically, it underscores the indispensable role of retrieval augmentation in domain-specific reasoning tasks, notably within chemistry. Theoretically, it confirms the difficulty of compositional reasoning, showing that current LLMs remain constrained when multi-hop questions require synthesizing information from multiple sources.
Looking ahead, this work provides a foundation for future advances. Building models that handle domain-specific reasoning efficiently will require further refinement of retrieval systems and new strategies for integrating context across multiple reasoning steps. As retrieval-augmented generation continues to evolve, its application to complex scientific domains remains a promising frontier for AI development.
In summary, this paper significantly contributes to the understanding and evaluation of LLM capabilities in scientific reasoning, providing a novel framework and benchmark that can spur future research toward overcoming the current limitations in AI-driven compositional reasoning.