Evaluation of LLMs Through Multi-Hop Reasoning and Knowledge Editing
The paper "MRKE: The Multi-hop Reasoning Evaluation of LLMs by Knowledge Edition" addresses an essential dimension of evaluating LLMs: their multi-hop reasoning ability in question answering tasks. Despite the pronounced capabilities of LLMs in multi-hop question answering (MHQA) scenarios, the authors argue that these models' genuine reasoning abilities remain inadequately explored due to conventional evaluation limitations. The authors introduce a novel benchmark, MRKE, intended to overcome current limitations by editing the established HotpotQA dataset and integrating evaluation of reasoning chains.
Key Contributions and Findings
The primary contribution of this research is a new MHQA benchmark, built by applying knowledge editing to HotpotQA so that the questions rest on newly introduced facts the models cannot have seen during training. This approach aims to mitigate data contamination, which arises when evaluation data has been exposed to a model during pretraining and thereby inflates its measured performance. In addition, the work emphasizes assessing the reasoning chain itself through sub-questions and their corresponding intermediate answers.
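To make the construction concrete, below is a minimal, hypothetical sketch of what a knowledge-edited multi-hop example could look like. The specific entity swap, field names, and sub-question decomposition are illustrative assumptions for exposition only; they do not reproduce MRKE's actual data format or editing pipeline.

```python
# Hypothetical illustration of a knowledge-edited two-hop example.
# The edit, field names, and decomposition are assumptions, not MRKE's actual data.

original = {
    "question": "Which country is the birthplace of the director of Inception?",
    "supporting_facts": [
        "Inception was directed by Christopher Nolan.",
        "Christopher Nolan was born in London, United Kingdom.",
    ],
    "answer": "United Kingdom",
}

# Knowledge editing replaces a bridge fact with counterfactual content,
# so the answer cannot be recovered from pretraining memory alone.
edited = {
    "question": original["question"],
    "supporting_facts": [
        "Inception was directed by Sofia Coppola.",            # edited bridge entity
        "Sofia Coppola was born in New York, United States.",  # consistent second hop
    ],
    "answer": "United States",
    # Reasoning-chain annotations: sub-questions with intermediate answers.
    "sub_questions": [
        ("Who directed Inception?", "Sofia Coppola"),
        ("Which country was Sofia Coppola born in?", "United States"),
    ],
}
```

A model answering the edited question correctly must actually read and chain the edited facts, rather than recalling memorized knowledge about the original entities.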
- Performance Gap: The paper finds a notable disparity between model performance on the edited dataset and on the original HotpotQA. For instance, GPT-4 scores markedly lower on MRKE (53.2 EM and 67.7 F1 on two-hop questions) than on the original HotpotQA (69.3 EM and 82.2 F1). This gap points to data contamination in traditional MHQA datasets and suggests that LLMs' reasoning abilities may be overstated.
- Reasoning Chain Evaluation: The paper introduces reasoning chain evaluation to assess whether LLMs follow the correct reasoning process to arrive at their answers. For example, GPT-4 achieves only 36.3% accuracy on producing the correct reasoning chain across the dataset. This implies that even when LLMs arrive at correct answers, they do not consistently follow the correct reasoning path, indicating reliance on memorization or heuristic shortcuts.
- Joint Evaluation Metric: To capture the interplay between question complexity and reasoning capability, the paper proposes a new metric that combines the assessment of intermediate and final answers (a minimal sketch of such metrics follows this list). The results show that performance degrades as the number of hops increases, highlighting the need for more robust reasoning strategies in LLMs.
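The sketch below illustrates how such an evaluation could be wired together: standard exact-match and token-F1 scoring for answers, a chain-accuracy check over intermediate answers, and a joint score that gives credit only when both the reasoning chain and the final answer are correct. The exact-match rule for intermediate answers and the all-or-nothing joint score are assumptions for illustration; the paper's actual scoring details may differ.

```python
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())

def f1(pred: str, gold: str) -> float:
    """Token-overlap F1, as commonly used for extractive QA answers."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def chain_correct(pred_intermediates, gold_intermediates) -> bool:
    """A reasoning chain counts as correct only if every intermediate
    (sub-question) answer matches its gold annotation."""
    return all(
        exact_match(p, g) == 1.0
        for p, g in zip(pred_intermediates, gold_intermediates)
    )

def joint_score(pred_final, gold_final, pred_intermediates, gold_intermediates) -> float:
    """Illustrative joint metric: credit only when the model both follows the
    annotated reasoning chain and produces the correct final answer."""
    if not chain_correct(pred_intermediates, gold_intermediates):
        return 0.0
    return exact_match(pred_final, gold_final)

# A correct final answer reached through a wrong intermediate step
# still scores 0 under this joint metric.
print(joint_score("United States", "United States",
                  ["Christopher Nolan", "United States"],
                  ["Sofia Coppola", "United States"]))  # -> 0.0
```

Under a metric like this, shortcutting to the right answer via memorized knowledge no longer earns credit, which is precisely the behavior the reasoning-chain evaluation is meant to expose.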
Implications
The MRKE benchmark represents a significant step toward more accurately evaluating LLMs' reasoning abilities in MHQA tasks. By isolating reasoning capabilities from memorization, the paper provides a new lens to understand and improve LLM performance. The findings suggest a critical need for developing techniques and models that enhance reasoning pathways rather than merely improving final answer generation.
Future Directions
Given the gap between final-answer accuracy and reasoning-chain correctness highlighted by these results, future work should focus on developing LLM architectures and training regimes that emphasize reasoning chains and procedural correctness. Additionally, extending the scope of MRKE and similar benchmarks to other LLMs and domains could further elucidate the reasoning dynamics at play. Investigating methods to dynamically update benchmarks in response to LLMs' evolving training datasets could also mitigate the risk of data contamination.
In summary, the paper offers a compelling methodology for evaluating the reasoning capabilities of LLMs using a refined multi-hop QA approach, presenting important insights into both current limitations and future prospects in LLM development.