
Do Large Language Models Latently Perform Multi-Hop Reasoning? (2402.16837v1)

Published 26 Feb 2024 in cs.CL

Abstract: We study whether LLMs latently perform multi-hop reasoning with complex prompts such as "The mother of the singer of 'Superstition' is". We look for evidence of a latent reasoning pathway where an LLM (1) latently identifies "the singer of 'Superstition'" as Stevie Wonder, the bridge entity, and (2) uses its knowledge of Stevie Wonder's mother to complete the prompt. We analyze these two hops individually and consider their co-occurrence as indicative of latent multi-hop reasoning. For the first hop, we test if changing the prompt to indirectly mention the bridge entity instead of any other entity increases the LLM's internal recall of the bridge entity. For the second hop, we test if increasing this recall causes the LLM to better utilize what it knows about the bridge entity. We find strong evidence of latent multi-hop reasoning for the prompts of certain relation types, with the reasoning pathway used in more than 80% of the prompts. However, the utilization is highly contextual, varying across different types of prompts. Also, on average, the evidence for the second hop and the full multi-hop traversal is rather moderate and only substantial for the first hop. Moreover, we find a clear scaling trend with increasing model size for the first hop of reasoning but not for the second hop. Our experimental findings suggest potential challenges and opportunities for future development and applications of LLMs.

This paper investigates whether LLMs perform latent multi-hop reasoning when processing complex prompts that require connecting two pieces of information. For example, to answer "The mother of the singer of 'Superstition' is", an LLM would ideally (1) identify "the singer of 'Superstition'" as Stevie Wonder (the bridge entity) and (2) then use its knowledge of Stevie Wonder's mother to provide the final answer. The research analyzes these two "hops" individually and treats their co-occurrence as evidence of such latent reasoning.

To support this study, the authors introduce the TwoHopFact dataset, comprising 45,595 two-hop prompts and corresponding one-hop prompts across 52 fact composition types, derived from Wikidata. Experiments are conducted on LLaMA-2 models with 7B, 13B, and 70B parameters.
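
For intuition, a single item in the dataset pairs a two-hop prompt with its one-hop counterpart and the bridge entity connecting them. The sketch below is purely illustrative; the field names and the fact-composition label are assumptions, not the dataset's actual schema.

```python
# Illustrative (hypothetical) structure of a single TwoHopFact-style item; the
# real dataset's field names and type labels may differ.
example_item = {
    "fact_composition_type": "mother of song's singer",  # assumed label
    "two_hop_prompt": "The mother of the singer of 'Superstition' is",
    "one_hop_prompt": "The mother of Stevie Wonder is",
    "bridge_entity": "Stevie Wonder",
    "answer": "Lula Mae Hardaway",
}
```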

First Hop: Entity Recall

The first hop involves the LLM recalling the bridge entity from its descriptive mention.

  • Metric: The paper proposes the Internal Entity Recall Score (EntRec). This score is calculated by taking the hidden representation at the last token of the descriptive mention (e.g., "the singer of 'Superstition'") in a specific transformer layer $l$, applying layer normalization, projecting it to the vocabulary space using the unembedding matrix $W_U$, and then taking the log probability of the first token of the bridge entity's name (e.g., "Stevie").

    $EntRec^l(\tau_{2hop}, e_b) = \log \text{softmax}\left(\text{LayerNorm}(h^l_{\text{mention\_end}})\, W_U\right)_{\text{index}(\text{first token of } e_b)}$

  • Experiment: To test this, prompts are modified. For a prompt $\tau_{2hop}$ like "The mother of the singer of 'Superstition' is", a counterpart $\tau'_{2hop}$ is created in which the descriptive mention no longer refers to the bridge entity, e.g., "The mother of the singer of 'Thriller' is" (entity substitution) or "The mother of a plagiarist of 'Superstition' is" (relation substitution). The paper measures how often $EntRec^l(\tau_{2hop}, e_b) > EntRec^l(\tau'_{2hop}, e_b)$; a relative frequency above 0.5 indicates successful first-hop reasoning. A minimal sketch of the EntRec computation follows this list.
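
Below is a minimal sketch of how EntRec could be computed, assuming a LLaMA-2 checkpoint loaded through Hugging Face transformers and a logit-lens-style readout. The model name, the use of the model's final RMSNorm as the LayerNorm in the definition, the externally supplied `mention_end_idx`, and the single-token handling of the bridge entity are assumptions for illustration, not the authors' implementation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def ent_rec(prompt: str, mention_end_idx: int, bridge_entity: str, layer: int) -> float:
    """Log-probability of the bridge entity's first token, read out (logit-lens
    style) from the hidden state at the last token of the descriptive mention.

    `mention_end_idx` is the position of that last token in the tokenized
    prompt and is assumed to be located separately.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors of shape
    # [1, seq_len, d_model]; index `layer` is the residual stream after that layer.
    h = out.hidden_states[layer][0, mention_end_idx]
    h = model.model.norm(h)      # final normalization (RMSNorm in LLaMA-2)
    logits = model.lm_head(h)    # project to the vocabulary with the unembedding matrix W_U
    log_probs = torch.log_softmax(logits, dim=-1)
    # Simplified: take the first sub-word token of the entity name
    # (tokenizer whitespace handling is glossed over here).
    first_token_id = tokenizer(bridge_entity, add_special_tokens=False).input_ids[0]
    return log_probs[first_token_id].item()
```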

Findings for First Hop:

  • There is substantial evidence for the first hop. For LLaMA-2 70B, changing the prompt so that its descriptive mention (indirectly) refers to the bridge entity increased the entity's internal recall in later layers in up to 78% of cases (entity substitution) and 76% (relation substitution).
  • This ability clearly improves with increasing model size.
  • Performance is contextual, with some fact composition types (e.g., "president of anthem's country") showing very high recall rates (e.g., 100% for 70B with entity substitution).

Second Hop: Knowledge Utilization

The second hop involves the LLM utilizing its knowledge about the recalled bridge entity to answer the overall prompt.

  • Metric: The paper introduces the Consistency Score (CnstScore). This measures the similarity between the LLM's output probability distribution for the two-hop prompt ($\mathbf{p}_{\tau_{2hop}}$) and for the corresponding one-hop prompt involving the bridge entity ($\mathbf{p}_{\tau_{1hop}}$, e.g., for "The mother of Stevie Wonder is"). It is defined as the negated symmetric cross-entropy:

    $CnstScore(\tau_{2hop}, \tau_{1hop}) = -0.5 \cdot \text{H}(\mathbf{p}_{\tau_{2hop}}, \mathbf{p}_{\tau_{1hop}}) - 0.5 \cdot \text{H}(\mathbf{p}_{\tau_{1hop}}, \mathbf{p}_{\tau_{2hop}})$

    Higher scores indicate greater consistency.

  • Experiment: The paper investigates whether increasing EntRec for the bridge entity (by intervening on its hidden representation $h^l_{\text{mention\_end}}$) leads to an increase in CnstScore. This is done by nudging $h^l_{\text{mention\_end}}$ in the direction of $\nabla_{h^l} EntRec$ and observing the sign of the derivative of CnstScore with respect to this change, $\left. \frac{d}{d\alpha} CnstScore(\alpha) \right|_{\alpha=0}$. A positive derivative in more than 50% of cases suggests successful second-hop utilization; a conceptual sketch of this test follows this list.
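
The sketch below illustrates, under stated assumptions, the consistency score and the gradient-based intervention test. `ent_rec_fn` and `patched_cnst_score_fn` are hypothetical helpers standing in for recomputing EntRec from a hidden state and for rerunning the model with that hidden state patched in; the finite-difference step size is arbitrary. This is an illustration of the test's logic, not the authors' code.

```python
import torch

def cnst_score(p_two_hop: torch.Tensor, p_one_hop: torch.Tensor) -> torch.Tensor:
    """Negated symmetric cross-entropy between two next-token distributions."""
    eps = 1e-12  # numerical safety for log of near-zero probabilities
    h_21 = -(p_two_hop * torch.log(p_one_hop + eps)).sum()
    h_12 = -(p_one_hop * torch.log(p_two_hop + eps)).sum()
    return -0.5 * h_21 - 0.5 * h_12

def second_hop_evidence(h, ent_rec_fn, patched_cnst_score_fn, alpha: float = 1e-3) -> bool:
    """Nudge h^l at the mention end along the EntRec gradient and check whether
    the consistency score increases (finite-difference estimate of the
    derivative at alpha = 0). Both callables are hypothetical stand-ins."""
    h = h.detach().requires_grad_(True)
    grad = torch.autograd.grad(ent_rec_fn(h), h)[0]  # direction that increases EntRec
    direction = grad / grad.norm()

    base = patched_cnst_score_fn(h.detach())
    nudged = patched_cnst_score_fn((h + alpha * direction).detach())
    return bool((nudged - base) / alpha > 0)  # positive derivative => second-hop evidence
```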

Findings for Second Hop:

  • Evidence for the second hop is moderate. For LLaMA-2 7B, increasing bridge entity recall led to increased consistency in up to 64% of cases.
  • Unlike the first hop, this ability does not consistently improve with increasing model size; the peak relative frequency remained around 0.61-0.65 across 7B, 13B, and 70B models.
  • Again, performance is contextual. Some types, like "founder of person's undergrad university", showed strong second-hop evidence (around 80-86%) across model sizes.

Overall Latent Multi-Hop Reasoning

Overall latent multi-hop reasoning is assessed by the co-occurrence, on the same prompt, of a successful first hop (entity recall improves with the correct descriptive mention) and a successful second hop (increasing entity recall improves consistency), as sketched below.
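
For concreteness, here is a small illustrative sketch of how the two per-prompt outcomes can be combined into the four categories (SS, SF, FS, FF) and how the SS rate compares to the 25% baseline expected if each hop succeeded at chance. The function names and the `results` structure are hypothetical.

```python
from collections import Counter

def categorize(first_hop_success: bool, second_hop_success: bool) -> str:
    """Map per-prompt outcomes to one of SS, SF, FS, FF."""
    return ("S" if first_hop_success else "F") + ("S" if second_hop_success else "F")

def ss_rate(results) -> float:
    """Fraction of prompts where both hops succeed.

    `results` is a hypothetical list of (first_hop_success, second_hop_success)
    pairs, one per two-hop prompt. If each hop succeeded at chance (50%), the
    expected SS rate would be 0.5 * 0.5 = 25%, the random baseline quoted below."""
    counts = Counter(categorize(f, s) for f, s in results)
    return counts["SS"] / len(results)
```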

  • Evidence is moderate. LLaMA-2 7B showed successful two-hop reasoning (SS: success-success) in up to 46% of cases (entity substitution) and 38% (relation substitution), above a 25% random baseline.
  • Scaling trends are mixed:
    • With entity substitution for the first hop, overall multi-hop reasoning did not clearly improve with model size.
    • With relation substitution for the first hop, there was a scaling trend (e.g., from 38% for 7B to 43% for 70B).
  • Up to 23% of fact composition types showed strong multi-hop reasoning (occurring >80% of the time, with the threshold adjusted for the combined probability). For example, "anthem of capital's country" showed consistently high performance.

Discussion and Conclusion:

The paper finds that while LLMs can perform latent multi-hop reasoning, especially for certain types of prompts, this ability is highly contextual. The first hop (recalling the bridge entity) scales well with model size, but the second hop (utilizing knowledge about that entity) does not show similar scaling. This suggests potential limitations in current LLM architectures or pretraining paradigms for consistently performing complex, multi-step latent reasoning. The findings imply that simply scaling models might not be enough to improve all aspects of reasoning and that future work might need to explore different pretraining objectives, data, or architectures. The paper highlights challenges for model editing (if complex facts are merely recalled, editing base facts won't propagate) and opportunities for developing more parameter-efficient and controllable models by strengthening these reasoning pathways.

Limitations Acknowledged:

  • The paper focuses on one specific pathway for reasoning; others might exist.
  • Analysis is layer-specific, while effects might be distributed. Thus, results might be a lower bound.
  • The TwoHopFact dataset, while curated, can have noise from Wikidata and the difficulty of defining "only" or "most famous" objects for relations.
  • EntRec uses only the first token of an entity and relies on the logit lens, which has known shortcomings, though the authors argue these have minimal impact on their comparative analysis.
Authors (5)
  1. Sohee Yang
  2. Elena Gribovskaya
  3. Nora Kassner
  4. Mor Geva
  5. Sebastian Riedel