This paper investigates whether LLMs perform latent multi-hop reasoning when processing complex prompts that require connecting two pieces of information. For example, to answer "The mother of the singer of 'Superstition' is", an LLM would ideally (1) identify "the singer of 'Superstition'" as Stevie Wonder (the bridge entity) and (2) then use its knowledge of Stevie Wonder's mother to provide the final answer. The research analyzes these two "hops" separately, as well as their co-occurrence, to find evidence of such latent reasoning.
To support this analysis, the authors introduce the TwoHopFact dataset, comprising 45,595 two-hop prompts and their corresponding one-hop prompts across 52 fact composition types, derived from Wikidata. Experiments are conducted on LLaMA-2 models with 7B, 13B, and 70B parameters.
First Hop: Entity Recall
The first hop involves the LLM recalling the bridge entity from its descriptive mention.
- Metric: The paper proposes the Internal Entity Recall Score (EntRec), sketched in code below. This score is calculated by taking the hidden representation $h^l_{\text{mention\_end}}$ at the last token of the descriptive mention (e.g., "the singer of 'Superstition'") in a specific transformer layer $l$, applying layer normalization, projecting it into the vocabulary space with the unembedding matrix $W_U$, and then taking the log probability of the first token of the bridge entity's name (e.g., "Stevie").
$EntRec^l(\tau_{2hop}, e_b) = \log \operatorname{softmax}\big(\text{LayerNorm}(h^l_{\text{mention\_end}})\, W_U\big)_{\text{index}(e_b)}$, where $\text{index}(e_b)$ denotes the vocabulary index of the first token of $e_b$'s name.
- Experiment: To test this, prompts are modified. For a prompt like "The mother of the singer of 'Superstition' is", a counterpart is created whose descriptive mention does not refer to the bridge entity, e.g., "The mother of the singer of 'Thriller' is" (entity substitution) or "The mother of a plagiarist of 'Superstition' is" (relation substitution). The paper then measures how often $EntRec^l(\tau_{2hop}, e_b)$ is higher for the prompt that actually describes the bridge entity than for its substituted counterpart. A relative frequency > 0.5 indicates successful first-hop reasoning.
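To make the metric and the check concrete, here is a minimal sketch using HuggingFace transformers. The `ent_rec` helper, the checkpoint choice, and the way `mention_end_pos` (the index of the mention's last token, located per prompt) is supplied are illustrative assumptions, not the paper's released code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup; the paper uses LLaMA-2 at 7B/13B/70B.
model_name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def ent_rec(prompt: str, bridge_entity: str, layer: int, mention_end_pos: int) -> float:
    """EntRec^l: log-probability of the bridge entity's first token when the
    layer-l hidden state at the mention's last token is read out through the
    final norm and the unembedding matrix (the logit lens)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0, mention_end_pos]  # hidden state at layer l
    h = model.model.norm(h)                           # layer normalization (RMSNorm in LLaMA-2)
    logits = model.lm_head(h)                         # project into vocabulary space (W_U)
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    # First token of the entity name; the leading space matters to the tokenizer.
    first_id = tok(" " + bridge_entity, add_special_tokens=False).input_ids[0]
    return log_probs[first_id].item()

# Substitution check for one prompt pair: does the prompt that actually
# describes the bridge entity recall it more strongly than its counterpart?
original = "The mother of the singer of 'Superstition' is"
substituted = "The mother of the singer of 'Thriller' is"  # entity substitution
# success = (ent_rec(original, "Stevie Wonder", layer, pos_orig)
#            > ent_rec(substituted, "Stevie Wonder", layer, pos_sub))
# Aggregating `success` over the dataset gives the relative frequency;
# > 0.5 is read as evidence of first-hop recall.
```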
Findings for First Hop:
- There is substantial evidence for the first hop. For LLaMA-2 70B, changing the prompt to correctly (indirectly) mention the bridge entity increased its recall in later layers up to 78% of the time (for entity substitution) and 76% (for relation substitution).
- This ability clearly improves with increasing model size.
- Performance is contextual, with some fact composition types (e.g., "president of anthem's country") showing very high recall rates (e.g., 100% for 70B with entity substitution).
Second Hop: Knowledge Utilization
The second hop involves the LLM utilizing its knowledge about the recalled bridge entity to answer the overall prompt.
- Metric: The paper introduces the Consistency Score (CnstScore). This measures the similarity between the LLM's next-token output distribution for the two-hop prompt ($\tau_{2hop}$) and for the corresponding one-hop prompt involving the bridge entity ($\tau_{1hop}$, e.g., "The mother of Stevie Wonder is"). It is defined as the negated symmetric cross-entropy of the two distributions:
$CnstScore(\tau_{2hop}, \tau_{1hop}) = -\tfrac{1}{2}\big(H(p_{2hop}, p_{1hop}) + H(p_{1hop}, p_{2hop})\big)$, where $H(p, q) = -\sum_x p(x) \log q(x)$ and $p_{2hop}$, $p_{1hop}$ are the output distributions under the respective prompts.
Higher scores indicate greater consistency.
- Experiment: The paper investigates whether increasing EntRec for the bridge entity (by intervening on its hidden representation $h^l_{\text{mention\_end}}$) leads to an increase in CnstScore. This is done by nudging $h^l_{\text{mention\_end}}$ in a direction that increases the entity recall score and observing the sign of the derivative of CnstScore with respect to this change (sketched in code below). A positive derivative in more than 50% of cases suggests successful second-hop utilization.
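A minimal sketch of CnstScore and the intervention check, reusing `model` and `tok` from the EntRec sketch above. The hook-based nudge and the finite-difference approximation are simplifying assumptions standing in for the paper's derivative computation; `direction` and `mention_end_pos` are inputs you would supply per prompt.

```python
import torch

def next_token_log_probs(prompt: str) -> torch.Tensor:
    """Log-distribution over the next token for a prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1].float()
    return torch.log_softmax(logits, dim=-1)

def cnst_score(lp_2hop: torch.Tensor, lp_1hop: torch.Tensor) -> float:
    """Negated symmetric cross-entropy; higher means the two output
    distributions agree more (greater consistency)."""
    p2, p1 = lp_2hop.exp(), lp_1hop.exp()
    return 0.5 * ((p2 * lp_1hop).sum() + (p1 * lp_2hop).sum()).item()

def second_hop_sign(two_hop: str, one_hop: str, layer: int,
                    mention_end_pos: int, direction: torch.Tensor,
                    alpha: float = 1e-2) -> bool:
    """Nudge the layer-l hidden state at the mention's last token along
    `direction` (a direction that increases EntRec, e.g., its gradient)
    and check whether CnstScore rises."""
    lp_1hop = next_token_log_probs(one_hop)  # the one-hop pass is not intervened on

    def score(step: float) -> float:
        def hook(module, args, output):
            # Decoder layers return a tuple whose first element is the hidden states.
            output[0][0, mention_end_pos] += step * direction.to(output[0].dtype)
            return output
        handle = model.model.layers[layer].register_forward_hook(hook)
        try:
            lp_2hop = next_token_log_probs(two_hop)
        finally:
            handle.remove()
        return cnst_score(lp_2hop, lp_1hop)

    return score(alpha) > score(0.0)  # positive derivative ~ second-hop success
```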
Findings for Second Hop:
- Evidence for the second hop is moderate. For LLaMA-2 7B, increasing bridge entity recall led to increased consistency in up to 64% of cases.
- Unlike the first hop, this ability does not consistently improve with increasing model size; the peak relative frequency remained around 0.61-0.65 across 7B, 13B, and 70B models.
- Again, performance is contextual. Some types, like "founder of person's undergrad university", showed strong second-hop evidence (around 80-86%) across model sizes.
Overall Latent Multi-Hop Reasoning
This is assessed by the co-occurrence of a successful first hop (entity recall improves with the correct descriptive mention) and a successful second hop (increasing entity recall improves consistency); a tally sketch follows the findings below.
- Evidence is moderate. LLaMA-2 7B showed successful two-hop reasoning (SS: success-success) in up to 46% of cases (entity substitution) and 38% (relation substitution), above the 25% random baseline (the rate expected if each hop independently succeeded half the time).
- Scaling trends are mixed:
- With entity substitution for the first hop, overall multi-hop reasoning did not clearly improve with model size.
- With relation substitution for the first hop, there was a scaling trend (e.g., from 38% for 7B to 43% for 70B).
- Up to 23% of fact composition types showed strong multi-hop reasoning (occurring >80% of the time, with the threshold adjusted for the combined probability of both hops). For example, "anthem of capital's country" showed consistently high performance.
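The bookkeeping behind the co-occurrence analysis is just a 2x2 tally over per-prompt outcomes. A toy sketch with made-up booleans (the outcome lists would come from the two experiments above):

```python
from collections import Counter

def multi_hop_tally(first_hop: list[bool], second_hop: list[bool]) -> Counter:
    """Label each case SS/SF/FS/FF. If the two hops were independent coin
    flips, each label would occur ~25% of the time (the random baseline)."""
    tally = Counter()
    for fh, sh in zip(first_hop, second_hop):
        tally[("S" if fh else "F") + ("S" if sh else "F")] += 1
    return tally

# Toy illustration with made-up per-prompt outcomes:
tally = multi_hop_tally([True, True, False, True], [True, False, True, True])
ss_frequency = tally["SS"] / sum(tally.values())  # compare against the 0.25 baseline
```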
Discussion and Conclusion:
The paper finds that while LLMs can perform latent multi-hop reasoning, especially for certain types of prompts, the ability is highly contextual. The first hop (recalling the bridge entity) scales well with model size, but the second hop (utilizing knowledge about that entity) does not show similar scaling. This suggests potential limitations in current LLM architectures or pretraining paradigms for consistently performing complex, multi-step latent reasoning. The findings imply that simply scaling models may not improve all aspects of reasoning, and that future work may need to explore different pretraining objectives, data, or architectures. The paper also highlights challenges for model editing (if answers to composed facts are memorized directly rather than derived from base facts, editing a base fact will not propagate to the compositions that depend on it) and opportunities for developing more parameter-efficient and controllable models by strengthening these latent reasoning pathways.
Limitations Acknowledged:
- The paper focuses on one specific pathway for reasoning; others might exist.
- The analysis is layer-specific, while the effects might be distributed across layers; the results may therefore be a lower bound.
- The TwoHopFact dataset, while curated, can contain noise stemming from Wikidata itself and from the difficulty of defining the "only" or "most famous" object for a relation.
- EntRec uses only the first token of an entity and relies on logit lens, which has known shortcomings, though the authors argue these have minimal impact on their comparative analysis.