
Multi-Hop Knowledge Graph Reasoning with Reward Shaping (1808.10568v2)

Published 31 Aug 2018 in cs.AI, cs.CL, and cs.LG

Abstract: Multi-hop reasoning is an effective approach for query answering (QA) over incomplete knowledge graphs (KGs). The problem can be formulated in a reinforcement learning (RL) setup, where a policy-based agent sequentially extends its inference path until it reaches a target. However, in an incomplete KG environment, the agent receives low-quality rewards corrupted by false negatives in the training data, which harms generalization at test time. Furthermore, since no golden action sequence is used for training, the agent can be misled by spurious search trajectories that incidentally lead to the correct answer. We propose two modeling advances to address both issues: (1) we reduce the impact of false negative supervision by adopting a pretrained one-hop embedding model to estimate the reward of unobserved facts; (2) we counter the sensitivity to spurious paths of on-policy RL by forcing the agent to explore a diverse set of paths using randomly generated edge masks. Our approach significantly improves over existing path-based KGQA models on several benchmark datasets and is comparable or better than embedding-based models.

This paper addresses the problem of multi-hop knowledge graph (KG) reasoning for query answering (QA), where the goal is to find target entities $e_o$ given a source entity $e_s$ and a query relation $r_q$, such that $(e_s, r_q, e_o)$ is a valid but potentially unobserved fact in the KG. The authors frame this as a sequential decision process solved with reinforcement learning (RL), building upon the "learning to walk" approach proposed by MINERVA (Das et al., 2018).

The core challenges in this RL setup are:

  1. False Negative Supervision: The training KG is incomplete, meaning a path might lead to a correct answer $(e_s, r_q, e_o)$ that exists in the full KG but is not present in the training data. In the standard setup, this receives zero reward, incorrectly penalizing potentially good search paths.
  2. Spurious Paths: The agent might discover paths that incidentally lead to a correct answer but are irrelevant to the query relation (e.g., a path unrelated to bornIn coincidentally connecting Obama and Hawaii). Since the policy gradient method (REINFORCE) is on-policy, it can be biased towards these spurious paths found early in training.

To address these issues, the authors propose two main modeling advances:

  1. Knowledge-Based Reward Shaping (RS): Instead of a binary reward based solely on the training KG, they introduce a soft reward signal for entities not found in the training set. This soft reward is estimated using a pre-trained KG embedding model (such as ComplEx or ConvE). For a potential target entity $e_T$, the reward $R(s_T)$ at the final state $s_T$ is defined as:

    $$R(s_T) = R_b(s_T) + \big(1 - R_b(s_T)\big)\, f(e_s, r_q, e_T)$$

    where $R_b(s_T)$ is the binary reward (1 if $(e_s, r_q, e_T)$ is in the training KG, 0 otherwise) and $f(e_s, r_q, e_T)$ is the score from the pre-trained embedding model. This allows the agent to receive partial credit for reaching potentially correct but unobserved answers, mitigating the false negative problem. The embedding model is pre-trained and its parameters are kept fixed during RL training (see the first sketch after this list).

  2. Action Dropout (AD): To counter the policy's tendency to converge prematurely to spurious paths and to encourage better exploration, the authors propose randomly masking a fraction of outgoing edges (actions) at each step during training. This is implemented by perturbing the agent's action probability distribution $\pi_{\theta}(a_t \mid s_t)$ with a random binary mask $m$:

    $$\tilde{\pi}_{\theta}(a_t \mid s_t) \propto \pi_{\theta}(a_t \mid s_t) \cdot m + \epsilon$$

    where $m_i$ is sampled from a Bernoulli distribution with probability $1-\alpha$ (where $\alpha$ is the dropout rate), and $\epsilon$ is a small smoothing constant. This forces the agent to explore a more diverse set of paths beyond the currently highest-scoring ones, making the learned policy more robust to spurious correlations (see the second sketch after this list).
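
A minimal sketch of the knowledge-based reward shaping, assuming a frozen pre-trained embedding scorer (e.g., ComplEx or ConvE) whose output has been mapped to [0, 1]; the function and argument names here are illustrative, not the authors' released code:

```python
import torch

def shaped_reward(e_s, r_q, e_T, train_kg, embedding_score):
    """Knowledge-based reward shaping (illustrative sketch).

    e_s, r_q, e_T   : integer ids of source entity, query relation, candidate entity
    train_kg        : set of (head, relation, tail) triples observed in training
    embedding_score : callable returning a score in [0, 1] from a frozen,
                      pre-trained one-hop embedding model (e.g., ComplEx or ConvE)
    """
    # Binary reward from the (incomplete) training KG.
    r_b = 1.0 if (e_s, r_q, e_T) in train_kg else 0.0
    # The frozen embedding model fills in for possible false negatives:
    # R(s_T) = R_b + (1 - R_b) * f(e_s, r_q, e_T).
    with torch.no_grad():  # embedding parameters stay fixed during RL training
        f = float(embedding_score(e_s, r_q, e_T))
    return r_b + (1.0 - r_b) * f
```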

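A companion sketch of action dropout, which perturbs the policy's action distribution with a random binary mask before sampling the next edge; `alpha` and `eps` are hypothetical argument names for the dropout rate and smoothing constant:

```python
import torch

def sample_with_action_dropout(action_probs, alpha=0.5, eps=1e-5):
    """Sample the next edge from a policy distribution perturbed by a random mask.

    action_probs : 1-D tensor with pi_theta(a_t | s_t) over the valid outgoing edges
    alpha        : action dropout rate (each edge is masked with probability alpha)
    eps          : small smoothing constant so no action's probability becomes exactly zero
    """
    # m_i ~ Bernoulli(1 - alpha): keep each outgoing edge with probability 1 - alpha.
    mask = torch.bernoulli(torch.full_like(action_probs, 1.0 - alpha))
    # pi~(a_t | s_t) ∝ pi(a_t | s_t) * m + eps
    perturbed = action_probs * mask + eps
    perturbed = perturbed / perturbed.sum()
    # The mask only diversifies sampling during training; the REINFORCE gradient
    # can still be taken through the unperturbed pi_theta.
    return torch.multinomial(perturbed, num_samples=1).item()
```
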
Implementation Details:

The walk-based QA is formulated as an MDP:

  • State ($s_t$): The current entity $e_t$ and the query $(e_s, r_q)$.
  • Action ($a_t$): An outgoing edge $(r', e')$ from $e_t$. Actions are represented by concatenating the relation and target entity embeddings. A self-loop action allows termination.
  • Policy Network ($\pi_{\theta}$): An LSTM encodes the history of traversed entities and relations. An MLP takes the current entity embedding, query relation embedding, and the LSTM hidden state to score possible actions (outgoing edges) at the current entity (see the sketch after this list).
  • Training: The policy network is trained using the REINFORCE algorithm to maximize the expected cumulative reward, incorporating the knowledge-based reward shaping. For queries with multiple correct answers, each $(e_s, r_q, e_{o_i})$ pair is treated as a separate training instance, and the other known correct answers $e_{o_j}$ (for $j \neq i$) are masked during the last step's target selection to force the agent to walk towards $e_{o_i}$.
  • Decoding: Beam search is used to find a set of promising paths. Multiple paths can lead to the same entity; the score for a unique entity is the maximum score among paths reaching it.
  • KG Handling: KGs are augmented with inverse relations. Node fan-out is limited by a threshold $\eta$ to manage computational cost.
  • Hyperparameters: Key hyperparameters include embedding size, LSTM hidden size, reasoning path length, $\eta$, action dropout rate $\alpha$, entropy regularization weight, learning rate, batch size, and dropout rates for network layers. The optimal $\alpha$ is found to correlate positively with KG density.
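
To make the MDP formulation above concrete, here is a minimal sketch of the policy network (history LSTM plus an MLP that scores outgoing edges), assuming precomputed entity and relation embeddings and a padded tensor of candidate actions per step; the module and argument names are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WalkPolicy(nn.Module):
    """Sketch of the walk-based policy: an LSTM over the traversal history plus
    an MLP that scores outgoing (relation, entity) edges at the current node."""

    def __init__(self, entity_dim, relation_dim, hidden_dim):
        super().__init__()
        self.history_lstm = nn.LSTM(input_size=relation_dim + entity_dim,
                                    hidden_size=hidden_dim, batch_first=True)
        # MLP input: current entity ⊕ query relation ⊕ LSTM hidden state.
        self.mlp = nn.Sequential(
            nn.Linear(entity_dim + relation_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, relation_dim + entity_dim),
        )

    def forward(self, e_t_emb, r_q_emb, history_emb, action_embs, action_mask):
        """
        e_t_emb     : (B, entity_dim)                     current entity embedding
        r_q_emb     : (B, relation_dim)                   query relation embedding
        history_emb : (B, T, relation_dim + entity_dim)   traversed (relation, entity) pairs
        action_embs : (B, A, relation_dim + entity_dim)   candidate outgoing edges
        action_mask : (B, A) bool                         True for valid (non-padding) actions
        """
        _, (h_n, _) = self.history_lstm(history_emb)       # h_n: (1, B, hidden_dim)
        state = torch.cat([e_t_emb, r_q_emb, h_n[-1]], dim=-1)
        query = self.mlp(state)                            # (B, relation_dim + entity_dim)
        scores = torch.einsum("bd,bad->ba", query, action_embs)
        scores = scores.masked_fill(~action_mask, float("-inf"))
        return F.softmax(scores, dim=-1)                   # pi_theta(a_t | s_t)
```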

Experimental Results and Analysis:

The approach is evaluated on five benchmark KGs: Kinship, UMLS, FB15k-237, WN18RR, and NELL-995.

  • The proposed model (Ours), using either ComplEx or ConvE for reward shaping, significantly outperforms previous multi-hop reasoning methods like MINERVA, NeuralLP, and NTP-$\lambda$ on most datasets (UMLS, Kinship, FB15k-237).
  • It achieves performance consistently comparable to or better than strong embedding-based baselines (DistMult, ComplEx, ConvE), a key contribution as prior path-based methods often lagged behind embeddings on certain datasets.
  • An ablation study shows that both reward shaping (-RS) and action dropout (-AD) contribute significantly to performance improvement, although their relative impact varies across datasets. Removing action dropout generally causes a larger performance drop.
  • Analysis of training convergence shows that both RS and AD lead to faster convergence to higher performance levels. AD, in particular, improves performance immediately.
  • Examining path diversity reveals that AD substantially increases the number of unique paths explored during training. While RS slightly decreases the number of unique paths (acting as a guide), the combination yields the best results, suggesting guided exploration is key.
  • Performance analysis by relation type (to-many vs. to-one) shows that the proposed techniques are generally more effective for relations with multiple answers (to-many).
  • Analysis by query type (seen vs. unseen in training data) suggests that RS and AD improve performance on unseen queries, with AD being particularly beneficial for generalization.

Practical Implications:

  • The combination of symbolic path search with knowledge from continuous embeddings via reward shaping is a powerful technique for leveraging complementary strengths.
  • Action dropout provides a simple yet effective mechanism to encourage exploration and improve the robustness of on-policy RL in structured environments like KGs, mitigating the impact of spurious paths.
  • The approach demonstrates that multi-hop reasoning models can achieve state-of-the-art performance competitive with embedding-based models, offering better interpretability through the learned paths.
  • Implementation requires integrating standard components like LSTMs, MLPs, and pre-trained embedding models within an RL training loop (REINFORCE); a rough sketch of the update step appears after this list. Careful hyperparameter tuning, especially the action dropout rate based on KG structure (density), is important. Handling the large action space necessitates techniques like neighbor pruning.
  • The method offers a way to perform inference by exploring interpretable paths, which can be valuable for debugging or explaining model predictions.
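
For orientation, a rough sketch of the REINFORCE update used in such a training loop, combining the shaped terminal reward with optional entropy regularization; the helper name and tensor layout are assumptions, not the authors' code:

```python
import torch

def reinforce_step(log_action_probs, terminal_reward, optimizer, entropy=None, beta=0.0):
    """One REINFORCE update over a batch of sampled rollouts (illustrative sketch).

    log_action_probs : (B, T) log pi_theta of the actions actually taken
    terminal_reward  : (B,)   shaped reward R(s_T) for each rollout
    entropy          : (B, T) optional per-step policy entropy
    beta             : entropy regularization weight
    """
    # Maximize E[ R(s_T) * sum_t log pi_theta(a_t | s_t) ] by minimizing its negation.
    loss = -(terminal_reward.unsqueeze(1) * log_action_probs).sum(dim=1).mean()
    if entropy is not None:
        loss = loss - beta * entropy.sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```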

The paper concludes that access to a more accurate environment representation (via RS) and thorough exploration (via AD) are crucial for high-performance RL-based KGQA. Future work could explore learnable reward shaping and action dropout, and integrating model-based RL techniques.

Authors (3)
  1. Xi Victoria Lin (39 papers)
  2. Richard Socher (115 papers)
  3. Caiming Xiong (337 papers)
Citations (318)