This paper addresses the problem of multi-hop knowledge graph (KG) reasoning for query answering (QA), where the goal is to find target entities $e_t$ given a source entity $e_s$ and a query relation $r_q$, such that $(e_s, r_q, e_t)$ is a valid but potentially unobserved fact in the KG. The authors frame this as a sequential decision process solved with reinforcement learning (RL), building upon the "learning to walk" approach proposed by MINERVA (Das et al., 2018).
The core challenges in this RL setup are:
- False Negative Supervision: The training KG is incomplete, meaning a path might lead to a correct answer that exists in the full KG but is not present in the training data. In the standard setup, this receives zero reward, incorrectly penalizing potentially good search paths.
- Spurious Paths: The agent might discover paths that incidentally lead to a correct answer but are irrelevant to the query relation (e.g., a path unrelated to bornIn coincidentally connecting Obama and Hawaii). Since the policy gradient method (REINFORCE) is on-policy, it can be biased towards these spurious paths found early in training.
To address these issues, the authors propose two main modeling advances:
- Knowledge-Based Reward Shaping (RS): Instead of a binary reward based solely on the training KG, they introduce a soft reward signal for entities not found in the training set. This soft reward is estimated using a pre-trained KG embedding model (like ComplEx or ConvE). For a potential target entity $e_T$ reached at the final state $s_T$, the reward is defined as:

  $R(s_T) = R_b(s_T) + (1 - R_b(s_T))\, f(e_s, r_q, e_T)$

  where $R_b(s_T)$ is the binary reward (1 if $(e_s, r_q, e_T)$ is in the training KG, 0 otherwise) and $f(e_s, r_q, e_T)$ is the score from the pre-trained embedding model. This allows the agent to receive partial credit for reaching potentially correct but unobserved answers, mitigating the false negative problem. The embedding model is pre-trained and its parameters are fixed during the RL training.
- Action Dropout (AD): To counter the policy's tendency to converge prematurely to spurious paths and encourage better exploration, the authors propose randomly masking a fraction of outgoing edges (actions) at each step during the training process. This is implemented by perturbing the agent's action probability distribution with a random binary mask $\mathbf{m}$:

  $\tilde{\pi}_\theta(a_t \mid s_t) \propto \pi_\theta(a_t \mid s_t) \cdot \mathbf{m} + \epsilon$

  where each entry of $\mathbf{m}$ is sampled from a Bernoulli distribution with probability $1 - \alpha$ ($\alpha$ being the dropout rate), and $\epsilon$ is a small smoothing constant. This forces the agent to explore a more diverse set of paths beyond the currently highest-scoring ones, making the learned policy more robust to spurious correlations (a code sketch of both techniques follows below).
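The two ideas translate almost directly into code. Below is a minimal PyTorch-style sketch (not the authors' implementation); `embedding_score` stands in for the output of the frozen pre-trained ComplEx/ConvE scorer mapped to [0, 1], and the function names, tensor shapes, and default `epsilon` are illustrative assumptions:

```python
import torch

def shaped_reward(binary_reward, embedding_score):
    """Knowledge-based reward shaping: R(s_T) = R_b + (1 - R_b) * f(e_s, r_q, e_T).
    binary_reward is 1.0 for answers observed in the training KG, 0.0 otherwise;
    embedding_score is the frozen pre-trained model's score in [0, 1]."""
    return binary_reward + (1.0 - binary_reward) * embedding_score

def action_dropout(action_probs, alpha, epsilon=1e-5):
    """Perturb the policy's action distribution with a random binary mask.
    Each outgoing edge is kept with probability 1 - alpha; epsilon keeps the
    masked distribution from collapsing to all zeros."""
    mask = torch.bernoulli(torch.full_like(action_probs, 1.0 - alpha))
    perturbed = action_probs * mask + epsilon
    return perturbed / perturbed.sum(dim=-1, keepdim=True)

# Example: sample the next edge from the perturbed distribution during training.
probs = torch.softmax(torch.randn(1, 8), dim=-1)       # policy scores over 8 edges
next_action = torch.multinomial(action_dropout(probs, alpha=0.3), num_samples=1)
```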
Implementation Details:
The walk-based QA is formulated as an MDP:
- State ($s_t$): The current entity $e_t$ and the query $(e_s, r_q)$.
- Action ($a_t$): An outgoing edge of the current entity $e_t$. Actions are represented by concatenating the relation and target entity embeddings. A self-loop action allows termination.
- Policy Network ($\pi_\theta$): An LSTM encodes the history of traversed entities and relations. An MLP takes the current entity embedding, the query relation embedding, and the LSTM hidden state to score the possible actions (outgoing edges) at the current entity (a code sketch follows this list).
- Training: The policy network is trained using the REINFORCE algorithm to maximize the expected cumulative reward, incorporating the knowledge-based reward shaping. For queries with multiple correct answers, each $(e_s, r_q, e_t)$ triple is treated as a separate training instance, and the other known correct answers for $(e_s, r_q)$ are masked during the last step's target selection to force the agent to walk towards $e_t$.
- Decoding: Beam search is used to find a set of promising paths. Multiple paths can lead to the same entity; the score for a unique entity is the maximum score among paths reaching it.
- KG Handling: KGs are augmented with inverse relations. Node fan-out is limited by a threshold to manage computational cost.
- Hyperparameters: Key hyperparameters include the embedding size, LSTM hidden size, reasoning path length $T$, action dropout rate $\alpha$, entropy regularization weight, learning rate, batch size, and dropout rates for network layers. The optimal $\alpha$ is found to correlate positively with KG density.
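As a concrete illustration of the policy network described above, here is a rough PyTorch sketch. It assumes actions are represented as the concatenation of relation and target-entity embeddings, as stated earlier; the class name, dimensions, and the dot-product scoring detail are assumptions, not the released code:

```python
import torch
import torch.nn as nn

class PathPolicy(nn.Module):
    """Illustrative walk policy: an LSTM encodes the traversed path history and
    an MLP scores the outgoing edges of the current entity."""

    def __init__(self, ent_dim, rel_dim, hidden_dim):
        super().__init__()
        self.history_lstm = nn.LSTM(ent_dim + rel_dim, hidden_dim, batch_first=True)
        # Input: [current entity; query relation; LSTM hidden state]
        self.mlp = nn.Sequential(
            nn.Linear(ent_dim + rel_dim + hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, ent_dim + rel_dim),
        )

    def forward(self, history, cur_ent, query_rel, cand_actions):
        # history:      (B, t, rel_dim + ent_dim)  [relation; entity] steps taken so far
        # cur_ent:      (B, ent_dim)               current entity embedding
        # query_rel:    (B, rel_dim)               query relation embedding
        # cand_actions: (B, K, rel_dim + ent_dim)  one row per outgoing edge (incl. self-loop)
        _, (h, _) = self.history_lstm(history)
        state = torch.cat([cur_ent, query_rel, h[-1]], dim=-1)
        keys = self.mlp(state)                               # (B, rel_dim + ent_dim)
        scores = (cand_actions * keys.unsqueeze(1)).sum(-1)  # dot product per candidate edge
        return torch.softmax(scores, dim=-1)                 # distribution over the K edges
```

At decoding time these probabilities drive the beam search described above; when several beams reach the same entity, that entity keeps the maximum path score.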
Experimental Results and Analysis:
The approach is evaluated on five benchmark KGs: Kinship, UMLS, FB15k-237, WN18RR, and NELL-995.
- The proposed model (Ours), using either ComplEx or ConvE for reward shaping, significantly outperforms previous multi-hop reasoning methods like MINERVA, NeuralLP, and NTP-λ on most datasets (UMLS, Kinship, FB15k-237).
- It achieves performance consistently comparable to or better than strong embedding-based baselines (DistMult, ComplEx, ConvE), a key contribution as prior path-based methods often lagged behind embeddings on certain datasets.
- An ablation study shows that both reward shaping (-RS) and action dropout (-AD) contribute significantly to the performance improvement, although their relative impact varies across datasets. Removing action dropout generally causes the larger drop of the two.
- Analysis of training convergence shows that both RS and AD lead to faster convergence to higher performance levels. AD, in particular, improves performance immediately.
- Examining path diversity reveals that AD substantially increases the number of unique paths explored during training. While RS slightly decreases the number of unique paths (acting as a guide), the combination yields the best results, suggesting guided exploration is key.
- Performance analysis by relation type (to-many vs. to-one) shows that the proposed techniques are generally more effective for relations with multiple answers (to-many).
- Analysis by query type (seen vs. unseen in training data) suggests that RS and AD improve performance on unseen queries, with AD being particularly beneficial for generalization.
Practical Implications:
- The combination of symbolic path search with knowledge from continuous embeddings via reward shaping is a powerful technique for leveraging complementary strengths.
- Action dropout provides a simple yet effective mechanism to encourage exploration and improve the robustness of on-policy RL in structured environments like KGs, mitigating the impact of spurious paths.
- The approach demonstrates that multi-hop reasoning models can achieve performance competitive with state-of-the-art embedding-based models, while offering better interpretability through the learned paths.
- Implementation requires integrating standard components like LSTMs, MLPs, and pre-trained embedding models within an RL training loop (REINFORCE); a minimal sketch of such an update follows this list. Careful hyperparameter tuning, especially of the action dropout rate based on KG structure (density), is important. Handling the large action space necessitates techniques like neighbor pruning.
- The method offers a way to perform inference by exploring interpretable paths, which can be valuable for debugging or explaining model predictions.
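To make the training-loop integration mentioned above concrete, here is a minimal sketch of how a REINFORCE update with the shaped terminal reward and entropy regularization might look; the function name, tensor shapes, and the default `beta` are assumptions rather than the paper's exact formulation:

```python
import torch

def reinforce_loss(log_probs, shaped_reward, entropy, beta=0.02):
    """One-rollout REINFORCE objective with the shaped terminal reward.

    log_probs:     (B, T) log pi(a_t | s_t) along each sampled path
    shaped_reward: (B,)   terminal reward R(s_T) for the entity reached
    entropy:       (B,)   mean policy entropy along the path
    beta:          entropy regularization weight (illustrative value)
    """
    path_log_prob = log_probs.sum(dim=1)                    # log-probability of the whole rollout
    policy_loss = -(shaped_reward.detach() * path_log_prob).mean()
    return policy_loss - beta * entropy.mean()              # entropy bonus encourages exploration
```

Minimizing this loss with a standard optimizer over sampled rollouts corresponds to maximizing the expected shaped reward; the sketch is only meant to show where the shaped reward and the entropy term enter the update.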
The paper concludes that access to a more accurate environment representation (via RS) and thorough exploration (via AD) are crucial for high-performance RL-based KGQA. Future work could explore learnable reward shaping and action dropout, and integrating model-based RL techniques.