A Re-evaluation of Knowledge Graph Completion Methods
The paper "A Re-evaluation of Knowledge Graph Completion Methods" examines the evaluation protocols used in Knowledge Graph Completion (KGC), the task of predicting missing links in large-scale knowledge graphs. Its central claim is that the evaluation procedures adopted by many recent neural network-based methods have inflated reported performance gains, thereby misleading the research community.
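To make the task concrete, the following minimal sketch shows the standard tail-prediction evaluation loop in Python; the `score` function and entity list are hypothetical placeholders rather than any specific model from the paper.

```python
# Minimal sketch of link-prediction evaluation in KGC (hypothetical model).
# Given a test triple (head, relation, true_tail), every entity is scored
# as a candidate tail and the rank of the true tail is computed.

def rank_true_tail(score, head, relation, true_tail, entities):
    true_score = score(head, relation, true_tail)
    # Count candidates scored strictly higher than the true tail;
    # how ties (equal scores) are broken is exactly the issue the
    # paper raises, discussed below.
    higher = sum(1 for e in entities
                 if e != true_tail and score(head, relation, e) > true_score)
    return higher + 1  # rank 1 is best
```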
The authors argue that the unusually large improvements reported by many KGC models stem from inappropriate evaluation protocols, particularly ones that artificially inflate performance metrics. The critique targets methods built on complex neural architectures such as CNNs, RNNs, GNNs, and Capsule Networks, which, according to the authors, are often assessed under lax evaluation strategies that do not withstand rigorous scrutiny.
At the core of the paper is a simple, robust evaluation protocol that removes these biases. The authors divide existing methods into "Non-Affected" and "Affected" classes according to their sensitivity to the choice of protocol. By their analysis, methods such as ConvE, RotatE, and TuckER are Non-Affected, performing consistently regardless of the protocol used, whereas methods such as ConvKB and CapsE are Affected, with performance that varies substantially with the evaluation setup.
Much of the analysis concerns how models handle tied scores. The authors show that some models assign identical scores to many candidate triples, and that placing the correct triple favorably among such ties erroneously inflates mean reciprocal rank (MRR) and Hits@10 under certain evaluation settings. They formalize three tie-breaking protocols, Top, Bottom, and Random, and advocate Random as the most balanced: it gives models with tied scores neither an unfair advantage nor an unfair disadvantage.
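As an illustration (the helper below is a hedged sketch of my own, not code from the paper), the rank of the correct candidate under each tie-breaking protocol can be computed directly from the candidate scores:

```python
import random

def rank_with_ties(scores, correct_idx, protocol="random"):
    """Rank (1 = best) of the correct candidate among `scores`,
    with ties broken per the Top, Bottom, or Random protocol."""
    correct = scores[correct_idx]
    higher = sum(1 for s in scores if s > correct)
    ties = sum(1 for i, s in enumerate(scores)
               if s == correct and i != correct_idx)
    if protocol == "top":      # correct triple placed first among ties
        return higher + 1
    if protocol == "bottom":   # correct triple placed last among ties
        return higher + ties + 1
    # "random": correct triple placed uniformly among the tied candidates
    return higher + random.randint(0, ties) + 1

# A degenerate model that gives every candidate the same score:
scores = [0.5] * 1000
print(rank_with_ties(scores, 0, "top"))     # 1    -> looks perfect
print(rank_with_ties(scores, 0, "bottom"))  # 1000 -> looks useless
```

The degenerate case makes the concern tangible: a constant scoring function earns a perfect rank under Top, the worst possible rank under Bottom, and its statistically expected rank under Random.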
Experiments on the standard benchmarks FB15k-237 and WN18RR show how moving from permissive to more stringent evaluation changes the perceived performance of these models. Affected methods exhibit marked discrepancies in performance metrics across protocols, underscoring the importance of adopting rigorous, fair evaluation frameworks, such as the proposed Random protocol, in future KGC research.
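For reference, the reported metrics are simple functions of these per-triple ranks; the definitions below are the standard ones, not code from the paper:

```python
def mrr(ranks):
    # Mean reciprocal rank: average of 1/rank over all test triples.
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    # Hits@k: fraction of test triples whose correct entity ranks in the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)
```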
The study thus serves both as a corrective to prior evaluation practice and as a push toward more transparent and reliable model assessment. It urges developers and researchers to reconsider how scores are computed and results are interpreted, enabling better-grounded comparisons across methodologies.
In practical terms, the work underscores the need for replication and validation of research findings and shows that small changes in evaluation procedure can have large ramifications for reported model capabilities. Theoretically, it calls for re-examining existing results and for more rigorous benchmarking in AI model assessment.
Looking forward, the work implies that as KGC techniques continue to evolve, both methodological innovations and evaluation criteria must be rigorously revisited so that genuine progress is accurately captured and reported. The paper lays the groundwork for future studies to build on more robust and equitable evaluation, not only in KGC but potentially in other AI domains that rely on complex model architectures.