- The paper demonstrates that rigorous hyper-parameter tuning of the DistMult model significantly improves performance, achieving superior MRR and Hits@10 scores on FB15k and WN18.
- The study reveals that optimizing factors such as batch size and the number of negative samples can allow this simple baseline to outperform newer architectures in knowledge base completion tasks.
- The findings prompt a reexamination of evaluation practices in KBC research, advocating comprehensive empirical studies to differentiate tuning effects from architectural innovations.
Knowledge Base Completion: Baselines Strike Back
The paper "Knowledge Base Completion: Baselines Strike Back" offers a critical examination of recent advancements in knowledge base completion (KBC) models. The authors challenge the prevailing assertion that architectural innovations primarily drive performance improvements. Instead, they demonstrate that meticulous hyper-parameter tuning of established baseline models, such as DistMult, can outperform newer models across standard datasets, including FB15k and WN18.
Core Arguments
The authors begin by elucidating the KBC task, which involves predicting missing entities in knowledge base triples, such as determining the object in the query "Donald Trump, president of, ?". The paper argues that current methodologies overemphasize architectural complexity while overlooking the importance of tuning and training objectives.
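To make the task concrete, the minimal Python sketch below frames such a query as ranking every candidate entity; the entity names and the `score_fn` placeholder are purely illustrative and not taken from the paper.

```python
# Hypothetical illustration of a KBC query: the object of the triple is
# missing, and every entity in the knowledge base is a candidate answer.
query = ("Donald Trump", "president_of", None)
candidate_entities = ["USA", "France", "Germany", "Canada"]

def rank_candidates(query, candidates, score_fn):
    """Score every candidate completion and return them from best to worst."""
    subject, relation, _ = query
    scored = [(score_fn(subject, relation, obj), obj) for obj in candidates]
    return [obj for _, obj in sorted(scored, reverse=True)]

# `score_fn` stands in for any trained KBC model, e.g. a DistMult scorer.
```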
Methodology
The authors reimplemented the DistMult model, in which entities and relations are represented as real-valued vectors and a triple is scored by a bilinear form with a diagonal relation matrix. By carefully adjusting hyper-parameters such as batch size, embedding dimensionality, and the number of negative samples, and training with a softmax-based negative log-likelihood objective over candidate entities, they optimized DistMult to the point where it surpassed most newer approaches on the FB15k and WN18 datasets in mean reciprocal rank (MRR) and Hits@10.
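The following NumPy sketch illustrates the DistMult scoring function and a softmax-style negative log-likelihood over a small candidate set; the embedding sizes, initialization, and choice of negatives are assumptions for illustration, not the authors' exact training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, num_relations, dim = 1000, 50, 200   # illustrative sizes only

# One real-valued vector per entity and per relation.
E = rng.normal(scale=0.1, size=(num_entities, dim))
R = rng.normal(scale=0.1, size=(num_relations, dim))

def distmult_score(s, r, o):
    """Bilinear-diagonal score: sum_i E[s, i] * R[r, i] * E[o, i]."""
    return float(np.sum(E[s] * R[r] * E[o]))

def softmax_nll(s, r, true_o, candidate_objects):
    """Negative log-likelihood of the correct object under a softmax over
    the candidate set (the true object plus sampled negative objects)."""
    scores = np.array([distmult_score(s, r, o) for o in candidate_objects])
    scores -= scores.max()                          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[candidate_objects.index(true_o)]

# One positive object (42) plus a few randomly chosen negatives.
loss = softmax_nll(s=3, r=7, true_o=42, candidate_objects=[42, 17, 256, 911])
```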
Results
The refined DistMult model achieved highly competitive performance, outperforming 27 of 29 compared models on FB15k in Hits@10 and attaining the highest MRR reported for that dataset. These results underscore the impact of hyper-parameter optimization relative to novel architectural changes, suggesting that some reported gains may be attributable to tuning rather than to inherently better models. Notably, larger batch sizes consistently improved performance, pointing to a simple lever for further strengthening existing models.
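For reference, MRR and Hits@k are simple functions of the rank assigned to the correct entity for each test query; a minimal sketch with made-up ranks:

```python
def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and Hits@k from 1-based ranks of the correct
    entity, one rank per test query."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits_at_k = sum(1 for r in ranks if r <= k) / len(ranks)
    return mrr, hits_at_k

# Five hypothetical test queries whose correct answers ranked 1, 3, 2, 15, 1.
print(mrr_and_hits([1, 3, 2, 15, 1], k=10))   # -> (0.58, 0.8)
```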
Implications and Future Directions
These findings prompt a reevaluation of current evaluation practices in KBC research, suggesting a shift towards more comprehensive empirical studies that can separate genuine architectural improvements from hyper-parameter effects. The paper advocates greater attention to less frequently reported metrics such as Hits@1 and MRR, which give a more nuanced picture of model quality, particularly on datasets where models reach uniformly high Hits@10 scores yet differ markedly on other metrics.
Additionally, the authors encourage exploration of the raw evaluation scenario, which may offer more realistic insights than the currently dominant filtered protocol. Future research could benefit from large-scale empirical comparisons of KBC algorithms, akin to such comparisons conducted in other areas of machine learning, to establish standardized benchmarks and advance theoretical understanding.
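To clarify the distinction, the sketch below contrasts the two protocols: the raw rank counts every higher-scoring entity, while the filtered rank first removes other entities that are also known to form true triples for the same query. The entity names and scores are invented for illustration.

```python
def rank_of_answer(scores, answer, known_true=None):
    """1-based rank of the gold answer for one (subject, relation, ?) query.

    scores     : dict mapping candidate entity -> model score
    answer     : the gold object of the test triple
    known_true : other entities that also complete the query with a true
                 triple (from train/valid/test); removing them before
                 ranking gives the 'filtered' protocol, keeping them gives
                 the 'raw' protocol.
    """
    if known_true:
        scores = {e: s for e, s in scores.items()
                  if e == answer or e not in known_true}
    target = scores[answer]
    return 1 + sum(1 for s in scores.values() if s > target)

scores = {"USA": 9.1, "Canada": 8.7, "France": 2.0}
print(rank_of_answer(scores, "Canada"))                      # raw rank -> 2
print(rank_of_answer(scores, "Canada", known_true={"USA"}))  # filtered -> 1
```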
Conclusion
This paper critically interrogates the performance metrics and underlying assumptions in KBC research, emphasizing the substantial impact that careful hyper-parameter tuning can have on established models. These insights have broad implications for machine learning research, encouraging a balanced approach that values both architectural innovation and rigorous optimization.