- The paper finds that only 7 of 18 neural recommendation models could be reproduced with reasonable effort, challenging the perceived progress in the field.
- It finds that advanced deep learning models often underperform compared to simpler heuristic baselines in top-n recommendation tasks.
- The study advocates for improved transparency, rigorous baseline selection, and standardized evaluation practices to foster genuine advancements.
Evaluating the Progress in Neural Recommendation Systems through Reproducibility and Baseline Comparisons
Introduction
The rapid adoption of deep learning techniques in recommender systems has heralded a new era of algorithmic advancements. However, this swift proliferation of neural recommendation approaches raises critical challenges around the reproducibility of results and the selection of baselines. This write-up critically evaluates the findings of a systematic analysis of 18 algorithms presented at top-tier research conferences, focusing on top-n recommendation tasks. The core question is whether contemporary deep learning-based recommendation models genuinely surpass simpler heuristic methods when rigorously re-evaluated.
Reproducibility of Published Results
Of the 18 algorithms scrutinized, only 7 could be reproduced with reasonable effort. This finding underscores a prevalent issue in machine learning applied to recommender systems: code and data are rarely shared comprehensively enough for a study to be replicated in full. Although digital research artifacts should in principle make reproduction straightforward, the paper reveals a disconcerting trend: even when source code is available, it often omits critical components such as the data preprocessing steps or the hyper-parameter tuning procedure. This gap significantly impedes validating new models against existing ones and clouds the assessment of genuine progress in the field.
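To make this concrete, here is a minimal sketch of the kind of preprocessing and splitting detail whose absence the analysis criticizes. It is hypothetical Python, not code from the paper: the headerless CSV layout, the filtering threshold, and the split fraction are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

RANDOM_SEED = 42  # fixed seed so the split is reproducible


def preprocess_and_split(ratings_path, min_interactions=5, test_frac=0.2):
    """Load (user, item, rating) data, filter sparse users, and produce
    a seeded train/test split. Documenting exactly these choices --
    filtering threshold, split fraction, random seed -- is the detail
    often missing from published artifacts."""
    # Assumes a headerless CSV of user, item, rating columns.
    ratings = pd.read_csv(ratings_path, names=["user", "item", "rating"])

    # Drop users with too few interactions: a common but frequently
    # undocumented preprocessing step that can change results.
    counts = ratings.groupby("user")["item"].transform("count")
    ratings = ratings[counts >= min_interactions]

    # Seeded random holdout split.
    rng = np.random.default_rng(RANDOM_SEED)
    is_test = rng.random(len(ratings)) < test_frac
    return ratings[~is_test], ratings[is_test]
```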
Baseline Comparisons and Evaluation
Re-examining these models against a set of personalized and non-personalized heuristics reveals startling results. In the majority of cases, the advanced neural models offer no significant improvement over simpler methods such as nearest-neighbor or graph-based techniques. For instance, one of the evaluated models was consistently outperformed by heuristic baselines across various datasets, and a non-personalized method that simply recommends the most popular items performed remarkably well on at least one dataset. This outcome raises critical questions about what deep learning has actually contributed to the recommender systems domain, suggesting that progress may not be as substantial as previously perceived.
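For context, the popularity heuristic in question is trivial to implement. Below is a minimal sketch of such a "most popular items" recommender, assuming interactions are available as (user, item) pairs; all names and signatures are illustrative rather than taken from the paper's code.

```python
from collections import Counter


def top_popular(train_interactions, n=10):
    """Non-personalized baseline: count how often each item occurs in
    the training interactions and return the n most frequent ones."""
    counts = Counter(item for _, item in train_interactions)
    return [item for item, _ in counts.most_common(n)]


def recommend(user, popular_items, seen_by_user, n=10):
    """Recommend the top-n popular items the user has not yet seen.
    In practice, compute more than n popular items up front so that
    filtering out seen items still leaves a full list."""
    unseen = [i for i in popular_items
              if i not in seen_by_user.get(user, set())]
    return unseen[:n]
```

Despite its simplicity, a baseline like this sets a floor that any learned model should clear by a convincing margin.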
Furthermore, the evaluation exposes a problem with using certain neural recommendation algorithms themselves as baselines: when challenged with well-tuned simpler methods, they often fail to deliver superior performance. This emphasizes the need for more careful selection and optimization of baselines in future research to ensure fair and meaningful comparisons.
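"Well-tuned" can mean something as simple as an exhaustive grid search over a baseline's few hyper-parameters. The sketch below shows a generic version of that procedure; the item-kNN example in the trailing comment uses hypothetical fit_item_knn and precision_at_10 helpers that stand in for whatever baseline and metric a study adopts.

```python
def tune_baseline(fit, evaluate, train, valid, grid):
    """Exhaustive grid search over a baseline's hyper-parameter grid:
    the kind of tuning effort the analysis argues baselines rarely
    receive before being compared against.

    fit(train, **params) -> model and evaluate(model, valid) -> score
    are supplied by the caller; grid is an iterable of parameter dicts."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        model = fit(train, **params)
        score = evaluate(model, valid)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical usage: tuning the neighborhood size k of an item-kNN baseline.
# best, _ = tune_baseline(fit_item_knn, precision_at_10, train, valid,
#                         grid=[{"k": k} for k in (5, 10, 20, 50, 100, 200)])
```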
Recommendations for Future Research
Given these findings, the paper advocates improved scientific practices in recommender systems research, emphasizing greater transparency in sharing research artifacts and more rigorous baseline selection and tuning. Progress should not be measured merely by incremental accuracy gains; it should mean innovative solutions that are both reproducible and demonstrably superior to existing methods.
Additionally, the paper calls for a shift in evaluation culture, recommending that future studies prioritize meaningful metrics and evaluation protocols tailored to specific application contexts. A move towards standardized datasets and more comprehensive evaluation frameworks could enable more objective assessments of algorithmic advancements in the field.
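As one concrete anchor, the sketch below computes two widely used top-n accuracy metrics, precision@n and NDCG@n, for a single user under binary relevance. These are common choices in top-n evaluation generally; the write-up does not specify the paper's exact protocol.

```python
import math


def precision_at_n(recommended, relevant, n=10):
    """Fraction of the top-n recommended items that are relevant."""
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n


def ndcg_at_n(recommended, relevant, n=10):
    """Normalized discounted cumulative gain with binary relevance:
    hits near the top of the list count more than hits near the bottom."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:n])
              if item in relevant)
    # Ideal DCG: all relevant items ranked first.
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant), n)))
    return dcg / ideal if ideal > 0 else 0.0
```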
Future explorations should also extend beyond the current scope to include a wider range of publication venues and recommendation tasks. Incorporating traditional algorithms such as matrix factorization as benchmarks, as sketched below, could likewise provide a more holistic view of the advancements in neural recommendation techniques.
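For illustration, a plain matrix-factorization benchmark of the kind suggested fits in a few lines. The following SGD-based sketch works on explicit ratings; the factor count, learning rate, and regularization are arbitrary placeholders rather than tuned settings from any study.

```python
import numpy as np


def train_mf(interactions, n_users, n_items, factors=32,
             lr=0.01, reg=0.05, epochs=10, seed=0):
    """Plain matrix factorization trained with SGD on explicit
    (user, item, rating) triples with integer ids. The predicted
    score for user u and item i is the dot product P[u] @ Q[i]."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(n_users, factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_items, factors))  # item factors
    for _ in range(epochs):
        for u, i, r in interactions:
            pu = P[u].copy()                       # snapshot before update
            err = r - pu @ Q[i]                    # prediction error
            P[u] += lr * (err * Q[i] - reg * pu)   # gradient step, user side
            Q[i] += lr * (err * pu - reg * Q[i])   # gradient step, item side
    return P, Q
```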
Conclusion
This critical analysis sheds light on concerning trends in the domain of neural recommendation systems, notably around the issues of reproducibility and the true extent of progress achieved. By highlighting the effectiveness of simpler heuristics against state-of-the-art neural models, the paper calls into question the real advancements brought by deep learning to recommender systems. Moving forward, the community must adopt more rigorous scientific practices and evaluation standards to foster meaningful and verifiable progress in the development of recommendation algorithms.