A Worrying Reproducibility Study of Intent-Aware Recommendation Models (2501.10143v1)

Published 17 Jan 2025 in cs.IR

Abstract: Lately, we have observed a growing interest in intent-aware recommender systems (IARS). The promise of such systems is that they are capable of generating better recommendations by predicting and considering the underlying motivations and short-term goals of consumers. From a technical perspective, various sophisticated neural models were recently proposed in this emerging and promising area. In the broader context of complex neural recommendation models, a growing number of research works unfortunately indicates that (i) reproducing such works is often difficult and (ii) that the true benefits of such models may be limited in reality, e.g., because the reported improvements were obtained through comparisons with untuned or weak baselines. In this work, we investigate if recent research in IARS is similarly affected by such problems. Specifically, we tried to reproduce five contemporary IARS models that were published in top-level outlets, and we benchmarked them against a number of traditional non-neural recommendation models. In two of the cases, running the provided code with the optimal hyperparameters reported in the paper did not yield the results reported in the paper. Worryingly, we find that all examined IARS approaches are consistently outperformed by at least one traditional model. These findings point to sustained methodological issues and to a pressing need for more rigorous scholarly practices.

Summary

  • Attempts to replicate five state-of-the-art intent-aware recommender models failed to reproduce the originally reported results in two of the five cases.
  • The study ran each model with the optimal hyperparameters reported in its paper and benchmarked it against well-tuned traditional methods such as RP3β and EASE^R, exposing methodological flaws.
  • The findings urge greater transparency and rigorous evaluation practices to ensure reliability in recommendation system research.

Analysis of Reproducibility in Intent-Aware Recommender Systems

The paper "A Worrying Reproducibility Study of Intent-Aware Recommendation Models" offers a detailed examination of reproducibility concerns within the field of intent-aware recommender systems (IARS). IARS, which account for underlying motivations and user intents, have generated significant interest due to their potential to improve recommendation quality. However, the paper scrutinizes this assertion by investigating whether recent advancements in IARS genuinely constitute progress when benchmarked against traditional non-neural models.

Research Objectives and Methodology

The primary objective is to assess the reproducibility of results reported for contemporary IARS models. Specifically, the authors attempted to replicate five state-of-the-art IARS models published in top-tier venues. Their replication efforts involved running the provided code with the optimal hyperparameters reported in each paper and benchmarking the models against traditional recommendation methods. The baselines include well-tuned non-neural methods such as ItemKNN, UserKNN, RP3β, and EASE^R.
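
To make the nature of these baselines concrete, below is a minimal sketch of EASE^R (Steck, 2019), whose item-item weights have a closed-form solution with a single regularization hyperparameter. It assumes a binary user-item interaction matrix and is illustrative only, not the benchmarking code used in the study.

```python
import numpy as np

def ease_r(X: np.ndarray, lam: float = 100.0) -> np.ndarray:
    """Closed-form EASE^R item-item weight matrix (Steck, WWW 2019).

    X   : binary user-item interaction matrix, shape (n_users, n_items)
    lam : L2 regularization strength, the model's only hyperparameter
    """
    G = X.T @ X + lam * np.eye(X.shape[1])  # regularized item-item Gram matrix
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                   # B_ij = -P_ij / P_jj for i != j
    np.fill_diagonal(B, 0.0)                # items must not "self-recommend"
    return B

# Recommendation scores for all users: rank unseen items by the rows of X @ B.
# scores = X @ ease_r(X, lam=500.0)
```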

The authors used a structured approach to identify relevant papers, querying major academic databases for the terms 'intent' and 'recommend' and narrowing an initial set of 88 papers down to 13 based on criteria such as publication venue and recency. Of these, only five provided sufficient artifacts for an empirical investigation.

Reproducibility Findings

A significant finding of this paper is the difficulty of reproducing results even when using the authors' own code and the specified hyperparameters. In two of the five cases, the figures reported in the original papers could not be achieved, highlighting a gap between documented and observed performance.

Moreover, all examined IARS models were outperformed by at least one traditional model, calling into question the claimed superiority of IARS in empirical settings. Notably, simpler and less computationally intensive models like RP3β and EASE^R consistently showed competitive, if not superior, performance compared to their more complex neural counterparts.
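
RP3β, one of these strong traditional baselines, illustrates just how simple such models are: it scores items via a three-step random walk on the user-item bipartite graph, discounted by item popularity. The following is a minimal sketch under the assumption of a binary interaction matrix, not the tuned implementation evaluated in the paper.

```python
import numpy as np

def rp3beta(X: np.ndarray, beta: float = 0.6) -> np.ndarray:
    """Item-item scores for RP3beta (Paudel et al., 2017): a three-step
    random walk on the user-item bipartite graph with a popularity discount.

    X    : binary user-item interaction matrix, shape (n_users, n_items)
    beta : popularity-discount exponent (beta = 0 recovers plain P3)
    """
    user_deg = np.maximum(X.sum(axis=1, keepdims=True), 1)  # interactions per user
    item_deg = np.maximum(X.sum(axis=0), 1)                 # interactions per item
    P_ui = X / user_deg                     # user -> item transition probabilities
    P_iu = X.T / item_deg[:, None]          # item -> user transition probabilities
    W = P_iu @ P_ui                         # item -> user -> item walk probabilities
    return W / (item_deg ** beta)[None, :]  # penalize overly popular target items

# Scores for all users: rank unseen items by the rows of X @ rp3beta(X).
```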

Methodological Concerns and Recommendations

The paper draws attention to several methodological flaws prevalent in IARS research:

  • Baseline Selection: Many IARS models were compared against weak or improperly configured baselines, skewing the interpretation of results.
  • Hyperparameter Tuning: Critical parameters, such as embedding sizes, were often fixed rather than systematically tuned, which prevents a fair assessment of how effective both the proposed models and the traditional baselines could be (see the sketch after this list).
  • Lack of Transparency: Missing code artifacts and sparse documentation hinder reproducibility. Moreover, original authors were often unresponsive when approached for further support.
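
A systematic search on a held-out validation split avoids the fixed-setting pitfall noted above. The sketch below is a generic grid search; `train_fn` and `eval_fn` are hypothetical placeholders for whatever training and evaluation routines a benchmarking framework provides, not functions from the paper's artifacts.

```python
from itertools import product

def grid_search(train_fn, eval_fn, train, valid, param_grid):
    """Exhaustively evaluate every hyperparameter combination on a
    held-out validation split and keep the best-performing one."""
    best_score, best_params = float("-inf"), None
    keys = list(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        model = train_fn(train, **params)   # hypothetical training routine
        score = eval_fn(model, valid)       # e.g., NDCG@10 on validation data
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Hypothetical usage: tuning an ItemKNN baseline instead of running it
# with arbitrary fixed settings.
# best, _ = grid_search(train_itemknn, ndcg_at_10, train, valid,
#                       {"k": [10, 50, 100, 200], "shrink": [0, 10, 100]})
```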

Implications and Future Directions

The findings urge a reevaluation of current research practices in the development of advanced recommendation models. The authors call for heightened scientific rigor, particularly in verifying reproducibility before publication, and argue that researchers should be incentivized to comprehensively share all artifacts needed for reproducibility.

The reproducibility crisis further underscores the need for openness in research, where shared resources and methodologies foster collective progress. Looking forward, the paper suggests that ensuring reproducibility is not merely a technicality but a cornerstone of scientific advancement, and that this cultural shift is pivotal for moving beyond incremental improvements toward genuinely innovative recommender systems. Further evaluations of different recommender model families and consistent benchmarking against well-tuned baselines are crucial steps toward more reliable research findings.
