Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender System (1809.07428v1)

Published 19 Sep 2018 in cs.LG, cs.IR, and stat.ML

Abstract: We propose a novel way to train ranking models, such as recommender systems, that are both effective and efficient. Knowledge distillation (KD) was shown to be successful in image recognition to achieve both effectiveness and efficiency. We propose a KD technique for learning to rank problems, called \emph{ranking distillation (RD)}. Specifically, we train a smaller student model to learn to rank documents/items from both the training data and the supervision of a larger teacher model. The student model achieves a similar ranking performance to that of the large teacher model, but its smaller model size makes the online inference more efficient. RD is flexible because it is orthogonal to the choices of ranking models for the teacher and student. We address the challenges of RD for ranking problems. The experiments on public data sets and state-of-the-art recommendation models showed that RD achieves its design purposes: the student model learnt with RD has a model size less than half of the teacher model while achieving a ranking performance similar to the teacher model and much better than the student model learnt without RD.

Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender Systems

In the domain of information retrieval (IR), and recommender systems in particular, the challenge of achieving both effective performance and efficient computation continues to drive research. The paper "Ranking Distillation: Learning Compact Ranking Models With High Performance for Recommender Systems" tackles this challenge with a novel approach termed Ranking Distillation (RD), which adapts the idea of Knowledge Distillation (KD), originally shown to be effective in image recognition, to learning-to-rank problems.

Overview of Ranking Distillation

Ranking Distillation (RD) trains a compact student model to mimic the ranking behavior of a larger, more powerful teacher model. The student is trained not only on the usual labeled data but also on additional supervisory signals derived from the teacher model's rankings of unlabeled items. This allows the student to achieve ranking effectiveness comparable to the teacher, while its smaller size makes online inference more efficient.
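
A minimal sketch of how this extra supervision can be produced is shown below, assuming a frozen, pre-trained teacher and PyTorch-style models. The function and variable names are illustrative placeholders, not the authors' implementation.

```python
import torch

def build_teacher_supervision(teacher_model, user_ids, candidate_items, k=10):
    """For each user (query), score all candidate items with the frozen
    teacher and keep its top-K items as extra supervision for the student."""
    with torch.no_grad():
        # Hypothetical interface: the teacher returns one score per candidate
        # item for every user in the batch, shape (batch_size, n_items).
        scores = teacher_model(user_ids, candidate_items)
        topk_items = scores.topk(k, dim=-1).indices  # (batch_size, K)
    return topk_items
```

These top-K item lists are computed once (or periodically) with the teacher and then reused throughout the student's training.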

Methodology

The critical innovation of RD lies in its formulation. Unlike classical KD, which deals mostly with classification tasks, RD addresses ranking tasks, which are about ordering items rather than assigning class labels. The teacher model generates a top-K ranking of unlabeled items for each query, providing supervision beyond the labeled training data to guide the student model's learning. The student's loss function combines the traditional ranking loss with a distillation loss that incorporates weighting schemes emphasizing important positions and ranking discrepancies. These weighting schemes ensure that the student pays more attention to the teacher's top-ranked documents and remain flexible enough to adapt to different datasets.
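
The following is a hedged sketch of such a hybrid objective: a standard ranking loss on the labeled items plus a weighted distillation loss over the teacher's top-K items. The cross-entropy ranking term and the logarithmic position weighting are illustrative simplifications, not the paper's exact weighting schemes (which also account for the ranking discrepancy between teacher and student).

```python
import torch
import torch.nn.functional as F

def ranking_distillation_loss(student_scores, positive_items, teacher_topk,
                              lambda_d=0.5):
    """student_scores: (batch, n_items) raw scores from the student model.
    positive_items:    (batch,) ground-truth item ids from the training data.
    teacher_topk:      (batch, K) item ids ranked highest by the teacher."""
    # Standard ranking loss on the labeled data (here a simple cross-entropy
    # over all items; the paper's models use their own ranking losses).
    ranking_loss = F.cross_entropy(student_scores, positive_items)

    # Distillation loss: push the student to score the teacher's top-K items
    # highly, weighting higher teacher ranks more heavily.
    topk_scores = student_scores.gather(1, teacher_topk)          # (batch, K)
    ranks = torch.arange(1, teacher_topk.size(1) + 1,
                         device=student_scores.device).float()
    position_weight = 1.0 / torch.log2(ranks + 1.0)               # (K,)
    distill_loss = -(position_weight *
                     F.logsigmoid(topk_scores)).sum(dim=1).mean()

    # lambda_d balances fitting the labeled data against following the teacher.
    return ranking_loss + lambda_d * distill_loss
```

The balancing weight (`lambda_d` here) controls how strongly the teacher's rankings influence the student relative to the observed training labels.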

Empirical Evaluation

Using the real-world datasets Gowalla and Foursquare, the paper demonstrates that RD achieves its dual goals. Student models trained with RD attained ranking performance similar to, and in some cases exceeding, that of their teacher models, while having less than half the teachers' model size. Notably, the results show clear improvements over student models trained without RD, indicating that the extra supervisory signals from the teacher contribute to better generalization and performance.

Practical and Theoretical Implications

The practical implications of this research are significant for recommender systems where efficiency is paramount, especially in settings that require rapid online inference, such as sequential recommendation. Theoretically, RD extends the concept of knowledge distillation to ranking tasks and suggests potential applications in other domains of IR beyond recommender systems.

Future Directions

Future research could refine the distillation process further, for example through adaptive weighting mechanisms or deeper integration of contextual information into the distillation framework. The scalability of RD on even larger datasets could also be probed, possibly involving hybrid models that leverage sparse and dense data representations effectively.

Overall, this paper contributes a compelling approach to resolving the trade-off between model performance and computational efficiency, pushing forward the capabilities of recommender systems with minimal compromise on effectiveness.

Authors (2)
  1. Jiaxi Tang (6 papers)
  2. Ke Wang (529 papers)
Citations (165)