PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation (2411.00163v1)

Published 31 Oct 2024 in cs.LG, cs.AI, and cs.IR

Abstract: Softmax Loss (SL) is widely applied in recommender systems (RS) and has demonstrated effectiveness. This work analyzes SL from a pairwise perspective, revealing two significant limitations: 1) the relationship between SL and conventional ranking metrics like DCG is not sufficiently tight; 2) SL is highly sensitive to false negative instances. Our analysis indicates that these limitations are primarily due to the use of the exponential function. To address these issues, this work extends SL to a new family of loss functions, termed Pairwise Softmax Loss (PSL), which replaces the exponential function in SL with other appropriate activation functions. While the revision is minimal, we highlight three merits of PSL: 1) it serves as a tighter surrogate for DCG with suitable activation functions; 2) it better balances data contributions; and 3) it acts as a specific BPR loss enhanced by Distributionally Robust Optimization (DRO). We further validate the effectiveness and robustness of PSL through empirical experiments. The code is available at https://github.com/Tiny-Snow/IR-Benchmark.


Summary

  • The paper introduces PSL, a family of loss functions that reformulates Softmax Loss from a pairwise perspective and replaces its exponential with alternative activations.
  • It details the limitations of Softmax Loss, emphasizing its weak alignment with DCG and high sensitivity to false negatives.
  • Empirical experiments demonstrate that PSL improves recommendation accuracy and robustness compared to traditional loss functions.

An Analysis of "PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation"

In the paper titled "PSL: Rethinking and Improving Softmax Loss from Pairwise Perspective for Recommendation," the authors propose a novel approach to addressing limitations of the Softmax Loss (SL) within recommender systems. SL, widely used for its effectiveness in ranking tasks, has two primary limitations: a weak relationship with traditional ranking metrics such as DCG, and a high sensitivity to false negative instances. To address these issues, the paper introduces the Pairwise Softmax Loss (PSL), a new family of loss functions that makes a minimal yet impactful alteration to SL: the exponential function is replaced with alternative activation functions.
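To make the pairwise reading concrete, SL admits a standard rewriting over a user u with positive item i; the notation below (scores s, temperature τ, score gaps d) is ours rather than quoted from the paper, but the identity itself is exact:

    % Softmax Loss over candidate items I, rewritten in pairwise form.
    % s_{ui}: model score of user u for item i; \tau: temperature;
    % d_{uij} = s_{uj} - s_{ui}: pairwise score gap.
    \mathcal{L}_{\mathrm{SL}}(u,i)
      = -\log \frac{\exp(s_{ui}/\tau)}{\sum_{j \in \mathcal{I}} \exp(s_{uj}/\tau)}
      = \log \sum_{j \in \mathcal{I}} \exp\!\Big(\frac{s_{uj} - s_{ui}}{\tau}\Big)
      = \log \sum_{j \in \mathcal{I}} \exp\!\Big(\frac{d_{uij}}{\tau}\Big).

Every negative item thus enters the loss only through its score gap to the positive, which is what licenses the pairwise analysis.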

Limitations of Softmax Loss

The paper begins with a detailed analysis of SL, highlighting its drawbacks. While SL is employed as a surrogate for ranking metrics like DCG, its exponential function yields only a loose connection to those metrics: the disparity is most visible when the exponential stands in for the Heaviside step function, which underlies the pairwise comparisons that ranking metrics count. Furthermore, the exponential exacerbates SL's sensitivity to noise, specifically false negative instances, because it assigns disproportionately large weights to high-scored noisy pairs.
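Both drawbacks trace to the same term. The exponential upper-bounds the Heaviside function (exp(d/τ) ≥ 1[d > 0]) yet keeps growing with d, so the surrogate bound is loose; and the gradient of the pairwise form above places a softmax weight on each negative, so a high-scored false negative dominates the update. The following is the standard softmax gradient, stated in our notation:

    % Weight that SL's gradient places on the pair (i, k): a softmax over gaps.
    \frac{\partial \mathcal{L}_{\mathrm{SL}}}{\partial d_{uik}}
      = \frac{1}{\tau} \cdot
        \frac{\exp(d_{uik}/\tau)}{\sum_{j \in \mathcal{I}} \exp(d_{uij}/\tau)}.
    % A false negative k is scored high, so its gap d_{uik} is large and its
    % weight grows exponentially in d_{uik}/\tau: one noisy pair can swamp
    % the contribution of all well-behaved pairs.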

Introduction of Pairwise Softmax Loss

To address the inherent limitations of SL, the authors propose PSL, which reformulates SL from a pairwise perspective. By substituting different activation functions for the exponential, PSL aims to approximate ranking metrics more tightly and to resist noise more effectively. The framework is instantiated with activations such as ReLU, Tanh, and Atan, offering flexibility along with stronger theoretical backing as a surrogate for ranking metrics.
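The following PyTorch sketch illustrates the recipe: keep SL's log-sum-exp over negatives (the structure the DRO interpretation below relies on) and apply the chosen activation to the pairwise score gap. The function and argument names are ours, and the exact placement of the activation is an assumption to be checked against the paper's repository, not a transcription of it:

    # Hypothetical PSL-style loss sketch in PyTorch; names and exact form
    # are ours, not the paper's reference implementation.
    import torch
    import torch.nn.functional as F

    ACTIVATIONS = {
        "tanh": torch.tanh,  # bounded: caps the weight any single pair can receive
        "atan": torch.atan,  # bounded, with slower saturation than tanh
        "relu": F.relu,      # zero loss once the positive out-scores the negative
    }

    def psl_loss(pos_scores: torch.Tensor,
                 neg_scores: torch.Tensor,
                 activation: str = "tanh",
                 tau: float = 1.0) -> torch.Tensor:
        """PSL-style loss for one positive and n sampled negatives per user.

        pos_scores: (batch,)   score s_ui of the positive item
        neg_scores: (batch, n) scores s_uj of the sampled negative items
        """
        sigma = ACTIVATIONS[activation]
        # Pairwise score gaps d_uij = s_uj - s_ui; SL would feed d/tau to exp.
        d = neg_scores - pos_scores.unsqueeze(-1)
        # Replace the raw gap with sigma(d); keep the log-sum-exp over negatives.
        return torch.logsumexp(sigma(d) / tau, dim=-1).mean()

For example, psl_loss(torch.randn(32), torch.randn(32, 100), activation="atan", tau=0.2) returns a scalar ready for backpropagation; switching the activation string moves between the PSL variants.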

Theoretical Insights and Empirical Validation

Theoretical analyses in the paper establish that the proposed PSL variants align more closely with DCG by leveraging activation functions that provide tighter bounds. Additionally, addressing the noise sensitivity of SL, the paper posits that PSL functions as a particular form of BPR loss enhanced through Distributionally Robust Optimization (DRO). This endows PSL with enhanced generalization capabilities, making it robust against distribution shifts common in real-world recommender systems.
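The DRO connection rests on a standard duality for KL-constrained distributionally robust optimization, sketched here in our notation with ℓ a pairwise (BPR-style) loss on the score gap and P the sampling distribution over negatives; identifying ℓ with PSL's activation σ is our paraphrase of the paper's claim:

    % Dual form of KL-constrained DRO (a standard result), in our notation.
    \max_{Q:\, \mathrm{KL}(Q \,\|\, P) \le \eta}\;
        \mathbb{E}_{j \sim Q}\big[\ell(d_{uij})\big]
      \;=\; \min_{\tau > 0}\; \Big\{\,
        \tau \log \mathbb{E}_{j \sim P}\big[\exp\big(\ell(d_{uij})/\tau\big)\big]
        + \tau \eta \,\Big\}.
    % For a fixed temperature \tau and \ell = \sigma, the inner objective
    % \tau \log \sum_j \exp(\sigma(d_{uij})/\tau) matches PSL's log-sum-exp:
    % minimizing PSL means minimizing a pairwise loss under the worst-case
    % reweighting of negatives within a KL ball, hence the robustness claim.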

Empirical experiments play a critical role in validating the proposed loss function. Conducted across multiple scenarios and datasets, they show PSL outperforming existing loss functions such as SL, BPR, AdvInfoNCE, BSL, and LLPAUC. PSL consistently improves recommendation accuracy and robustness against noise and distribution shifts, confirming in practice the advantages predicted by the theory.

Implications and Future Directions

The introduction of PSL represents a significant step forward in developing more effective and robust loss functions for recommendation tasks. By providing a framework that allows for flexible adjustments of surrogate activations, PSL offers a more nuanced approach to model training, facilitating better outcomes in various application scenarios.

The implications of this research are substantial for both theoretical and practical domains. On the theoretical front, PSL advances the understanding of the interplay between loss function design and ranking metric optimization. Practically, it equips recommender systems with a tool that can navigate the challenges of data noise and shifting user preferences, thus promising more reliable and accurate recommendations.

Looking forward, further work could explore the extension of PSL to more complex recommendation scenarios or the integration of additional context-aware elements into the PSL framework. There is also potential to optimize the efficiency of these loss functions, particularly in large-scale systems where computational overhead remains a critical concern.

In conclusion, the paper on PSL presents a compelling case for revisiting and refining fundamental components of loss functions in recommender systems. Through a thoughtful reconsideration of the exponential function's role, the authors offer a fresh perspective that strengthens the connection between loss function behavior and practical recommendation outcomes.
