- The paper introduces a novel reinforcement learning approach that directly models the correlation between user engagement and item recommendations using offline interaction logs.
- It employs a distributional RL framework to capture the uncertainty in stochastic user feedback and termination events, improving reward estimation.
- Experimental validation in a short-video recommendation system demonstrates significant gains in engagement metrics such as CTR, CVR, video views, and duration.
On Modeling Long-Term User Engagement from Stochastic Feedback
Introduction
The paper "On Modeling Long-Term User Engagement from Stochastic Feedback" (2302.06101) proposes a novel approach to optimizing long-term user engagement in recommender systems (RS) with reinforcement learning (RL). Existing RL-based methods are computationally expensive, in part because they must store candidate items. The paper introduces an efficient alternative that models the correlation between user engagement and items directly from data, and it accounts for randomness in user feedback and in session termination, both of which prior work often overlooks.
Background and Challenges
Traditional RL methods for RS involve modeling interactions between an agent and an environment using the Markov Decision Process (MDP) framework. Though promising for optimizing long-term engagement by utilizing items' utility as rewards, these methods suffer from excessive computational overhead and often require storing large sets of candidate items.
Moreover, these approaches usually assume deterministic rewards and an infinite horizon of interactions. Neither assumption holds in a real RS, where user feedback varies and sessions end after a finite, random number of interactions, which limits how well these methods transfer to production.
Proposed Methodology
Efficient Modeling Approach
The paper proposes an alternative method by leveraging the information captured in behavior policies of existing RS. These industrial systems are frequently updated and optimized, generating valuable sequential interaction logs. The approach seeks to derive a model for user engagement by analyzing the relation between recommended items and engagement within these logs, eliminating the need to store candidate item sets.
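To make the idea of learning from logs concrete, the sketch below turns one logged session into per-step engagement targets by summing the feedback that followed each recommendation. It is a simple Monte Carlo illustration, not the paper's estimator; the function name `returns_to_go`, the discount `gamma`, and the reward encoding are assumptions introduced here.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted sum of future rewards for each step of one logged session.

    rewards: per-step engagement signals in logged order (hypothetical
             encoding, e.g. 1.0 for a completed view, 0.0 otherwise).
    gamma:   discount factor; 1.0 gives a plain sum over the rest of the session.
    """
    targets = [0.0] * len(rewards)
    running = 0.0
    # Walk the session backwards so each step reuses the total after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


session = [1.0, 0.0, 2.0]          # made-up feedback for three recommendations
print(returns_to_go(session))       # engagement still to come at each step
```

Each target can then be paired with the item recommended at that step, so no candidate set needs to be stored beyond what the log already contains.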
Addressing Stochastic Feedback
To manage randomness in user feedback, the paper employs a distributional RL framework which models the state-action value distribution rather than its expectation. This provides a richer representation of uncertainty in rewards. Additionally, the paper addresses random termination behavior by predicting the likelihood of interaction continuation, refining cumulative reward estimation.
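As an illustration of modeling a value distribution rather than its expectation, the sketch below uses quantile regression, one common distributional RL parameterization. This summary does not state which parameterization the paper uses, so that choice and the names `pinball_loss`, `predicted_quantiles`, and `target_samples` are assumptions.

```python
def pinball_loss(predicted_quantiles, target_samples):
    """Average quantile-regression (pinball) loss.

    predicted_quantiles: N values; entry i is the estimate of the return
                         quantile at level tau_i = (i + 0.5) / N.
    target_samples:      observed or bootstrapped return samples.
    """
    n = len(predicted_quantiles)
    total = 0.0
    for i, theta in enumerate(predicted_quantiles):
        tau = (i + 0.5) / n                      # midpoint quantile level
        for z in target_samples:
            u = z - theta                        # signed error
            # Under-estimates are weighted by tau, over-estimates by 1 - tau.
            total += u * (tau - (1.0 if u < 0 else 0.0))
    return total / (n * len(target_samples))


samples = [1.0, 2.0, 3.0, 4.0]                   # made-up returns
spread = pinball_loss([1.5, 3.5], samples)       # two distinct quantiles
collapsed = pinball_loss([2.5, 2.5], samples)    # both set to the mean
print(spread < collapsed)
```

The spread-out estimate scores better than the one collapsed to the mean, which is the sense in which a quantile representation retains information about the variability of feedback.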
Implementation Optimization
The paper extends distributional RL with a modified Bellman operator that incorporates random termination probabilities into reward computation. This results in an efficient adaptation of existing RS to model long-term engagement using offline data, marked by reduced computational demands compared to conventional RL methods.
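One plausible reading of a termination-aware Bellman target is a mixture: with the predicted termination probability the return is just the immediate reward, and otherwise it is the reward plus the discounted next-step return. The sketch below builds that mixture over a set of next-step atoms. The function names, the treatment of the reward as a fixed number, and the equal weighting of atoms are assumptions for illustration, not the paper's exact operator.

```python
def termination_aware_target(reward, next_atoms, p_continue, gamma=0.99):
    """Return (value, probability) atoms of a one-step target distribution.

    reward:     immediate reward (treated as fixed here for simplicity).
    next_atoms: equally weighted samples of the next-step return.
    p_continue: predicted probability that the session continues.
    """
    if not 0.0 <= p_continue <= 1.0:
        raise ValueError("p_continue must lie in [0, 1]")
    if not next_atoms:
        raise ValueError("next_atoms must be non-empty")
    # Session ends: only the immediate reward is collected.
    atoms = [(reward, 1.0 - p_continue)]
    # Session continues: add the discounted next-step return.
    weight = p_continue / len(next_atoms)
    atoms += [(reward + gamma * z, weight) for z in next_atoms]
    return atoms


def expected_value(atoms):
    """Mean of a weighted set of atoms."""
    return sum(value * weight for value, weight in atoms)


target = termination_aware_target(1.0, [2.0, 4.0], p_continue=0.5, gamma=1.0)
print(expected_value(target))   # equals reward + gamma * p_continue * mean(next_atoms)
```

Taking the mean of the mixture recovers the familiar scalar target in which the bootstrap term is scaled by the continuation probability, while the atoms themselves keep the extra spread that random termination introduces.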
Experimental Validation
Extensive real-world tests were conducted using an industrial RS serving short video recommendations. Through A/B testing, the proposed method demonstrated significant improvements in metrics for long-term engagement, including video views (VV) and duration (DUR). Notably, user interaction metrics like CTR and CVR also showed substantial gains, confirming the effectiveness of integrating stochastic feedback modeling.
Comparisons against deterministic variants showed that modeling randomness in both rewards and termination contributes to the gains in engagement metrics.
Conclusion
This research contributes a computationally efficient method to model and enhance long-term user engagement in RS without the need to store large candidate item sets. By accounting for the stochastic nature of user feedback and interaction termination, the method lets existing industrial RS be adapted to optimize long-term engagement from offline data. Future work could improve the accuracy of the predicted termination probabilities and test the robustness of the engagement model across other categories of RS.