Value Penalized Q-Learning for Recommender Systems (2110.07923v2)

Published 15 Oct 2021 in cs.LG and cs.AI

Abstract: Scaling reinforcement learning (RL) to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents matches the objective of RS, i.e., improving customers' long-term satisfaction. A key approach to this goal is offline RL, which aims to learn policies from logged data. However, the high-dimensional action space and the non-stationary dynamics in commercial RS intensify distributional shift issues, making it challenging to apply offline RL methods to RS. To alleviate the action distribution shift problem when extracting an RL policy from static trajectories, we propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm. It penalizes unstable Q-values in the regression target with uncertainty-aware weights and requires no estimate of the behavior policy, making it suitable for RS with a large number of items. We derive the penalty weights from the variances across an ensemble of Q-functions. To alleviate distributional shift issues at test time, we further introduce a critic framework that integrates the proposed method with classic RS models. Extensive experiments conducted on two real-world datasets show that the proposed method can serve as a gain plug-in for existing RS models.
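The abstract describes the core mechanism only at a high level: the Bellman regression target is penalized according to the disagreement across an ensemble of Q-functions. The sketch below illustrates one plausible reading of that idea, subtracting the ensemble standard deviation of the next-state Q-value from the bootstrapped target. The network sizes, the `penalty_coef` hyperparameter, and the exact penalty form (subtractive rather than the paper's weighting scheme) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of an ensemble-uncertainty-penalized Q-target,
# assuming a discrete (item) action space.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # Q-values for every action


def penalized_target(
    ensemble: list[QNetwork],
    reward: torch.Tensor,        # shape (batch,)
    next_state: torch.Tensor,    # shape (batch, state_dim)
    done: torch.Tensor,          # shape (batch,), 1.0 if terminal
    gamma: float = 0.99,
    penalty_coef: float = 1.0,   # assumed uncertainty-penalty weight
) -> torch.Tensor:
    """Bellman target penalized by ensemble disagreement on the next state."""
    with torch.no_grad():
        # Next-state Q-values from every ensemble member: (K, batch, actions)
        next_q = torch.stack([q(next_state) for q in ensemble], dim=0)
        # Greedy action chosen under the ensemble mean: (batch, 1)
        greedy = next_q.mean(dim=0).argmax(dim=-1, keepdim=True)
        # Per-member Q-value of the greedy action: (K, batch)
        idx = greedy.unsqueeze(0).expand(len(ensemble), -1, -1)
        chosen = next_q.gather(-1, idx).squeeze(-1)
        # Penalize unstable Q-values: mean minus scaled ensemble std deviation
        q_next = chosen.mean(dim=0) - penalty_coef * chosen.std(dim=0)
        return reward + gamma * (1.0 - done) * q_next
```

Each ensemble member is then regressed toward this shared, uncertainty-penalized target on logged transitions; actions on which the ensemble disagrees receive lower targets, which discourages the policy from drifting toward out-of-distribution items.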

Authors (7)
  1. Chengqian Gao (5 papers)
  2. Ke Xu (309 papers)
  3. Kuangqi Zhou (10 papers)
  4. Lanqing Li (21 papers)
  5. Xueqian Wang (99 papers)
  6. Bo Yuan (151 papers)
  7. Peilin Zhao (127 papers)
Citations (20)
