Efficient replay-buffer sampling strategies for ReVal
Investigate more efficient replay-buffer sampling strategies, such as prioritized experience replay, for ReVal, an off-policy value-based reinforcement learning framework for large language models, with the aim of improving experience reuse beyond the current uniform FIFO approach.
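To make the contrast concrete, the sketch below places uniform sampling and proportional prioritized sampling (in the style of Schaul et al., 2015) on the same FIFO buffer. It is only an illustration under stated assumptions, not ReVal's implementation: the class name `ReplayBuffer`, the hyperparameters `alpha`, `beta`, and `eps`, and the use of an absolute error as the priority signal are all hypothetical choices. What a suitable priority signal would be for LLM rollouts is part of the open question.

```python
import random
from collections import deque


class ReplayBuffer:
    """Illustrative FIFO buffer with uniform or proportional prioritized sampling."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.items = deque(maxlen=capacity)       # oldest entries are evicted first
        self.priorities = deque(maxlen=capacity)  # kept aligned with self.items
        self.alpha = alpha  # 0 recovers uniform sampling; larger values sharpen it
        self.eps = eps      # keeps every priority strictly positive

    def add(self, item):
        # New items receive the current maximum priority so they are likely
        # to be sampled soon after insertion.
        priority = max(self.priorities, default=1.0)
        self.items.append(item)
        self.priorities.append(priority)

    def sample_uniform(self, k):
        # Baseline: every stored item is equally likely (indices, with replacement).
        return random.choices(range(len(self.items)), k=k)

    def sample_prioritized(self, k, beta=0.4):
        # Proportional prioritization: P(i) = p_i^alpha / sum_j p_j^alpha.
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idx = random.choices(range(len(probs)), weights=probs, k=k)
        # Importance-sampling weights (N * P(i))^(-beta) correct the bias that
        # non-uniform sampling introduces; here they are normalized by the
        # largest weight in the sampled batch (an assumption of this sketch).
        n = len(probs)
        weights = [(n * probs[i]) ** -beta for i in idx]
        max_w = max(weights)
        return idx, [w / max_w for w in weights]

    def update_priorities(self, idx, errors):
        # Assumed priority signal: magnitude of a per-sample error term.
        for i, e in zip(idx, errors):
            self.priorities[i] = abs(e) + self.eps
```

Because the indices are positions in a FIFO queue, priorities should be updated before further `add` calls once the buffer is full; otherwise eviction shifts the positions. A production version would more likely use a sum-tree with stable slots so that sampling and updates cost O(log N) rather than O(N).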
References
We leave the exploration of more efficient sampling strategies, such as prioritized experience replay (Schaul et al., 2015), to future work.
— Off-Policy Value-Based Reinforcement Learning for Large Language Models
(2603.23355 - Wang et al., 24 Mar 2026) in Section 4.3, Replay Buffer for Off-Policy Learning