Efficient replay-buffer sampling strategies for ReVal

Investigate more efficient replay-buffer sampling strategies, such as prioritized experience replay, for the ReVal off-policy value-based reinforcement learning framework for large language models, with the goal of improving experience reuse beyond the current uniform FIFO approach.

Background

ReVal introduces a replay buffer to enable off-policy updates in value-based reinforcement learning for LLMs. The current implementation uses a first-in-first-out (FIFO) buffer with uniform sampling, giving each trajectory an expected reuse determined by the buffer size, batch size, and number of updates per iteration.
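The design described above can be sketched as follows. This is an illustrative reconstruction, not ReVal's actual implementation; the class name, trajectory representation, and hyperparameters are assumptions.

```python
import random
from collections import deque

class UniformFIFOBuffer:
    """Sketch of a FIFO replay buffer with uniform sampling, as described
    in the text. Trajectories are evicted oldest-first once the buffer
    reaches capacity."""

    def __init__(self, capacity):
        # deque with maxlen gives FIFO eviction automatically.
        self.data = deque(maxlen=capacity)

    def add(self, trajectory):
        self.data.append(trajectory)

    def sample(self, batch_size):
        # Uniform sampling (with replacement) over all stored trajectories.
        return random.choices(list(self.data), k=batch_size)

# Rough expected reuse per trajectory: each update draws batch_size of
# capacity stored trajectories, and a trajectory survives about
# capacity / new_trajectories_per_iteration iterations, so
# expected samples ≈ updates_per_iter * batch_size / new_trajectories_per_iter.
```

The deque-based FIFO and the back-of-envelope reuse estimate in the closing comment mirror the dependence on buffer size, batch size, and updates per iteration noted in the text.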

While this design already provides off-policy reuse, the authors explicitly note that more efficient sampling strategies could further improve training efficiency. They specifically mention prioritized experience replay as a promising direction, but do not develop or evaluate such methods in the paper, leaving this as future work.

References

We leave the exploration of more efficient sampling strategies, such as prioritized experience replay (Schaul et al., 2015), to future work.

Off-Policy Value-Based Reinforcement Learning for Large Language Models (2603.23355 - Wang et al., 24 Mar 2026), Section 4.3, Replay Buffer for Off-Policy Learning