Semi-Offline Reinforcement Learning for Optimized Text Generation
Abstract: In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline. Online methods explore the environment at a significant time cost, while offline methods obtain reward signals efficiently but sacrifice exploration capability. We propose semi-offline RL, a novel paradigm that smoothly transitions from the offline to the online setting, balances exploration capability against training cost, and provides a theoretical foundation for comparing different RL settings. Based on the semi-offline formulation, we identify the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound. Extensive experiments show that our semi-offline approach is efficient and yields performance comparable to, and often better than, state-of-the-art methods.
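To make the paradigm concrete, the sketch below illustrates one way a rollout can interpolate between the two settings: a per-token Bernoulli mask decides whether each position reuses the ground-truth token from the static dataset (offline) or a token sampled from the model (online-style exploration). This is a minimal sketch under our own assumptions, not the paper's implementation; the names `semi_offline_sequence` and `p_online` are hypothetical.

```python
import torch

def semi_offline_sequence(logits, gt_ids, p_online):
    """Mix ground-truth and model-sampled tokens position by position.

    logits:   (T, V) per-position token logits from one forward pass
    gt_ids:   (T,) ground-truth token ids from the offline dataset
    p_online: probability that a position explores (0 = fully offline,
              1 = close to fully online)
    """
    T, _ = logits.shape
    # Bernoulli mask: True at positions that explore with a sampled token.
    explore = torch.bernoulli(torch.full((T,), p_online)).bool()
    sampled = torch.distributions.Categorical(logits=logits).sample()
    mixed = torch.where(explore, sampled, gt_ids)
    # Log-probabilities of the tokens actually used, for a policy-gradient update.
    log_probs = torch.log_softmax(logits, dim=-1).gather(1, mixed.unsqueeze(1)).squeeze(1)
    return mixed, explore, log_probs

# Toy usage: 5 positions, vocabulary of 10 tokens.
torch.manual_seed(0)
logits = torch.randn(5, 10)          # stand-in for a language model's output
gt_ids = torch.randint(0, 10, (5,))  # stand-in for a reference sequence
mixed, explore, log_probs = semi_offline_sequence(logits, gt_ids, p_online=0.5)
reward = 1.0  # placeholder sequence-level reward, e.g. ROUGE against a reference
loss = -(reward * log_probs[explore]).sum()  # REINFORCE-style objective on explored positions
```

Varying `p_online` between 0 and 1 trades exploration for cost: ground-truth positions need no decoding, so the mixed sequence can be scored from a single forward pass, which is one way to realize the cost/exploration trade-off the abstract describes.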