Token-level Proximal Policy Optimization for Query Generation (2411.00722v1)
Abstract: Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recently, state-of-the-art query generation methods have leveraged LLMs for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries, particularly in inferring user intent from web search interaction history. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm and consists of a token-level reward model and a token-level proximal policy optimization module that address the sparse-reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO, we conducted experiments on both an open-source dataset and an industrial dataset collected from a globally used search engine. The experimental results demonstrate that TPPO significantly improves the performance of query generation for LLMs and outperforms existing competitors.
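The abstract describes the core idea: instead of a single sequence-level reward, a token-level reward model assigns a reward to every generated token, and PPO's clipped objective is then applied per token. The snippet below is a minimal sketch of that idea, not the authors' implementation; the function name, hyperparameters (clip_eps, gamma, lam), and the use of GAE for the token-level advantages are illustrative assumptions layered on top of standard PPO.

```python
# Minimal sketch of a clipped, token-level PPO update, assuming a token-level
# reward model has already assigned a reward r_t to each generated token.
# This is an illustration of the general technique, not the paper's code.
import torch
import torch.nn.functional as F


def token_level_ppo_loss(logprobs_new, logprobs_old, values, token_rewards,
                         clip_eps=0.2, gamma=1.0, lam=0.95):
    """All tensors have shape (seq_len,) for a single generated query."""
    seq_len = token_rewards.shape[0]

    # Generalized Advantage Estimation over the token sequence: dense
    # per-token rewards replace the single end-of-sequence reward used in
    # standard RLHF/RLAIF pipelines.
    advantages = torch.zeros(seq_len)
    last_gae = 0.0
    for t in reversed(range(seq_len)):
        next_value = values[t + 1] if t + 1 < seq_len else 0.0
        delta = token_rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages[t] = last_gae
    advantages = advantages.detach()
    returns = (advantages + values).detach()

    # Clipped surrogate objective, applied per token instead of per sequence.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # The critic is trained toward token-level returns.
    value_loss = F.mse_loss(values, returns)
    return policy_loss + 0.5 * value_loss


# Toy usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    T = 8
    loss = token_level_ppo_loss(
        logprobs_new=torch.randn(T, requires_grad=True),
        logprobs_old=torch.randn(T),
        values=torch.randn(T, requires_grad=True),
        token_rewards=torch.rand(T),
    )
    loss.backward()
    print(loss.item())
```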