Token-level Proximal Policy Optimization for Query Generation (2411.00722v1)

Published 1 Nov 2024 in cs.LG

Abstract: Query generation is a critical task for web search engines (e.g., Google, Bing) and recommendation systems. Recent state-of-the-art query generation methods leverage LLMs for their strong capabilities in context understanding and text generation. However, they still face challenges in generating high-quality queries, particularly in inferring user intent from users' web search interaction histories. In this paper, we propose Token-level Proximal Policy Optimization (TPPO), a novel approach designed to empower LLMs to perform better in query generation through fine-tuning. TPPO is based on the Reinforcement Learning from AI Feedback (RLAIF) paradigm and consists of a token-level reward model and a token-level proximal policy optimization module, which together address the sparse reward challenge in traditional RLAIF frameworks. To evaluate the effectiveness and robustness of TPPO, we conducted experiments on both an open-source dataset and an industrial dataset collected from a globally used search engine. The experimental results demonstrate that TPPO significantly improves the performance of query generation for LLMs and outperforms existing competitors.
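The abstract describes TPPO as pairing a token-level reward model with a token-level PPO update so that credit is assigned per generated token rather than as a single scalar reward at the end of the sequence. The paper's exact formulation is not reproduced on this page, so the following is only a minimal sketch of what a token-level PPO objective with dense per-token rewards could look like; the function name, tensor shapes, and hyperparameters (gamma, lam, clip_eps, vf_coef) are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: a clipped PPO surrogate applied independently at every generated
# token, driven by dense per-token rewards (e.g. from a token-level reward model).
# Names and defaults are assumptions for illustration, not the paper's code.
import torch


def token_level_ppo_loss(logprobs_new, logprobs_old, token_rewards, values,
                         mask, gamma=1.0, lam=0.95, clip_eps=0.2, vf_coef=0.5):
    """All tensors have shape (batch, seq_len); `mask` marks generated tokens."""
    with torch.no_grad():
        B, T = token_rewards.shape
        values_fixed = values.detach()
        advantages = torch.zeros_like(token_rewards)
        last_gae = torch.zeros(B)
        # Generalized advantage estimation over token positions.
        for t in reversed(range(T)):
            next_value = values_fixed[:, t + 1] if t + 1 < T else torch.zeros(B)
            delta = token_rewards[:, t] + gamma * next_value - values_fixed[:, t]
            last_gae = delta + gamma * lam * last_gae
            advantages[:, t] = last_gae
        returns = advantages + values_fixed

    # Importance ratio between the current policy and the policy that sampled the tokens.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()
    value_loss = (((values - returns) ** 2) * mask).sum() / mask.sum()
    return policy_loss + vf_coef * value_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    B, T = 2, 6
    loss = token_level_ppo_loss(
        logprobs_new=torch.randn(B, T, requires_grad=True),
        logprobs_old=torch.randn(B, T),
        token_rewards=torch.randn(B, T),  # stand-in for token-level reward model output
        values=torch.randn(B, T, requires_grad=True),
        mask=torch.ones(B, T),
    )
    loss.backward()
    print(loss.item())
```

The key difference from a standard sequence-level RLHF/RLAIF setup is that `token_rewards` is dense: every position receives its own reward, so the clipped surrogate and value targets are computed per token instead of propagating one terminal reward back through the whole query.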
