Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization (2402.15473v2)

Published 23 Feb 2024 in cs.CL and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a dominant strategy for aligning Language Models (LMs) with human values/goals. The key to the strategy is learning a reward model ($\varphi$) that reflects the latent reward model of humans. While this strategy has proven effective, the training methodology requires a large amount of human preference annotation (usually on the order of tens of thousands) to train $\varphi$. Such large-scale annotation is justifiable when it is a one-time effort and the reward model is universally applicable. However, human goals are subjective and task-dependent, requiring task-specific preference annotations, which can be impractical to collect. To address this challenge, we propose a novel approach to infuse domain knowledge into $\varphi$, which reduces the amount of preference annotation required ($21\times$), avoids the Alignment Tax, and provides some interpretability. We validate our approach on E-Commerce Opinion Summarization, with a significant reduction in dataset size (to just $940$ samples) while advancing the SOTA ($\sim4$ point ROUGE-L improvement, preferred by humans over the SOTA $68\%$ of the time). Our contributions include a novel Reward Modeling technique and two new datasets: PromptOpinSumm (supervised data for Opinion Summarization) and OpinPref (a gold-standard human preference dataset). The proposed methodology opens up avenues for efficient RLHF, making it more adaptable to applications with varying human values. We release the artifacts (Code: github.com/efficient-rlhf. PromptOpinSumm: hf.co/prompt-opin-summ. OpinPref: hf.co/opin-pref) under the MIT License.
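The abstract does not spell out the training objective, but reward models in RLHF are typically fit to pairwise human preference data with a Bradley-Terry style loss. The sketch below illustrates that standard formulation only, not the paper's domain-knowledge-infused variant; the model class, tensor shapes, and names (`RewardModel`, `bradley_terry_loss`) are illustrative assumptions, not taken from the released code.

```python
# Minimal sketch of standard pairwise reward-model training (Bradley-Terry loss).
# NOT the paper's domain-knowledge-infused method; all names here are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: bag-of-embeddings encoder plus a scalar head.
    In practice the encoder would be a pretrained LM backbone."""
    def __init__(self, vocab_size: int = 32000, dim: int = 128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> one scalar reward per sequence
        return self.head(self.embed(token_ids)).squeeze(-1)

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): pushes the preferred output's reward
    # above the rejected output's reward.
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = RewardModel()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Dummy token ids standing in for (prompt + chosen, prompt + rejected) pairs;
    # a real preference dataset such as OpinPref would supply these.
    chosen = torch.randint(0, 32000, (8, 64))
    rejected = torch.randint(0, 32000, (8, 64))

    for step in range(3):
        loss = bradley_terry_loss(model(chosen), model(rejected))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print(f"step {step}: loss = {loss.item():.4f}")
```

The paper's contribution is to reduce how many such preference pairs are needed by infusing domain knowledge into $\varphi$; the specific mechanism is described in the paper itself and is not reproduced in this sketch.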

Authors (11)
  1. Swaroop Nath
  2. Tejpalsingh Siledar
  3. Sankara Sri Raghava Ravindra Muddu
  4. Rupasai Rangaraju
  5. Harshad Khadilkar
  6. Pushpak Bhattacharyya
  7. Suman Banerjee
  8. Amey Patil
  9. Sudhanshu Shekhar Singh
  10. Muthusamy Chelliah
  11. Nikesh Garera