Reinforcement learning for question answering in the programming domain using public community scoring as human feedback (2401.10882v1)
Abstract: In this study, we investigate how Reinforcement Learning from Human Feedback (RLHF), driven by community scores from Stack Overflow, can improve the performance of GPT Neo 125M on Community Question Answering (CQA) in the programming domain. Two distinct reward-model training strategies are employed for fine-tuning with Proximal Policy Optimization (PPO). Notably, the resulting performance gains are comparable to those of the GPT Neo 2.7B parameter variant. Additionally, an auxiliary scoring mechanism is introduced, which exposes the limitations of conventional linguistic metrics for evaluating responses in the programming domain. Through careful analysis, the paper examines the divergence between traditional linguistic metrics and our human-preference-based reward model, underscoring the need for domain-specific evaluation methods. By elucidating the complexities of applying RLHF to programming CQA and emphasizing the importance of context-aware evaluation, this study contributes to the ongoing effort to refine LLMs through focused human feedback.
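The abstract does not spell out how Stack Overflow scores become a training signal, but the standard RLHF recipe it builds on is easy to sketch. Below is a minimal, illustrative PyTorch/transformers sketch of one plausible reward-model strategy: answers to the same question are paired, the answer with the higher community score is treated as "chosen", and the model is trained with the pairwise Bradley-Terry ranking loss common in RLHF work. Only the GPT Neo 125M backbone is taken from the paper; every other name, parameter, and helper here is a hypothetical illustration, not the authors' implementation.

```python
# Illustrative sketch only: pairwise reward-model training on Stack Overflow
# answer pairs, where community score determines the preferred answer.
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # backbone used in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT Neo has no pad token by default

# num_labels=1 turns the LM into a scalar reward head
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1
)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_loss(question, better_answer, worse_answer):
    """Bradley-Terry ranking loss: push r(chosen) above r(rejected)."""
    def score(answer):
        inputs = tokenizer(
            question, answer, truncation=True, max_length=512,
            return_tensors="pt",
        )
        return reward_model(**inputs).logits.squeeze(-1)

    r_chosen, r_rejected = score(better_answer), score(worse_answer)
    # -log sigmoid(r_chosen - r_rejected), as in standard RLHF reward modeling
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: the higher-scored answer plays the "chosen" role
loss = pairwise_loss(
    "How do I reverse a list in Python?",
    "Use lst[::-1] for a reversed copy, or lst.reverse() in place.",
    "Try sorting it.",
)
loss.backward()  # gradients for one reward-model optimization step
```

In a full pipeline, a reward model like this would then score generated answers inside a PPO loop with a KL penalty against the frozen base model; one common tool for that step is the trl library's PPOTrainer, though the paper's exact tooling is not specified in this abstract.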