TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning (2403.08694v4)
Abstract: The development of LLMs often confronts challenges stemming from the heavy reliance on human annotators in the reinforcement learning with human feedback (RLHF) framework, or the frequent and costly external queries tied to the self-instruct paradigm. In this work, we pivot to Reinforcement Learning (RL) -- but with a twist. Diverging from the typical RLHF, which refines LLMs following instruction data training, we use RL to directly generate the foundational instruction dataset that alone suffices for fine-tuning. Our method, TeaMs-RL, uses a suite of textual operations and rules, prioritizing the diversification of training datasets. It facilitates the generation of high-quality data without excessive reliance on external advanced models, paving the way for a single fine-tuning step and negating the need for subsequent RLHF stages. Our findings highlight key advantages of our approach: reduced need for human involvement and fewer model queries (only 5.73% of the strong baseline's total), enhanced capability of LLMs in crafting and comprehending complex instructions compared to strong baselines, and substantially improved model privacy protection. Code is available at https://github.com/SafeRL-Lab/TeaMs-RL
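To make the abstract's core idea concrete, the sketch below shows one plausible way an RL policy could choose among textual operations to diversify a pool of seed instructions. This is a minimal illustration, not the paper's implementation: the operation names, the `apply_operation` stub (which in practice would query an expert LLM with an operation-specific prompt), the word-overlap diversity reward, and the epsilon-greedy bandit policy are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Illustrative suite of textual operations (assumed; the paper's exact
# operation set may differ). Each rewrites an instruction to make it
# more complex or more diverse.
TEXT_OPERATIONS = ["add_constraints", "deepen_reasoning", "concretize", "broaden_topic"]

def apply_operation(instruction: str, op: str) -> str:
    """Stub: a real system would prompt an expert LLM to rewrite the
    instruction according to `op`; here we only tag it."""
    return f"[{op}] {instruction}"

def diversity_reward(new_instruction: str, dataset: list[str]) -> float:
    """Stub reward: fraction of words in the new instruction not yet seen
    anywhere in the dataset (a crude proxy for added diversity)."""
    seen = {w for ins in dataset for w in ins.split()}
    words = new_instruction.split()
    return sum(w not in seen for w in words) / max(len(words), 1)

def generate_dataset(seeds, steps=100, epsilon=0.1):
    """Epsilon-greedy bandit over textual operations: a very simple
    stand-in for an RL policy deciding how to expand the instruction set."""
    q = defaultdict(float)   # running value estimate per operation
    counts = defaultdict(int)
    dataset = list(seeds)
    for _ in range(steps):
        seed = random.choice(dataset)
        if random.random() < epsilon:
            op = random.choice(TEXT_OPERATIONS)          # explore
        else:
            op = max(TEXT_OPERATIONS, key=lambda o: q[o])  # exploit
        new_ins = apply_operation(seed, op)
        r = diversity_reward(new_ins, dataset)
        counts[op] += 1
        q[op] += (r - q[op]) / counts[op]  # incremental mean update
        dataset.append(new_ins)
    return dataset

if __name__ == "__main__":
    for ins in generate_dataset(["Explain how gradient descent works."], steps=20)[:5]:
        print(ins)
```

Under the abstract's framing, the resulting instruction dataset would then be paired with expert responses and used for a single supervised fine-tuning pass, with no subsequent RLHF stage.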