Exploration with Principles for Diverse AI Supervision (2310.08899v2)

Published 13 Oct 2023 in cs.CL

Abstract: Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI. While this generative AI approach has produced impressive results, it leans heavily on human supervision. Even state-of-the-art AI models like ChatGPT depend on fine-tuning through human demonstrations, demanding extensive human input and domain expertise. This strong reliance on human oversight poses a significant hurdle to AI innovation. To address this limitation, we propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data. Drawing inspiration from unsupervised reinforcement learning (RL) pretraining, EAI achieves exploration within the natural language space. We accomplish this by harnessing LLMs to assess the novelty of generated content. Our approach employs two key components: an actor that generates novel content following exploration principles and a critic that evaluates the generated content, offering critiques to guide the actor. Empirical evaluations demonstrate that EAI significantly boosts model performance on complex reasoning tasks, addressing the limitations of human-intensive supervision.
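The abstract describes an actor-critic loop in which an LLM actor proposes novel content under a set of exploration principles and an LLM critic judges the output and returns critiques that steer the next generation. The sketch below is a minimal illustration of that loop, not the authors' implementation: the prompt templates, the `query_llm` callable, and the accept/revise protocol are assumptions introduced here for clarity.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's actual exploration principles
# and prompt wording are not reproduced here.
ACTOR_PROMPT = (
    "Follow these exploration principles and write a new training example "
    "that is novel relative to the existing ones.\n"
    "Principles:\n{principles}\n\nExisting examples:\n{examples}\n"
)
CRITIC_PROMPT = (
    "Assess whether the candidate below is novel and correct relative to "
    "the existing examples. Reply 'ACCEPT' or 'REVISE: <critique>'.\n\n"
    "Existing examples:\n{examples}\n\nCandidate:\n{candidate}\n"
)

def explore(query_llm: Callable[[str], str],
            principles: str,
            seed_examples: list[str],
            rounds: int = 10) -> list[str]:
    """Grow a pool of self-generated training data with an actor-critic loop.

    `query_llm` is any text-in/text-out LLM call supplied by the caller.
    """
    pool = list(seed_examples)
    for _ in range(rounds):
        context = "\n".join(pool[-5:])  # recent examples the actor should diverge from
        candidate = query_llm(ACTOR_PROMPT.format(principles=principles,
                                                  examples=context))
        verdict = query_llm(CRITIC_PROMPT.format(examples=context,
                                                 candidate=candidate))
        if not verdict.strip().upper().startswith("ACCEPT"):
            # Feed the critic's critique back to the actor for one revision.
            candidate = query_llm(ACTOR_PROMPT.format(
                principles=principles + "\nCritique to address: " + verdict,
                examples=context))
        pool.append(candidate)
    return pool
```

Under this reading, the accepted pool would then serve as fine-tuning data for the model being improved, which is the role the abstract assigns to EAI's autonomously generated training data.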
