Efficient Model-agnostic Alignment via Bayesian Persuasion (2405.18718v1)
Abstract: With recent advancements in LLMs, alignment has emerged as an effective technique for keeping LLMs consistent with human intent. Current methods primarily rely on direct training through Supervised Fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF), both of which require substantial computational resources and extensive ground-truth data. This paper explores an efficient method for aligning black-box large models using smaller models, introducing a model-agnostic and lightweight Bayesian Persuasion Alignment framework. We formalize the problem as optimizing the signaling strategy from the small model's perspective. In the persuasion process, the small model (Advisor) observes the information item (i.e., state) and persuades the large model (Receiver) to elicit an improved response. The Receiver then generates a response based on the input, the signal from the Advisor, and its updated belief about the information item. Through training with our framework, we demonstrate that the Advisor can significantly enhance the performance of various Receivers across a range of tasks. We theoretically analyze our persuasion framework and provide an upper bound on the Advisor's regret, confirming its effectiveness in learning the optimal signaling strategy. Our empirical results demonstrate that GPT-2 can significantly improve the performance of various models, achieving an average improvement of 16.1% in mathematical reasoning ability and 13.7% in code generation. We hope our work provides an initial step toward rethinking the alignment framework from a Bayesian persuasion perspective.
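To make the Advisor/Receiver interaction concrete, below is a minimal, self-contained sketch of the persuasion loop the abstract describes. Everything here (the toy models, the signal space, the word-overlap scoring rule) is a hypothetical stand-in, not the paper's implementation: the Advisor observes the state and selects the signal its strategy favors, and the Receiver conditions its response on the task input plus that signal.

```python
# A minimal sketch of the Advisor/Receiver persuasion protocol described in
# the abstract. The toy models and signal space below are hypothetical
# stand-ins, not the paper's code or training procedure.

class ToyAdvisor:
    """Stands in for the small model (e.g., GPT-2) that observes the
    information item (state) and emits a signal."""

    def score(self, state: str, signal: str) -> float:
        # Placeholder signaling strategy: prefer signals that share
        # vocabulary with the observed state. In the paper this strategy
        # is learned, with a regret bound on the Advisor's optimization.
        return len(set(state.split()) & set(signal.split()))


class ToyReceiver:
    """Stands in for the black-box large model, which conditions its
    response on the task input plus the Advisor's signal (here, by
    simple prompt concatenation)."""

    def generate(self, prompt: str) -> str:
        return f"[response conditioned on: {prompt!r}]"


def persuade(advisor: ToyAdvisor, receiver: ToyReceiver,
             state: str, user_input: str, signal_space: list[str]) -> str:
    # 1. The Advisor observes the state and picks the signal its
    #    strategy scores highest.
    signal = max(signal_space, key=lambda s: advisor.score(state, s))
    # 2. The Receiver answers the input under its belief as updated
    #    by the signal.
    return receiver.generate(f"{signal}\n\n{user_input}")


if __name__ == "__main__":
    signals = ["focus on the algebra steps", "focus on edge cases in the code"]
    print(persuade(ToyAdvisor(), ToyReceiver(),
                   state="a word problem with algebra steps",
                   user_input="Solve: 2x + 3 = 11",
                   signal_space=signals))
```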