A transfer learning framework for weak-to-strong generalization (2405.16236v2)
Abstract: Modern LLM alignment techniques rely on human feedback, but it is unclear whether these techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unknown whether it is possible to align (stronger) LLMs with superhuman capabilities using (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using feedback from a weaker (less capable) model to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. Specifically, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept prior from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an alternative refinement-based approach suggested by the problem structure provably overcomes the limitations of fine-tuning. Finally, we demonstrate the practical applicability of the refinement approach in multiple LLM alignment tasks.
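To give intuition for the contrast the abstract draws, here is a minimal, hypothetical sketch of the two strategies: naive fine-tuning of the strong model on weak labels versus prompting the pre-trained strong model to refine the weak model's draft. The function names, stubs, and refinement prompt are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): naive weak-label fine-tuning
# vs. a refinement-style pipeline where the strong model improves the weak draft.
from typing import Callable, List, Tuple


def naive_weak_to_strong(prompts: List[str],
                         weak_model: Callable[[str], str],
                         finetune: Callable[[List[Tuple[str, str]]], None]) -> None:
    """Fine-tune the strong model to imitate weak labels.

    The strong model is trained directly on weak_model's outputs, so it can
    inherit the weak model's mistakes (the limitation the abstract alludes to).
    """
    weak_labels = [(p, weak_model(p)) for p in prompts]
    finetune(weak_labels)


def refinement_weak_to_strong(prompts: List[str],
                              weak_model: Callable[[str], str],
                              strong_model: Callable[[str], str]) -> List[str]:
    """Ask the pre-trained strong model to refine the weak draft rather than copy it,
    aiming to elicit its latent knowledge instead of capping it at the weak level."""
    outputs = []
    for p in prompts:
        draft = weak_model(p)
        refine_prompt = (f"Task: {p}\nDraft answer: {draft}\n"
                         "Improve the draft answer, correcting any errors:")
        outputs.append(strong_model(refine_prompt))
    return outputs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    weak = lambda p: f"(weak draft for: {p})"
    strong = lambda p: f"(strong completion of: {p[:40]}...)"
    print(refinement_weak_to_strong(["Summarize the findings."], weak, strong))
```

The design difference is in where the weak signal enters: the naive route uses it as a training target, while the refinement route treats it as an input that the stronger pre-trained model is asked to improve upon.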
Authors: Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Ya'acov Ritov, Mikhail Yurochkin, Yuekai Sun