
Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts (2402.15505v1)

Published 23 Feb 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Steering the behavior of a strong model pre-trained on internet-scale data can be difficult due to the scarcity of competent supervisors. Recent studies reveal that, despite supervisory noises, a strong student model may surpass its weak teacher when fine-tuned on specific objectives. Yet, the effectiveness of such weak-to-strong generalization remains limited, especially in the presence of large capability gaps. In this paper, we propose to address this challenge by harnessing a diverse set of specialized teachers, instead of a single generalist one, that collectively supervises the strong student. Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision: (i) we progressively alternate student training and teacher assignment, leveraging the growth of the strong student to identify plausible supervisions; (ii) we conservatively enforce teacher-student and local-global consistency, leveraging their dependencies to reject potential annotation noises. We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets. Our code is available at \url{https://github.com/yuejiangliu/csl}.
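
Below is a minimal sketch of the co-supervision loop sketched in the abstract: a strong student is trained on pseudo-labels from several specialized weak teachers, where teacher assignment is progressively refined using the student's own predictions and labels are conservatively filtered by teacher-student and local-global agreement. All class names, the toy data, and the thresholds are illustrative assumptions, not the authors' released implementation (see https://github.com/yuejiangliu/csl for the official code).

```python
# Illustrative sketch of alternating teacher assignment and student training
# with conservative consistency filtering. Models and data are toy stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

NUM_CLASSES, FEAT_DIM, NUM_TEACHERS = 10, 32, 3

# Weak, specialized teachers (e.g. one per domain) and a single strong student.
teachers = [nn.Linear(FEAT_DIM, NUM_CLASSES) for _ in range(NUM_TEACHERS)]
student = nn.Sequential(nn.Linear(FEAT_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

# Toy unlabeled pool standing in for the data the weak teachers annotate.
unlabeled_x = torch.randn(256, FEAT_DIM)


def assign_teachers(x, student, teachers):
    """Route each sample to the teacher whose predicted label the current student
    agrees with most -- a simple stand-in for progressive teacher assignment."""
    with torch.no_grad():
        s_prob = F.softmax(student(x), dim=-1)                      # (N, C)
        t_prob = torch.stack(
            [F.softmax(t(x), dim=-1) for t in teachers], dim=1)     # (N, T, C)
        t_label = t_prob.argmax(dim=-1)                             # (N, T)
        # Agreement = probability the student assigns to each teacher's label.
        agreement = torch.gather(
            s_prob.unsqueeze(1).expand_as(t_prob), 2, t_label.unsqueeze(-1)
        ).squeeze(-1)                                               # (N, T)
        best_teacher = agreement.argmax(dim=1)                      # (N,)
        pseudo_label = t_label[torch.arange(len(x)), best_teacher]  # (N,)
    return best_teacher, pseudo_label, t_prob


def consistency_mask(x, student, pseudo_label, t_prob, margin=0.1):
    """Conservatively keep only samples where (i) the student agrees with the
    assigned teacher's label and (ii) the teachers' averaged ("global")
    prediction also agrees -- a rough analogue of the teacher-student and
    local-global consistency checks. The confidence margin is illustrative."""
    with torch.no_grad():
        s_prob = F.softmax(student(x), dim=-1)
        s_label = s_prob.argmax(dim=-1)
        g_label = t_prob.mean(dim=1).argmax(dim=-1)
        conf = s_prob.max(dim=-1).values
    return (s_label == pseudo_label) & (g_label == pseudo_label) & (conf > margin)


# Alternate between (re-)assigning teachers and fine-tuning the student.
for round_idx in range(5):
    best_teacher, pseudo_label, t_prob = assign_teachers(unlabeled_x, student, teachers)
    keep = consistency_mask(unlabeled_x, student, pseudo_label, t_prob)
    if keep.sum() == 0:
        continue  # nothing passed the conservative filter this round
    logits = student(unlabeled_x[keep])
    loss = F.cross_entropy(logits, pseudo_label[keep])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"round {round_idx}: kept {int(keep.sum())} samples, loss {loss.item():.3f}")
```

In the paper the teachers are weak models specialized to different domains (e.g. subsets of the OpenAI weak-to-strong benchmark), whereas here they are random linear probes used only to show the control flow.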

Authors (2)
  1. Yuejiang Liu (14 papers)
  2. Alexandre Alahi (100 papers)
Citations (12)
