Theoretical Analysis of Weak-to-Strong Generalization (2405.16043v1)
Abstract: Strong student models can learn from weaker teachers: when trained on the predictions of a weaker model, a strong pretrained student can learn to correct the weak model's errors and generalize to examples where the teacher is not confident, even when these examples are excluded from training. This enables learning from cheap, incomplete, and possibly incorrect label information, such as coarse logical rules or the generations of an LLM. We show that existing weak supervision theory fails to account for both of these effects, which we call pseudolabel correction and coverage expansion, respectively. We give a new bound based on expansion properties of the data distribution and student hypothesis class that directly accounts for pseudolabel correction and coverage expansion. Our bounds capture the intuition that weak-to-strong generalization occurs when the strong model is unable to fit the mistakes of the weak teacher without incurring additional error. We show that these expansion properties can be checked from finite data and give empirical evidence that they hold in practice.
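The abstract states that the expansion properties underlying the bound can be checked from finite data. As a rough illustration of what such a check might look like (this is not the paper's estimator; the k-nearest-neighbor graph construction, the function name `knn_expansion`, and the toy data are all assumptions made for this sketch), one can approximate the data distribution with a k-NN graph over embeddings and measure how much a candidate bad set, such as the points the weak teacher mislabels or leaves uncovered, expands into its complement:

```python
# Minimal sketch, assuming expansion is estimated on a k-NN graph over
# embeddings. The paper's actual definitions and estimator may differ.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_expansion(embeddings: np.ndarray, in_set: np.ndarray, k: int = 10) -> float:
    """Estimate the edge expansion of S = {i : in_set[i]} in the k-NN graph:
    the fraction of S's outgoing k-NN edges whose endpoint lies outside S."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nbrs.kneighbors(embeddings)   # column 0 is each point itself
    neighbor_ids = idx[in_set, 1:]         # k neighbors of every point in S
    crossing = ~in_set[neighbor_ids]       # edges from S that leave S
    return float(crossing.mean())

# Toy usage: two Gaussian clusters; mark a random 20% of the first cluster
# as a hypothetical "teacher mistake set" S and estimate its expansion.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (200, 8)), rng.normal(4.0, 1.0, (200, 8))])
S = np.zeros(len(X), dtype=bool)
S[rng.choice(200, size=40, replace=False)] = True
print(f"estimated expansion of S: {knn_expansion(X, S):.2f}")
```

Under the intuition described in the abstract, a large expansion value for the teacher's mistake set means that set is well connected to correctly labeled points, so a student cannot fit those mistakes without also incurring error on their neighbors.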
Authors: Hunter Lang, David Sontag, Aravindan Vijayaraghavan