Fusing Models with Complementary Expertise (2310.01542v2)
Abstract: Training AI models that generalize across tasks and domains has long been among the open problems driving AI research. The emergence of Foundation Models made it easier to obtain expert models for a given task, but the heterogeneity of data that may be encountered at test time often means that any single expert is insufficient. We consider the Fusion of Experts (FoE) problem of fusing outputs of expert models with complementary knowledge of the data distribution and formulate it as an instance of supervised learning. Our method is applicable to both discriminative and generative tasks and leads to significant performance improvements in image and text classification, text summarization, multiple-choice QA, and automatic evaluation of generated text. We also extend our method to the "frugal" setting where it is desired to reduce the number of expert model evaluations at test time. Our implementation is publicly available at https://github.com/hwang595/FoE-ICLR2024.
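The core idea of the abstract, treating fusion of expert outputs as a supervised learning problem, can be illustrated with a minimal sketch. The snippet below assumes a classification setting in which each expert emits class probabilities and a small fuser network is trained on held-out labeled data to map the concatenated expert outputs to a final prediction; the architecture, names, and training loop here are illustrative assumptions, not the paper's exact method.

```python
# Minimal sketch of supervised fusion of expert outputs (illustrative only;
# the fuser architecture and training details are assumptions, not the
# paper's exact FoE method). Each expert emits class probabilities; a small
# MLP is trained on the concatenated expert outputs to predict the label.
import torch
import torch.nn as nn


class FuserMLP(nn.Module):
    def __init__(self, num_experts: int, num_classes: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_experts * num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, expert_probs: torch.Tensor) -> torch.Tensor:
        # expert_probs: (batch, num_experts, num_classes)
        return self.net(expert_probs.flatten(start_dim=1))


def train_fuser(expert_probs, labels, num_classes, epochs=20, lr=1e-3):
    """Supervised training of the fuser on held-out labeled data."""
    fuser = FuserMLP(expert_probs.shape[1], num_classes)
    opt = torch.optim.Adam(fuser.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(fuser(expert_probs), labels)
        loss.backward()
        opt.step()
    return fuser


if __name__ == "__main__":
    # Toy example: 3 experts, 5 classes, 200 held-out samples.
    torch.manual_seed(0)
    probs = torch.softmax(torch.randn(200, 3, 5), dim=-1)
    labels = torch.randint(0, 5, (200,))
    fuser = train_fuser(probs, labels, num_classes=5)
    fused_pred = fuser(probs).argmax(dim=-1)  # fused predictions at test time
```

In this hypothetical setup the fuser plays the role of the supervised combiner described in the abstract; for generative tasks one would instead learn to select or rerank expert outputs, and in the "frugal" setting one would stop querying experts once the fuser is sufficiently confident.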