Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration (2404.12715v2)

Published 19 Apr 2024 in cs.CL

Abstract: LLMs exhibit complementary strengths in various tasks, motivating research on LLM ensembling. However, existing work focuses on training an extra reward model or fusion model to select or combine all candidate answers, which poses a great challenge to generalization on unseen data distributions. Moreover, prior methods use textual responses as the communication medium, ignoring the valuable information in internal representations. In this work, we propose a training-free ensemble framework, DeePEn, which fuses the probability distributions yielded by different LLMs at each decoding step. Unfortunately, the vocabulary discrepancy between heterogeneous LLMs makes directly averaging the distributions infeasible due to token misalignment. To address this challenge, DeePEn maps the probability distribution of each model from its own probability space to a universal relative space based on relative representation theory, and performs aggregation there. Next, we devise a search-based inverse transformation to map the aggregated result back to the probability space of one of the ensembled LLMs (the main model) in order to determine the next token. We conduct extensive experiments on ensembles of different numbers of LLMs, ensembles of LLMs with different architectures, and ensembles of an LLM and a specialist model. Experimental results show that (i) DeePEn achieves consistent improvements across six benchmarks covering subject examination, reasoning, and knowledge, (ii) a well-performing specialist model can benefit from a less effective LLM through distribution fusion, and (iii) DeePEn has complementary strengths with other ensemble methods such as voting.
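
The fusion step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration of the relative-space idea, not the authors' released code: the anchor-token selection, the helper names (`to_relative_space`, `fuse_and_decode`), and the nearest-representation decoding used in place of the paper's search-based inverse transformation are all assumptions made for illustration.

```python
# Illustrative sketch only: assumes each model exposes its next-token
# distribution and token-embedding matrix, and that a set of anchor tokens
# shared across vocabularies has already been chosen.
import numpy as np


def to_relative_space(probs, emb, anchor_ids):
    """Re-express a vocabulary-sized distribution in a shared relative space.

    Each anchor token (assumed to appear in every model's vocabulary) gives one
    coordinate: tokens are represented by cosine similarities to the anchors,
    and the distribution becomes the probability-weighted average of those
    relative representations, making heterogeneous vocabularies comparable.
    """
    anchors = emb[anchor_ids]
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    vocab = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    rel = vocab @ anchors.T          # (vocab_size, num_anchors)
    return probs @ rel               # (num_anchors,)


def fuse_and_decode(model_probs, model_embs, model_anchor_ids, main_idx=0, weights=None):
    """Average the relative-space representations of all models' next-token
    distributions and return the token id (in the main model's vocabulary)
    whose relative representation is closest to the aggregate."""
    n = len(model_probs)
    weights = weights if weights is not None else [1.0 / n] * n
    agg = sum(
        w * to_relative_space(p, e, a)
        for w, p, e, a in zip(weights, model_probs, model_embs, model_anchor_ids)
    )
    main_emb = model_embs[main_idx]
    main_anchors = main_emb[model_anchor_ids[main_idx]]
    main_anchors = main_anchors / np.linalg.norm(main_anchors, axis=1, keepdims=True)
    main_vocab = main_emb / np.linalg.norm(main_emb, axis=1, keepdims=True)
    rel_main = main_vocab @ main_anchors.T        # relative repr. of every main-model token
    return int(np.argmax(rel_main @ agg))         # next-token id for the main model
```

In the paper, the aggregate is mapped back by searching for a distribution in the main model's probability space whose relative representation matches the aggregated one; the nearest-token shortcut above only mimics that step for brevity.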

Authors (7)
  1. Yichong Huang (13 papers)
  2. Xiaocheng Feng (54 papers)
  3. Baohang Li (8 papers)
  4. Yang Xiang (187 papers)
  5. Hui Wang (371 papers)
  6. Bing Qin (186 papers)
  7. Ting Liu (329 papers)
Citations (11)