MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models (2402.01620v2)

Published 2 Feb 2024 in cs.CL

Abstract: Multi-agent interactions between LLM agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured distillation of the reasoning interactions between multiple LLMs into smaller LMs. MAGDi teaches smaller models by representing multi-agent interactions as graphs, augmenting a base student model with a graph encoder, and distilling knowledge using three objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective to model the interaction structure. Experiments on seven widely used commonsense and math reasoning benchmarks show that MAGDi improves the reasoning capabilities of smaller models, outperforming several methods that distill from a single teacher and multiple teachers. Moreover, MAGDi also demonstrates an order of magnitude higher efficiency over its teachers. We conduct extensive analyses to show that MAGDi (1) enhances the generalizability to out-of-domain tasks, (2) scales positively with the size and strength of the base student model, and (3) obtains larger improvements (via our multi-teacher training) when applying self-consistency -- an inference technique that relies on model diversity.

Introduction

Multi-agent interactions between LLMs have substantially improved performance on reasoning tasks. Yet these gains come at a high computational cost, since multiple model instances generate long outputs across several rounds of interaction. A further issue is that multi-agent frameworks do not consolidate the resulting reasoning skills into a single, efficient model for inference.

Multi-Agent Distillation

To address these challenges, this paper introduces Multi-Agent Interaction Graphs Distillation (MAGDi), a method for structured distillation of the reasoning interactions among multiple LLM teachers into smaller student LMs. MAGDi represents the multi-agent interactions as graphs, augments the base student model with a graph encoder, and distills knowledge using three tailored objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective that captures the interaction structure.
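To make the training recipe concrete, the sketch below combines the three objectives in PyTorch at toy scale. The student model, the one-layer graph encoder, and all names and hyperparameters (HIDDEN, the margin of 1.0, the weights alpha and beta) are illustrative assumptions, not the authors' implementation.

```python
# A minimal, illustrative sketch of combining MAGDi-style objectives.
# Everything here (toy student, hand-rolled graph encoder, hyperparameters)
# is an assumption for illustration, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 1000, 64

class ToyStudent(nn.Module):
    """Stand-in for a small LM: embeds tokens and predicts the next token."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, tokens):                        # tokens: (nodes, seq_len)
        h = self.embed(tokens)                        # (nodes, seq_len, HIDDEN)
        return h, self.lm_head(h)                     # hidden states, LM logits

class ToyGraphEncoder(nn.Module):
    """One round of mean-neighbour aggregation over the interaction graph,
    where each node is one agent response from one round of interaction."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2 * HIDDEN, HIDDEN)
        self.node_clf = nn.Linear(HIDDEN, 2)          # correct vs. incorrect node

    def forward(self, node_feats, adj):               # (N, HIDDEN), (N, N)
        neigh = adj @ node_feats / adj.sum(-1, keepdim=True).clamp(min=1)
        enc = torch.tanh(self.proj(torch.cat([node_feats, neigh], dim=-1)))
        return enc, self.node_clf(enc)

def magdi_style_loss(student, graph_enc, tokens, adj, labels, alpha=1.0, beta=1.0):
    """(1) next-token prediction + (2) contrastive correct-vs-incorrect margin
    + (3) graph-based node classification, weighted by alpha and beta."""
    h, logits = student(tokens)
    # (1) next-token prediction on the reasoning chains.
    ntp = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                          tokens[:, 1:].reshape(-1))
    # Pool each chain into a node feature, then encode the interaction graph.
    node_feats, node_logits = graph_enc(h.mean(dim=1), adj)
    scores = node_feats.sum(-1)                        # scalar score per node
    pos, neg = scores[labels == 1], scores[labels == 0]
    # (2) contrastive hinge: correct reasoning should score above incorrect.
    contrastive = F.relu(1.0 - pos.mean() + neg.mean())
    # (3) graph objective: predict node correctness from graph-encoded features.
    graph_obj = F.cross_entropy(node_logits, labels)
    return ntp + alpha * contrastive + beta * graph_obj

# Tiny usage example: 4 interaction nodes, 16 tokens each, a chain-shaped graph.
student, graph_enc = ToyStudent(), ToyGraphEncoder()
tokens = torch.randint(0, VOCAB, (4, 16))
adj = torch.eye(4) + torch.diag(torch.ones(3), diagonal=1)   # self + next-round edges
labels = torch.tensor([1, 0, 1, 1])                          # correct / incorrect reasoning
loss = magdi_style_loss(student, graph_enc, tokens, adj, labels)
loss.backward()
```

The structural idea the sketch tries to capture is that each graph node is one agent response labeled correct or incorrect, so the graph encoder lets the student learn from the shape of the interaction as well as from its text.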

Experimentation and Results

The efficacy of MAGDi has been rigorously tested on seven widely used commonsense and math reasoning benchmarks. The evaluations show that the method substantially improves smaller models' reasoning abilities while maintaining operational efficiency an order of magnitude better than the multi-agent teacher setups. For instance, MAGDi-distilled models reduce token generation by up to 9x at inference time while surpassing all single-teacher distillation baselines in performance.
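The order-of-magnitude efficiency gap is easy to see with a back-of-the-envelope calculation; the agent count, round count, and tokens per response below are assumed for illustration and are not the paper's measured numbers.

```python
# Rough token accounting: multi-agent inference vs. a single distilled student.
# All numbers are hypothetical; only the "up to 9x" figure comes from the paper.
k_agents, rounds, tokens_per_response = 3, 3, 300              # assumed setup
multi_agent_tokens = k_agents * rounds * tokens_per_response   # 3 * 3 * 300 = 2700
student_tokens = tokens_per_response                           # one student pass = 300
print(f"reduction: {multi_agent_tokens / student_tokens:.0f}x")  # reduction: 9x
```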

Scalability and Generalizability

Further analysis shows that MAGDi's benefits extend to generalizability and scalability across domains and model sizes. When used to train a single multi-task model, MAGDi performs comparably across multiple tasks simultaneously and remains competent on out-of-domain tasks. Moreover, the method scales positively with the size and strength of the base student model, indicating its continued applicability as foundation models evolve.

Diversity and Inference Techniques

MAGDi also benefits from model diversity, as demonstrated by its compatibility with self-consistency, an inference technique that relies on varied model outputs. Student models trained with MAGDi achieve notable performance gains when combined with such ensemble methods, suggesting that structured multi-teacher distillation imbues them with a richer response distribution.
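For reference, self-consistency simply samples several reasoning chains at non-zero temperature and majority-votes over their final answers. The sketch below assumes a hypothetical generate_answer callable standing in for sampling from a MAGDi-distilled student.

```python
# A minimal sketch of self-consistency decoding: sample several answers and
# return the majority vote. `generate_answer` is a hypothetical stand-in for
# sampling (temperature > 0) from a distilled student model.
from collections import Counter
from typing import Callable, List
import random

def self_consistency(generate_answer: Callable[[str], str], question: str,
                     num_samples: int = 5) -> str:
    """Sample num_samples answers and return the most frequent one."""
    answers: List[str] = [generate_answer(question) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a dummy sampler; a real call would decode from the student model.
dummy = lambda q: random.choice(["12", "12", "13"])   # stand-in for sampled answers
print(self_consistency(dummy, "What is 7 + 5?"))
```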

Conclusion

This paper presents MAGDi as a way to distill the reasoning prowess of multi-agent LLM interactions into smaller models without incurring prohibitive computational costs. The empirical results underscore the potential of structured distillation for building efficient and robust reasoning models that generalize across tasks and preserve the diversity needed for advanced inference techniques.

Authors (4)
  1. Justin Chih-Yao Chen
  2. Swarnadeep Saha
  3. Elias Stengel-Eskin
  4. Mohit Bansal