Mechanisms of Success in Multi-Agent Debate (MAD)

Determine the exact mechanisms and conditions under which Multi-Agent Debate (MAD) with large language models succeeds. In particular, establish whether performance gains arise primarily from scaling test-time compute or from emergent capabilities produced by specific combinations of agent personas, response generators, discussion paradigms, and decision protocols.

Background

The paper introduces MALLM, a modular framework for studying MAD systematically by varying agent personas, response generators, discussion paradigms, and decision protocols. The authors note that although MAD has shown promise, the reasons for its success are not yet clearly understood.

They highlight competing hypotheses: MAD may effectively scale test-time compute, or it may exhibit emergent capabilities through particular combinations of components. Clarifying these mechanisms requires controlled experimentation that isolates variables, which MALLM is designed to enable.
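As a rough illustration of what isolating variables could look like, the sketch below enumerates experiment configurations over the four component axes named above. It is not MALLM's API: the option names (`persona_a`, `generator_a`, and so on) and the helper functions are hypothetical placeholders. The full grid covers every combination, which is what one would need to look for interaction effects between components, while the one-factor variations change a single component relative to a baseline.

```python
from itertools import product

# Hypothetical placeholder options; these are NOT MALLM's actual option names.
COMPONENTS = {
    "persona": ["persona_a", "persona_b"],
    "response_generator": ["generator_a", "generator_b"],
    "discussion_paradigm": ["paradigm_a", "paradigm_b"],
    "decision_protocol": ["protocol_a", "protocol_b"],
}


def full_factorial(components):
    """Return every combination of component options as a list of dicts."""
    names = list(components)
    return [dict(zip(names, combo)) for combo in product(*components.values())]


def one_factor_variations(components, baseline):
    """Return configurations that differ from the baseline in exactly one component."""
    variations = []
    for name, options in components.items():
        for option in options:
            if option != baseline[name]:
                variations.append({**baseline, name: option})
    return variations


# Baseline: the first listed option on every axis (an arbitrary choice).
baseline = {name: options[0] for name, options in COMPONENTS.items()}
grid = full_factorial(COMPONENTS)
ablations = one_factor_variations(COMPONENTS, baseline)
```

With two options per axis, `grid` holds 16 configurations and `ablations` holds 4. Comparing the two sets is one simple way to separate the effect of a single component from the effect of a particular combination.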

References

"Yet, we have not understood the exact mechanisms of when and why MAD is successful. Different hypotheses exist around whether MAD is another way to scale test-time compute, or whether the combination of individual components has emergent capabilities."

— MALLM: Multi-Agent Large Language Models Framework (2509.11656 - Becker et al., 15 Sep 2025) in Section 1, Introduction
