Identify Improvement-Needed Components in LLM Multi-Agent Systems

Determine which specific components of an LLM-powered multi-agent system require improvement based on benchmark evaluation results; that is, pinpoint the parts of the system whose errors directly cause task failures and therefore warrant refinement.

Background

The paper argues that current practice relies on increasingly fine-grained benchmarks to evaluate LLM multi-agent systems, yet mapping benchmark outcomes to concrete system components needing improvement remains largely manual and labor-intensive. As systems grow in complexity, practitioners face greater difficulty in attributing failures to particular agents or steps, making targeted refinement challenging.

To bridge this gap, the authors propose automated failure attribution and introduce the Who&When dataset to support research on identifying failure-responsible agents and decisive error steps. This open question motivates their focus on automating the link between evaluation results and actionable system diagnostics.
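To make the task concrete, a minimal sketch of failure attribution is shown below: a conversation log is scanned step by step, and a judge flags the decisive error, yielding the responsible agent and the error step. All names here (`Step`, `attribute_failure`, `keyword_judge`) are illustrative assumptions, not the paper's implementation; in particular, the keyword-based judge is a toy stand-in for the LLM judge an actual attribution method would use.

```python
from dataclasses import dataclass

@dataclass
class Step:
    index: int    # position of this step in the conversation log
    agent: str    # name of the agent that produced this step
    message: str  # the agent's output at this step

def attribute_failure(log, judge):
    """Scan steps in order; the first step the judge flags as erroneous
    is treated as the decisive error. Returns (agent, step index), or
    (None, None) if no step is flagged."""
    for step in log:
        if judge(step):
            return step.agent, step.index
    return None, None

def keyword_judge(step):
    # Toy stand-in for an LLM judge: flags a step whose message admits
    # a hallucination. A real system would prompt an LLM with the full
    # task context and log instead.
    return "hallucinated" in step.message.lower()

log = [
    Step(0, "Planner", "Decompose the task into subtasks."),
    Step(1, "Coder", "Hallucinated a nonexistent API call."),
    Step(2, "Verifier", "Approved the code without testing."),
]
print(attribute_failure(log, keyword_judge))  # -> ('Coder', 1)
```

The step-by-step scan mirrors one natural attribution strategy; an alternative is to hand the judge the entire log at once and ask for the responsible agent and step directly, trading per-step precision for global context.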

References

With increasingly comprehensive benchmarks, a fundamental question remains unanswered: which components of the agentic system require improvement?

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems (2505.00212 - Zhang et al., 30 Apr 2025) in Section 1 (Introduction)