- The paper introduces an adaptive team-building paradigm using a Captain Agent to orchestrate dynamic multi-agent collaboration.
- It leverages retrieval-augmented generation and nested group conversations to assemble specialized teams tailored to evolving task requirements.
- Empirical evaluations reveal a 21.94% mean accuracy improvement over baseline methods, enhancing scalability, cost-effectiveness, and robustness.
Adaptive In-conversation Team Building for LLM Agents
Introduction
The paper "Adaptive In-conversation Team Building for LLM Agents" (2405.19425) addresses the challenge of constructing effective LLM-based multi-agent systems for complex task-solving. The authors critique the prevailing static team-building paradigm, which predefines agent teams before task execution, and propose an adaptive approach that dynamically assembles and manages agent teams during the problem-solving process. The core contribution is the Captain Agent, an adaptive builder agent that orchestrates team formation, nested group conversations, and reflection, enabling flexible, context-sensitive collaboration among LLM agents.
Adaptive Team-Building Paradigm
Motivation and Limitations of Static Teams
Static team-building, where all agents are selected prior to task execution, suffers from scalability and adaptability issues. As task complexity increases, static teams require a large number of agents to cover all possible expertise, leading to context length limitations, management overhead, and reduced conversational quality due to irrelevant or redundant agent participation. Static teams also lack the flexibility to respond to evolving task requirements or unforeseen challenges during execution.
Captain Agent: Architecture and Workflow
The Captain Agent implements an adaptive team-building paradigm with two principal components:
- Adaptive Multi-agent Team Building: For each subtask, Captain Agent identifies required roles, retrieves or generates suitable agents and tools, and assembles a specialized team. This process leverages retrieval-augmented generation (RAG) using sentence embeddings (e.g., all-mpnet-base-v2) for semantic matching between role descriptions and agent/tool profiles. If no suitable agent is found, a new agent is generated with a tailored system message, incorporating both general and task-specific instructions.
- Nested Group Conversation and Reflection: The assembled team engages in a group chat, managed by the AutoGen framework, to collaboratively solve the subtask. Tool usage is integrated via free-form code execution, with results fed back into the conversation. A reflector LLM reviews the conversation, flags contradictions or issues, and provides a reflection report. If inconsistencies are detected, Captain Agent initiates a verification process with a new or modified team.
This cyclical process—plan, build team, solve subtask, reflect, and adapt—continues until the overall task is completed.
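The cyclical process above can be sketched as a simple control loop. All helper functions below are simplified stand-ins written for illustration, not the paper's actual implementation:

```python
# Sketch of Captain Agent's plan -> build -> solve -> reflect cycle.
# Every helper here is a simplified placeholder for illustration only.

def plan_next_subtask(task, history):
    """Return the next unsolved subtask, or None when the task is done."""
    done = {subtask for subtask, _ in history}
    remaining = [s for s in task["subtasks"] if s not in done]
    return remaining[0] if remaining else None

def build_team(subtask):
    """Stand-in for retrieving or generating a specialized agent team."""
    return [f"{subtask}_expert", "code_executor"]

def run_group_chat(team, subtask):
    """Stand-in for the nested group conversation on one subtask."""
    return {"subtask": subtask, "team": team, "answer": f"result of {subtask}"}

def reflect(result):
    """Stand-in reflector: flag the result if verification is needed."""
    return {"needs_verification": False, "summary": result["answer"]}

def solve_task(task):
    history = []
    while (subtask := plan_next_subtask(task, history)) is not None:
        team = build_team(subtask)
        result = run_group_chat(team, subtask)
        report = reflect(result)
        if report["needs_verification"]:
            # Re-run the subtask with a new or modified team.
            result = run_group_chat(build_team(subtask), subtask)
        history.append((subtask, result))
    return history

history = solve_task({"subtasks": ["parse", "compute", "verify"]})
```

The loop terminates when the planner finds no remaining subtasks, mirroring the paper's completion criterion.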
Implementation Details
- Agent Library: Populated by running Captain Agent on a subset of problems, storing generated agents with detailed profiles. The library also includes hand-crafted agents from frameworks like AutoGen.
- Tool Library: Comprises callable Python functions for math, data analysis, and information retrieval, designed to match dataset patterns and enhance agent capabilities.
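A tool library of callable Python functions with textual profiles might be organized as below; the decorator-based registry is an illustrative design choice, not the paper's code:

```python
# Minimal sketch of a tool library: callable Python functions registered
# with textual profiles (here, their docstrings) that can later be
# embedded for retrieval. Illustrative design, not the paper's code.

TOOL_LIBRARY = {}

def register_tool(func):
    """Store a function under its name, using the docstring as its profile."""
    TOOL_LIBRARY[func.__name__] = {
        "callable": func,
        "profile": (func.__doc__ or "").strip(),
    }
    return func

@register_tool
def solve_quadratic(a, b, c):
    """Return the real roots of a*x**2 + b*x + c = 0, if any."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = disc ** 0.5
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})

@register_tool
def mean(values):
    """Compute the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)
```

Storing a natural-language profile alongside each callable is what makes the later embedding-based retrieval step possible.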
Retrieval and Selection
- Retrieval: For each role, top-k agents and tools are retrieved from the libraries based on cosine similarity of sentence embeddings.
- Selection: An LLM-based agent selector matches roles to agents, with an abstention mechanism to avoid forced, irrelevant assignments.
- Generation: For unmatched roles, new agents are generated with system messages combining role-specific, general, and group chat instructions.
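The retrieval and abstention steps can be sketched with plain cosine similarity. The toy 3-d vectors below stand in for real all-mpnet-base-v2 sentence embeddings, and the similarity threshold is an illustrative substitute for the LLM selector's judgment:

```python
import math

# Sketch of top-k retrieval by cosine similarity, with a score threshold
# standing in for the LLM selector's abstention mechanism. Toy vectors
# replace real all-mpnet-base-v2 sentence embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_top_k(role_vec, library, k=2):
    """Return the k library entries most similar to the role embedding."""
    scored = sorted(library.items(),
                    key=lambda item: cosine(role_vec, item[1]),
                    reverse=True)
    return scored[:k]

def select_agent(role_vec, library, threshold=0.8):
    """Pick the best match, abstaining (None) if nothing is similar enough."""
    name, vec = retrieve_top_k(role_vec, library, k=1)[0]
    return name if cosine(role_vec, vec) >= threshold else None

library = {
    "python_programmer": [0.9, 0.1, 0.0],
    "mathematician":     [0.1, 0.9, 0.0],
    "web_searcher":      [0.0, 0.1, 0.9],
}
```

When `select_agent` abstains, the system falls through to the generation path: a new agent is created rather than forcing an irrelevant assignment.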
Nested Conversation
- Group Chat Management: AutoGen manages turn-taking and context, with agents executing code and tool calls in a shared environment.
- Reflection: A reflector LLM summarizes the conversation, identifies contradictions, and determines if further verification is needed.
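The reflection step amounts to prompting a reflector LLM and parsing its verdict. The prompt wording and verdict format below are assumptions made for illustration, not the paper's actual prompts:

```python
# Sketch of the reflection step: build a prompt for a reflector LLM and
# parse its verdict. Prompt wording and the VERDICT line format are
# illustrative assumptions, not the paper's actual prompts.

REFLECTION_PROMPT = """Review the following group-chat transcript.
Summarize the conclusion, list any contradictions or unsupported claims,
and end with exactly one line: VERDICT: OK or VERDICT: NEEDS_VERIFICATION.

Transcript:
{transcript}
"""

def build_reflection_prompt(transcript):
    return REFLECTION_PROMPT.format(transcript=transcript)

def needs_verification(reflector_output):
    """Return True if the reflector's report flags the conversation."""
    for line in reversed(reflector_output.strip().splitlines()):
        if line.startswith("VERDICT:"):
            return "NEEDS_VERIFICATION" in line
    return True  # Malformed report: be conservative and re-verify.

report_ok = "The agents agree the answer is 42.\nVERDICT: OK"
report_bad = "Agent A and Agent B contradict each other.\nVERDICT: NEEDS_VERIFICATION"
```

A flagged report triggers the verification path: Captain Agent rebuilds or modifies the team and re-runs the subtask.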
Cost and Model Diversity
The approach incurs higher computational cost than single-agent systems due to increased context and agent participation. However, adaptive team-building reduces unnecessary agent involvement compared to static teams. The system supports both proprietary (e.g., GPT-4) and open-weight (e.g., LLaMA-3-70B) LLMs as agent backbones, enabling cost-performance trade-offs.
Empirical Evaluation
Benchmarks and Scenarios
Captain Agent is evaluated on six real-world scenarios: mathematics (MATH), programming (HumanEval), data analysis (DABench), scientific problem-solving in chemistry and in physics (two SciBench scenarios), and world information retrieval (GAIA). Each scenario is paired with a challenging open-source dataset.
Baselines
Comparisons include:
- Vanilla LLM (single prompt)
- AutoAgents (static multi-agent)
- Meta-prompting (meta-model task decomposition)
- AutoGen Assistant + Executor (two-agent system)
- Scenario-specific baselines for GAIA
All methods use the same task-specific prompts and backbone LLMs for fairness.
Results
- Captain Agent achieves a mean accuracy improvement of 21.94% over baselines across all scenarios.
- In world information retrieval (GAIA), Captain Agent outperforms all leaderboard baselines with minimal prompt engineering.
- Ablation studies show that adaptive team-building consistently outperforms static team-building, especially in scenarios requiring dynamic expertise composition.
- Both agent and tool libraries are critical for optimal performance; removing either significantly degrades results, particularly on complex, multi-step tasks.
- Open-weight models (e.g., LLaMA-3-70B) as agent backbones can approach or surpass the performance of some proprietary models at a fraction of the cost, though task preference and model selection remain important.
Analysis and Implications
Theoretical Implications
The adaptive team-building paradigm operationalizes principles from human organizational behavior—dynamic team assembly, role specialization, and iterative reflection—within LLM-based agent systems. This approach addresses the context length and specialization limitations of static teams, enabling more scalable and robust multi-agent collaboration.
Practical Implications
- Generalization: Captain Agent requires only basic task instructions, avoiding heavy prompt engineering and manual agent design.
- Scalability: Adaptive team-building reduces context bloat and irrelevant agent participation, improving efficiency and conversational quality.
- Cost-Effectiveness: The ability to leverage open-weight models and minimize unnecessary agent involvement enables practical deployment in resource-constrained settings.
- Robustness: The reflection mechanism and verification process mitigate hallucinations, factual errors, and stereotypical outputs.
Limitations and Future Directions
- Cost: Multi-agent conversations with large models remain expensive; further work on conversation pruning and context compression is warranted.
- Model Diversity: Task preference among LLMs affects nested chat quality; systematic evaluation and selection of agent backbones are needed.
- Evaluation: Data leakage and benchmark limitations complicate fair assessment of agent capabilities; more rigorous evaluation protocols are necessary.
Conclusion
The adaptive in-conversation team-building paradigm, instantiated by Captain Agent, demonstrates significant improvements in multi-agent LLM task-solving across diverse domains. By dynamically assembling specialized teams, integrating tool use, and employing iterative reflection, the approach overcomes key limitations of static team-building. The results highlight the importance of adaptability, modularity, and reflection in the design of LLM-based agent systems. Future research should address cost reduction, model diversity, and evaluation rigor to further advance the practical deployment of adaptive multi-agent frameworks.