
Agent-FLAN: Enhancing LLM Agent Performance

Updated 22 February 2026
  • Agent-FLAN is a comprehensive framework that boosts LLM agent performance through data realignment and capability-driven fine-tuning.
  • It addresses corpus distribution shifts and differential learning speeds by applying targeted sampling ratios to distinct agent capacities such as reasoning and understanding.
  • The framework incorporates negative sampling to reduce hallucinations and has demonstrated improved accuracy in both individual and federated learning benchmarks.

Agent-FLAN is a comprehensive framework for systematically enhancing the agentic capabilities of LLMs. It targets the persistent performance gap between open-source LLMs and proprietary API-based models in language-agent settings and operationalizes a rigorous approach to fine-tuning LLMs for robust agent use. Developed by Chen et al., Agent-FLAN is grounded in extensive empirical analysis of training corpora, capability learning dynamics, and hallucination phenomena, and delivers a principled methodology for corpus realignment, capability-driven data partitioning, negative-sample incorporation, and integrated objective design. Additionally, a variant of Agent-FLAN has been adopted in federated learning (FL) orchestration frameworks under the FedAgentBench suite, demonstrating its versatility in multi-agent, real-world distributed environments (Chen et al., 2024, Saha et al., 28 Sep 2025).

1. Foundational Observations: Challenges in Agent Tuning

Agent-FLAN is motivated by three fundamental observations about the limitations of current agent-tuning pipelines for LLMs (Chen et al., 2024):

1. Corpus Distribution Shift: Agent training datasets—such as those based on the ReAct template (“Thought: … Action: …”) or JSON tool-invocation scripts—impose structured formats that are distant from the dialogic distributions typical in LLM pretraining. Fine-tuning on such data results in rapid overfitting to template tokens while leaving content understanding and reasoning underdeveloped.

2. Differential Learning Speeds: Detailed token-level annotation by “capability” (Instruction Following, Reasoning, Retrieval, Understanding) reveals that LLMs assimilate formatting instructions nearly instantaneously, while reasoning requires significantly more gradient steps. Uniform data mixing is thus suboptimal; capabilities require individual sampling and progression schedules.

3. Hallucination Effects: Standard fine-tuning amplifies two forms of hallucination: “format hallucinations” (inadvertent template emission such as outputting “Action:” unprompted) and “action hallucinations” (spurious tool invocation or misuse). Conventional methods, including AgentTuning, may elevate success metrics but leave hallucination rates largely unchecked.

2. Agent-FLAN Corpus Realignment and Capability Decomposition

To address these issues, Agent-FLAN systematically decomposes and reconstitutes the agent training corpus, guided by empirical findings from error analysis and loss trajectories (Chen et al., 2024).

Chat-Aligned Reformulation: All agent interaction data are re-expressed as natural multi-turn dialogues, closely matching the LLM’s pre-training domain. For example, a ReAct-format triple is converted to user-assistant exchanges, reducing format overfitting and improving generalization. To preserve backward compatibility, approximately 10% of examples retain explicit template instructions.
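The reformulation above can be sketched as a small conversion routine. This is a hedged illustration, not the paper's actual preprocessing code: the record schema (`thought`, `action`, `observation` fields) and the exact turn wording are assumptions, while the ~10% template-retention rate comes from the text.

```python
# Illustrative sketch: re-expressing a ReAct trajectory as a natural
# multi-turn dialogue. Field names and phrasings are assumptions.
import random

def react_to_chat(task, steps, keep_template_prob=0.10):
    """Convert one ReAct record into user/assistant turns.
    ~10% of examples keep the explicit template for compatibility."""
    if random.random() < keep_template_prob:
        # Preserve the original "Thought/Action/Observation" template.
        body = "\n".join(
            f"Thought: {s['thought']}\nAction: {s['action']}\n"
            f"Observation: {s['observation']}" for s in steps)
        return [{"role": "user", "content": task},
                {"role": "assistant", "content": body}]
    turns = [{"role": "user", "content": task}]
    for s in steps:
        # Thought and action become one assistant turn; the tool's
        # observation is fed back as the following user turn.
        turns.append({"role": "assistant",
                      "content": f"{s['thought']} I'll call {s['action']}."})
        turns.append({"role": "user", "content": s["observation"]})
    return turns
```

Because the dialogue form matches the pretraining distribution, template tokens no longer dominate the loss signal.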

Capability Partitioning and Sampling: Each training instance is labeled according to four core agent capacities:

  • Instruction Following (Inst)
  • Reasoning (Reas)
  • Retrieval (Ret)
  • Understanding (Und)

Let $D = D_{\mathrm{Inst}} \cup D_{\mathrm{Reas}} \cup D_{\mathrm{Ret}} \cup D_{\mathrm{Und}}$ denote these capability partitions. Empirical ablations show that reducing Reasoning data has a disproportionately negative impact on agent performance (e.g., a 1.1-point T-Eval score drop when the partition is halved). Sampling rates are tuned according to learning speed, $\lambda_{\mathrm{Reas}} : \lambda_{\mathrm{Und}} : \lambda_{\mathrm{Ret}} : \lambda_{\mathrm{Inst}} = 1 : 0.75 : 0.25 : 0.1$, prominently favoring Reasoning and Understanding.

| Capability | Description | Relative Sampling Ratio |
|---|---|---|
| Reasoning | Justifies actions/answers | 1 |
| Understanding | Constructs valid arguments | 0.75 |
| Retrieval | Selects correct tools | 0.25 |
| Inst. Follow | Enforces format compliance | 0.1 |
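A capability-weighted sampler following these ratios can be sketched as below. The ratios are taken from the text; the partition contents and the use of `random.choices` are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch: drawing training batches whose expected
# composition follows the Reas:Und:Ret:Inst = 1:0.75:0.25:0.1 ratios.
import random

RATIOS = {"Reas": 1.0, "Und": 0.75, "Ret": 0.25, "Inst": 0.1}

def sample_batch(partitions, batch_size, rng=random):
    """partitions: dict mapping capability name -> list of examples."""
    caps = list(RATIOS)
    weights = [RATIOS[c] for c in caps]
    batch = []
    for _ in range(batch_size):
        # First pick a capability in proportion to its ratio,
        # then pick a uniform example from that partition.
        cap = rng.choices(caps, weights=weights, k=1)[0]
        batch.append(rng.choice(partitions[cap]))
    return batch
```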

3. Negative Sample Synthesis for Hallucination Suppression

Agent-FLAN introduces negative samples to explicitly train the model not to hallucinate agent actions. Negative cases span four situations, defined by crossing tool availability with whether the user requests an action: (a) tools provided and requested, (b) tools requested but not provided, (c) tools provided but not requested, and (d) neither provided nor requested. Standard agent corpora cover only the extremes (a) and (d); Agent-FLAN synthesizes (b) and (c) by:

  • Omitting tool specs when the user requests tools (the model must resist spurious action calls).
  • Providing irrelevant tool specs for generic queries (the model must ignore unnecessary API invocation).
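The two synthesized categories can be sketched as record constructors. The record schema and the target responses here are illustrative assumptions; the paper supplies its own curated responses.

```python
# Illustrative sketch of the two synthesized negative-sample
# categories. Schema and target wordings are assumptions.

def make_negative_samples(tool_query, generic_query, irrelevant_tools):
    # (b) User requests a tool action, but no tool specs are provided:
    #     the correct behaviour is to decline, not invent a call.
    no_tools = {
        "tools": [],
        "user": tool_query,
        "target": "I don't have access to a tool for that, "
                  "so I can't perform this action.",
    }
    # (c) Irrelevant tool specs accompany a generic query:
    #     the correct behaviour is to answer directly, invoking nothing.
    irrelevant = {
        "tools": irrelevant_tools,
        "user": generic_query,
        "target": "Here is a direct answer, without invoking any tool.",
    }
    return [no_tools, irrelevant]
```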

All samples are incorporated into the cross-entropy loss without explicit margin-based or contrastive objectives. The total fine-tuning objective combines positive ($D^+$) and negative ($D^-$) pools:

$$\mathcal{L}_{\mathrm{Agent\text{-}FLAN}} = \sum_{(x,y)\in D^{+}} -\log p_\theta(y \mid x) + \sum_{(x,y)\in D^{-}} -\log p_\theta(y \mid x)$$

This implicitly penalizes improper tool-use by lowering the probability assigned to hallucinated actions (Chen et al., 2024).
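Since negatives carry correct "refuse or answer directly" targets, both pools use plain token-level cross-entropy, matching the formula above. A minimal numeric sketch (scalar log-probabilities stand in for per-token sums):

```python
# Minimal sketch of the combined objective: standard negative
# log-likelihood summed over positive and negative pools alike.
import math

def agent_flan_loss(pos_logprobs, neg_logprobs):
    """pos/neg_logprobs: values of log p_theta(y|x) per example."""
    return (sum(-lp for lp in pos_logprobs)
            + sum(-lp for lp in neg_logprobs))
```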

4. Fine-Tuning Protocol, Model Scaling, and Generalization

All LLaMA-2 variants (7B, 13B, 70B) are fine-tuned with a unified recipe: a single epoch over the union of agent data and ShareGPT-style chat data (1:1 ratio), cosine learning-rate decay from $2\times10^{-5}$ with 10% linear warmup, and batch sizes of 32 or 128. Only the capability-partition weights are explicitly rebalanced to reflect empirical learning rates.
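The stated schedule (cosine decay from $2\times10^{-5}$ with 10% linear warmup) can be sketched as a step-to-rate function. The peak and warmup fraction come from the text; the total step count and decay-to-zero endpoint are assumptions.

```python
# Illustrative sketch of cosine learning-rate decay with 10% linear
# warmup, peaking at 2e-5. Decaying to zero is an assumption.
import math

def lr_at(step, total_steps, peak=2e-5, warmup_frac=0.10):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup from near zero to the peak rate.
        return peak * (step + 1) / warmup_steps
    # Cosine decay from the peak over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1 + math.cos(math.pi * progress))
```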

Scaling Laws:

  • Data Scaling: Subsampling the Agent-FLAN corpus (to 25%, 50%, 75%) reveals that most agentic gains accrue early, with diminishing returns for greater data quantities.
  • Model Scaling: Performance increases consistently from 7B to 13B to 70B parameter models on held-out agent tasks, with no observable saturation.

Impact on General Capabilities: Fine-tuning with Agent-FLAN slightly improves or at least does not degrade standard benchmarks (MMLU, GSM8K, HumanEval), indicating the chat-aligned, capability-balanced objective strengthens core reasoning and instruction-following skills (Chen et al., 2024).

5. Empirical Outcomes: Benchmarking and Hallucination Metrics

Agent-FLAN is evaluated on a comprehensive suite encompassing both held-in domains (AgentInstruct, ToolBench) and held-out datasets (HotpotQA, SciWorld, WebArena, T-Eval, Agent-H). Key quantitative results include:

  • Agent Performance: Llama2-7B+Agent-FLAN achieves 41.7% overall held-out accuracy versus 38.2% for AgentTuning—a 3.5-point improvement. On T-Eval, Agent-FLAN yields 66.0% versus 61.8%.
  • Hallucination Reduction: The Agent-H benchmark computes

    $$H_{\mathrm{Score}} = \frac{1}{2}\left[(1 - H_{\mathrm{ReAct}}) + (1 - H_{\mathrm{Gen}})\right]$$

    where $H_{\mathrm{ReAct}}$ and $H_{\mathrm{Gen}}$ measure the proportion of ReAct-format and general-format hallucinations, respectively. Agent-FLAN achieves $H_{\mathrm{Score}} = 89.1$ (vs. AgentTuning's 83.9), effectively cutting hallucination rates by approximately 50% (Chen et al., 2024).
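The score is a direct average of the two non-hallucination rates. A one-line sketch, assuming the reported values are on a 0–100 scale (consistent with the 89.1 figure):

```python
# Sketch of the Agent-H score: mean of (1 - hallucination rate) over
# ReAct-format and general-format evaluations, scaled to 0-100.

def h_score(h_react, h_gen):
    """h_react, h_gen: hallucination proportions in [0, 1]."""
    return 100 * 0.5 * ((1 - h_react) + (1 - h_gen))
```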

6. Real-World Agentic Orchestration: The FedAgentBench Framework

FedAgentBench demonstrates the applicability of Agent-FLAN in a large-scale, real-world multi-agent setting for healthcare federated learning (Saha et al., 28 Sep 2025).

Architecture: Seven LLM agents are partitioned into server ($S_1$–$S_4$) and client ($C_1$–$C_3$) roles, conducting phases including client selection, data preprocessing, label harmonization, FL algorithm selection (among 40 methods), and orchestrated federated training over 201 datasets spanning six medical imaging modalities. Coordination occurs exclusively through code/config objects; raw private data remain siloed.

Performance: Top proprietary LLMs (GPT-4.1, DeepSeek-V3) successfully complete most steps (up to 100% precision/recall in client selection), but label harmonization and multi-step chaining remain challenging. Open-source variants (LLaMA-4 Scout/Maverick, Qwen QwQ) perform comparably on simpler stages but falter in complex, interdependent tasks.

Common Failure Modes: Persistent pitfalls include erroneous domain grounding, neglected cleaning steps, overconfident label mappings, hallucinated commands, modality mismatches, and excessive deliberation without execution (Saha et al., 28 Sep 2025).

7. Limitations and Future Directions

Agent-FLAN’s current corpus is limited to seven agent domains and a reduced ToolBench subset (~20K examples), constraining coverage of real-world use cases such as customer support or collaborative work. The negative-sample strategy, while effective, has not been generalized to multi-agent or continual learning settings. Extensions could include leveraging larger and more diverse corpora, explicit margin-based hallucination controls, and co-optimization of instruction and tool-usage abilities. Integration with real-world orchestration frameworks (e.g., FedAgentBench) highlights system-level coordination as an ongoing research frontier (Chen et al., 2024, Saha et al., 28 Sep 2025).
