Agent Distillation
- Agent distillation is a method that transfers complex, multi-step agentic behaviors and reasoning trajectories from powerful teacher agents to smaller, efficient student agents.
- It employs trajectory-centric training with segmented loss functions and multi-agent strategies to retain decision quality while reducing inference costs.
- Empirical results show enhanced performance in tasks like VideoQA, domain-specific LLM applications, and multi-agent reinforcement learning, boosting accuracy and convergence speed.
Agent distillation is a family of frameworks and methodologies that transfer structured, agentic behaviors, planning routines, or collaborative reasoning strategies from one or more strong “teacher” agents into a smaller, more efficient “student” agent. Unlike classic knowledge distillation, which typically focuses on prediction alignment at the token or action level, agent distillation operates at the level of multi-step trajectories—encompassing interleaved reasoning, tool calls, action choices, and intermediate observations. These frameworks aim to compress agentic capabilities, reduce inference costs, retain or enhance generalization, and internalize procedural or collaborative solution strategies.
1. Key Principles and Definitions
Agent distillation extends knowledge distillation paradigms to settings where agentic decision-making and structured tool use are central. At its core, it is characterized by:
- Trajectory-centric distillation: Instead of one-step outputs, the student model is trained on entire sequences of actions and thoughts, often structured as Thought–Action–Observation tuples, “Reason–Act” spans, or interaction graphs (2505.13820, Kang et al., 23 May 2025, Chen et al., 2 Feb 2024); a concrete sketch follows this list.
- Segmented or structure-aware objectives: Distillation losses may be differentiated across reasoning, action, and observation tokens, e.g., Structured Agent Distillation applies separate KL-divergence penalties over [REASON] and [ACT] segments (2505.13820).
- Multi-agent to single-agent transfer: Some frameworks distill knowledge not just from a single agent, but from multi-agent systems (e.g., Chain-of-Agents, MAGDi), joint reasoning graphs, or collaborative sessions, into a compact, single-agent policy (Li et al., 6 Aug 2025, Chen et al., 2 Feb 2024).
- Tool-awareness and compositionality: Effective agent distillation often includes supervised imitation of tool invocation, code execution, retrieval, or explicit sub-task decompositions (Kang et al., 23 May 2025, Shi et al., 2 Dec 2024).
- Verification and filtering: High-quality distillation pipelines frequently filter or verify generated trajectories using correctness and coherence checks, as in AoTD (Shi et al., 2 Dec 2024) or KG-MASD (Pan et al., 3 Oct 2025).
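To make the trajectory-centric view concrete, here is a minimal Python sketch of a segmented trajectory record with per-segment token masks. The type and field names are illustrative inventions, not the data format of any cited framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

class SegmentKind(Enum):
    REASON = "reason"   # free-form thought tokens
    ACT = "act"         # tool calls / environment actions
    OBS = "obs"         # observations returned by tools or the environment

@dataclass
class Segment:
    kind: SegmentKind
    tokens: List[int]   # token ids covering this span

@dataclass
class Trajectory:
    segments: List[Segment]

    def token_mask(self, kind: SegmentKind) -> List[int]:
        """Binary mask over the flattened token stream for one segment kind."""
        return [1 if seg.kind is kind else 0
                for seg in self.segments for _ in seg.tokens]

# One Thought–Action–Observation step, flattened into (fake) token ids.
traj = Trajectory(segments=[
    Segment(SegmentKind.REASON, [101, 102, 103]),
    Segment(SegmentKind.ACT, [201, 202]),
    Segment(SegmentKind.OBS, [301]),
])
print(traj.token_mask(SegmentKind.ACT))  # [0, 0, 0, 1, 1, 0]
```

Masks of this kind are what the segment-aware losses in the next section consume.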
2. Distillation Methodologies
Agent distillation exhibits diverse methodological incarnations depending on the agent paradigm and the domain:
- Chain-of-Thought (CoT) and Action Supervision
- In agentic LLMs, the student is trained to reproduce teacher-generated, multi-step reasoning traces—often in the Thought–Action–Observation format (Kang et al., 23 May 2025).
- Example: the trajectory-level imitation objective can be written as

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{\tau \sim \pi_{\text{teacher}}}\Big[\sum_{t=1}^{|\tau|} \log p_\theta(\tau_t \mid \tau_{<t})\Big],$$

where $\tau$ is the teacher-generated agent trajectory and $p_\theta$ is the student (Kang et al., 23 May 2025).
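A minimal PyTorch sketch of this imitation objective follows. Here `student` is assumed to be a causal LM that returns `.logits` (as HuggingFace models do), and masking out environment-generated observation tokens is a common convention rather than necessarily the exact recipe of Kang et al.

```python
import torch.nn.functional as F

def trajectory_nll(student, input_ids, loss_mask):
    """Negative log-likelihood of a teacher trajectory under the student.

    input_ids: (B, T) tokens of the full teacher trajectory.
    loss_mask: (B, T) with 1 on tokens the student should imitate (thoughts,
    actions) and 0 on prompt/observation tokens, which the environment
    produces and which are commonly excluded from the imitation loss.
    """
    logits = student(input_ids).logits                 # (B, T, V)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts t+1
    targets = input_ids[:, 1:]
    mask = loss_mask[:, 1:].float()
    token_ll = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -(token_ll * mask).sum() / mask.sum().clamp(min=1)
```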
- Segmented Distillation Losses (Structure-Aware)
- Reasoning and action spans are explicitly segmented, and separate losses (typically KL-divergence) are applied:

$$\mathcal{L} = \mathcal{L}_{[\mathrm{REASON}]} + \mathcal{L}_{[\mathrm{ACT}]}, \qquad \mathcal{L}_{\mathrm{seg}} = \sum_{t \in \mathrm{seg}} \mathrm{KL}\big(p_{\text{teacher}}(\cdot \mid x_{<t}) \,\|\, p_{\text{student}}(\cdot \mid x_{<t})\big),$$

with token masks indicating the reasoning and action phases (2505.13820).
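The segment-wise KL objective can be sketched as below; this assumes teacher and student share a vocabulary, and `act_weight` is an illustrative trade-off knob, not a published hyperparameter.

```python
import torch.nn.functional as F

def segmented_kl_loss(teacher_logits, student_logits,
                      reason_mask, act_mask, act_weight=1.0):
    """Separate token-level KL penalties over [REASON] and [ACT] spans.

    teacher_logits, student_logits: (B, T, V) from teacher and student;
    reason_mask, act_mask: (B, T) binary masks marking the two phases.
    """
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    kl = (log_p_t.exp() * (log_p_t - log_p_s)).sum(-1)  # KL(teacher || student), (B, T)
    reason_mask, act_mask = reason_mask.float(), act_mask.float()
    l_reason = (kl * reason_mask).sum() / reason_mask.sum().clamp(min=1)
    l_act = (kl * act_mask).sum() / act_mask.sum().clamp(min=1)
    return l_reason + act_weight * l_act
```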
- Graph-Structured Multi-Agent Distillation
- Multi-agent reasoning is captured as a reasoning graph (MAG); objectives include next-token prediction, contrastive losses, and graph-based structural classification (Chen et al., 2 Feb 2024).
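A hedged sketch of how the three objectives might be combined; the margin-based contrastive term and the loss weights are plausible stand-ins, not MAGDi's published architecture.

```python
import torch.nn.functional as F

def magdi_style_loss(next_token_loss, pos_emb, neg_emb,
                     node_logits, node_labels,
                     margin=1.0, alpha=1.0, beta=1.0):
    """Combine next-token, contrastive, and node-classification terms.

    next_token_loss: scalar LM loss on teacher reasoning text.
    pos_emb / neg_emb: (N, D) embeddings of correct / incorrect reasoning
    nodes from the interaction graph. node_logits / node_labels feed a
    per-node correctness classifier. margin, alpha, beta are illustrative.
    """
    dist = F.pairwise_distance(pos_emb, neg_emb)   # (N,)
    contrastive = F.relu(margin - dist).mean()     # push incorrect nodes away
    node_cls = F.cross_entropy(node_logits, node_labels)
    return next_token_loss + alpha * contrastive + beta * node_cls
```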
- Trajectory Generation and Verification
- Automated agents decompose problems into sub-tasks, invoke specialized models, and produce textual chains-of-thought that are then filtered for correctness and coherence before being distilled (Shi et al., 2 Dec 2024).
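A schematic of the two-stage filter, with `is_correct` and `coherence_judge` as hypothetical callables standing in for the verifiers described in the paper:

```python
from typing import Callable, Iterable, List

def filter_trajectories(trajectories: Iterable[dict],
                        is_correct: Callable[[dict], bool],
                        coherence_judge: Callable[[dict], float],
                        threshold: float = 0.8) -> List[dict]:
    """Keep only trajectories that pass both verification stages."""
    kept = []
    for traj in trajectories:
        if not is_correct(traj):               # stage 1: final-answer correctness
            continue
        if coherence_judge(traj) < threshold:  # stage 2: CoT coherence score
            continue
        kept.append(traj)
    return kept
```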
- Protocol and Tool Distillation
- AgentDistill distills not by parameter fine-tuning but by extracting, clustering, and reusing structured Model–Context–Protocols (MCPs): code-level procedural modules that frozen student agents reuse directly (Qiu et al., 17 Jun 2025).
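The following toy registry illustrates the plug-and-play reuse pattern; AgentDistill's actual MCP format and retrieval machinery are richer than this sketch.

```python
from typing import Callable, Dict

class MCPRegistry:
    """Toy library of distilled protocol modules (plain callables here)."""

    def __init__(self):
        self._modules: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        self._modules[name] = fn

    def invoke(self, name: str, *args, **kwargs):
        return self._modules[name](*args, **kwargs)


registry = MCPRegistry()
# A module "extracted" from a teacher session, e.g. for Game-of-24-style tasks.
# Note: eval is unsafe on untrusted input; this is a toy demo only.
registry.register("evaluate_expression",
                  lambda expr: eval(expr, {"__builtins__": {}}))
print(registry.invoke("evaluate_expression", "(8 - 4) * 6"))  # 24
```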
- Dual-Strategy and Multi-Strategy Distillation
- Methods such as DualDistill combine trajectories from multiple agents—e.g., text-based CoT and tool-using agents—by composing mixed demonstrations and letting the student learn to switch strategies as needed (Du et al., 8 Jul 2025).
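A minimal sketch of composing one such mixed demonstration; the switch token is invented for illustration, since the actual transition cues are hand-designed per task.

```python
def compose_dual_demo(text_steps, tool_steps, switch_cue="<switch_to_tool>"):
    """Stitch one mixed demonstration from two teachers' trajectories.

    text_steps: reasoning strings from a CoT teacher; tool_steps: formatted
    tool calls from a tool-using teacher; switch_cue: a hand-designed
    transition marker (the literal token here is invented).
    """
    return "\n".join(list(text_steps) + [switch_cue] + list(tool_steps))

print(compose_dual_demo(
    ["First bound the answer by a rough estimate."],
    ["tool: python", "code: print(sum(x**2 for x in range(10)))"],
))
```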
3. Applications and Domains
Agent distillation is applied across a wide range of domains, each requiring different technical adaptations:
- Video-Language Reasoning: AoTD constructs automated, multi-tool chains-of-thought for VideoQA, integrates objective and coherence-based verification, and distills into Video-LLMs such as LLaVA-NeXT-Video, improving both accuracy and spatial-temporal groundability (Shi et al., 2 Dec 2024).
- Domain-Specific LLMs: In microdomains (e.g., IT operations, Hitachi’s JP1), agent distillation internalizes lengthy ReAct/CoT trajectories, yielding significant accuracy and efficiency gains relative to few-shot or standard SFT (Xue et al., 1 Oct 2025).
- Multi-Agent Reinforcement Learning (MARL):
- KnowRU and PTDE integrate teacher-to-student knowledge transfer at the actor level or by distilling global personalized embeddings into local policies, significantly accelerating convergence and improving asymptotic performance (Gao et al., 2021, Chen et al., 2022).
- Double Distillation Network uses both external and internal distillation modules—distilling from centralized teacher features and rewards—improving robustness under partial observability (Zhou et al., 5 Feb 2025).
- Tool-augmented Language Agents: Small LLMs distilled on full agentic trajectories with tool invocations achieve performance competitive with much larger CoT-distilled models, especially with methods such as first-thought prefix prompting and self-consistent action selection (Kang et al., 23 May 2025); see the sketch after this list.
- Multi-Agent Reasoning Graphs: MAGDi distills the interaction structure of multi-agent debates into a student LLM using combined next-token, contrastive, and graph-based objectives, yielding strong generalization and sample efficiency (Chen et al., 2 Feb 2024).
- Multi-Agent System Distillation with Knowledge Graphs: KG-MASD formulates the distillation problem as a Markov Decision Process enriched with a knowledge graph prior, enforcing collaborative reasoning verification to ensure reliability in industrial QA (Pan et al., 3 Oct 2025).
- Tool Library and Protocol Distillation: AgentDistill enables training-free transfer by extracting reusable code modules, which student agents can dynamically assemble for complex reasoning or symbolic math (Qiu et al., 17 Jun 2025).
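Picking up the self-consistent action selection mentioned above: a minimal sketch that reads it as majority voting over k sampled actions. This interpretation is an assumption, not necessarily the exact procedure of Kang et al.

```python
from collections import Counter
from typing import Callable

def self_consistent_action(sample_action: Callable[[], str], k: int = 5) -> str:
    """Return the majority action among k stochastic samples.

    sample_action is assumed to draw one action string from the student at
    nonzero temperature; ties fall to the first-seen action.
    """
    votes = [sample_action() for _ in range(k)]
    return Counter(votes).most_common(1)[0][0]
```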
4. Empirical Effects and Quantitative Results
Across settings, agent distillation yields validated performance improvements:
| Domain/Task | Method / Benchmark | Baseline | Agent Distillation | Δ |
|---|---|---|---|---|
| VideoQA (MC-VQA) | MVBench (LLaVA-NeXT-Video) | 53.4% | 55.6% (AoTD) | +2.2 pts |
| Domain QA (JP1) | Engineer Level | 68% (base SFT) | 84% (agent FT) | +16 pts |
| MARL (Simple Spread) | MAAC (episodes to reward threshold) | 37,000 episodes | 29,000 episodes (KnowRU) | –21.6% |
| MARL (SMAC 3s_vs_5z) | QMIX_GIU win rate | 86.8% | 99.2% (PTDE GIP) | +12.4 pts |
| Reasoning LM (MAGDi) | Mistral-7B (no teacher) | 46.17% | 56.88% | +10.71 pts |
| Biomedical VQA (PathVQA) | LLaMA3.1-8B (before MCP) | 46.7% | 50.0% (AgentDistill) | +3.3 pts |
| Computational Math (DeepMath-L) | DeepSeek-R1-Distill-7B (CoT) | 56.3% | 65.3% (Agentic-R1, SD) | +9.0 pts |
Key trends include increased convergence speed (MARL), higher final task accuracy (VideoQA, domain QA, QA reasoning), more robust policy transfer (MARL, agentic RL), and improved efficiency (smaller models matching larger teacher performance).
5. Theoretical, Structural, and Implementation Considerations
- Loss Function Design: Segment-aware, trajectory-level, or joint (reasoning/action/graph) loss decomposition is essential for maintaining both reasoning fidelity and decision/action consistency (2505.13820, Chen et al., 2 Feb 2024).
- Verification and Quality Control: Agent-of-Thoughts Distillation and KG-MASD both rely on multi-stage verification—final-answer correctness and chain-of-thought coherence (AoTD), or triple validation and a prior-quality index (KG-MASD)—to filter trajectories for distillation, supporting both accuracy and reliability (Shi et al., 2 Dec 2024, Pan et al., 3 Oct 2025).
- Curriculum and Data Ordering: Curriculum tuning (sequence length, reasoning complexity, shuffled vs. CoT-last) affects learning efficiency and generalization, as shown in domain adaptation (Xue et al., 1 Oct 2025).
- Architecture and Tool Encapsulation: Unified LLM architectures reserve special tokens for agent or tool triggering; agent selection and response generation are jointly learned (Li et al., 6 Aug 2025). A minimal sketch of the special-token mechanics follows this list.
- Training-free Distillation: AgentDistill departs from standard parameter tuning; instead, it relies on structured extraction, abstraction, and clustering of protocol modules, reused as plug-and-play tools in the student agent (Qiu et al., 17 Jun 2025).
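As noted above, reserving special tokens can be sketched with the standard HuggingFace API; the token strings here are hypothetical, since the actual inventory is framework-specific.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical marker tokens; real frameworks define their own inventory.
SPECIAL_TOKENS = ["<think>", "</think>", "<tool_call>", "</tool_call>"]

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any causal LM base works
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})
model.resize_token_embeddings(len(tokenizer))       # grow embeddings to match
```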
6. Limitations, Open Questions, and Future Directions
Current limitations and research opportunities in agent distillation include:
- Data Coverage & Generalization: If the set of distilled modules or trajectories does not adequately cover novel test distributions, student agents may revert to baseline or fail to generalize (Qiu et al., 17 Jun 2025).
- Computational Cost: While the distilled student is efficient at inference, generation and filtering of high-quality agentic data is computationally and resource intensive (e.g., reliance on GPT-4 teachers, as in (Xue et al., 1 Oct 2025)).
- Verification Reliability: Automated LLM-based filters have failure modes; robustness to noisy or hallucinated intermediate outputs is an open research question (Shi et al., 2 Dec 2024, Pan et al., 3 Oct 2025).
- Strategy Selection & Composition: Current frameworks for multi-strategy distillation (e.g., DualDistill) require hand-designed transition cues; automatically learned strategy selection and explicit strategy-switching modules remain unaddressed (Du et al., 8 Jul 2025).
- Training-Free and Meta Approaches: The potential for fully online or meta-distillation, continual expansion of protocol libraries, and dynamic MCP retrieval remains open (Qiu et al., 17 Jun 2025).
- Applicability to Multimodal/Multi-agent Teams: Scaling agent distillation to cover vision, language, action, and symbolic reasoning in tightly-coupled agent teams is an emerging challenge (Li et al., 6 Aug 2025, Chen et al., 2 Feb 2024).
- Security, Privacy, and Alignment: Distilling rich agentic behaviors and tool use into small models raises further challenges in controllability and adherence to safety constraints.
7. Summary Table: Agent Distillation Approaches
| Framework | Distillation Granularity | Structural Supervision | Tool/Action Supervision | Verification | Notable Benchmark Domains | Reference |
|---|---|---|---|---|---|---|
| AoTD | CoT + API calls | Chains-of-Thought | Yes (vision tools) | Correctness+Coherence | VideoQA, Multi-choice, OE-QA | (Shi et al., 2 Dec 2024) |
| StructuredAD | Reason/Act spans | KL-div at segment | Yes | – | WebShop, ALFWorld, QA | (2505.13820) |
| MAGDi | Multi-agent interaction | Graph/contrastive | – | Node label | Commonsense, Math Reasoning | (Chen et al., 2 Feb 2024) |
| CoA | Agent+Tool/switch | Agent token selection | Yes | Reflection, RL check | Web QA, CodeGen, Math Benchmarks | (Li et al., 6 Aug 2025) |
| AgentDistill | Protocol modules (no FT) | Protocol clusters | Direct reuse | Syntactic, exec | Biomedical QA, Game-of-24, SLAKE | (Qiu et al., 17 Jun 2025) |
| DualDistill | Strategy composition | Trajectory mask | Text/tool mix | Grader | DeepMath-L, AMC, AIME | (Du et al., 8 Jul 2025) |
| KG-MASD | KG-triple state MDP | Knowledge graph prior | Triple extraction | Triple validation | Industrial QA, expert domains | (Pan et al., 3 Oct 2025) |
| Agent fine-tuning | ReAct/CoT trajectories | Autoregressive token | Yes | Cross-entropy only | IT/JP1 microdomain | (Xue et al., 1 Oct 2025) |
| KnowRU, DDN, PTDE | Action/policy in MARL | Actor/critic, embedding | No, or env action | – | MARL (SMAC, MPE, Football, LTR) | (Gao et al., 2021, Zhou et al., 5 Feb 2025, Chen et al., 2022) |
Agent distillation provides a suite of theoretically grounded and empirically validated techniques for internalizing complex agent behaviors into efficient, generalizable, and modular student agents, suitable for web-scale problem-solving, multimodal reasoning, tool-augmented language processing, and multi-agent reinforcement learning. Its ongoing evolution reflects the convergence of multi-agent systems, tool-use LLMs, and efficient model compression paradigms.