Distilling LLM Agent into Small Models with Retrieval and Code Tools (2505.17612v1)

Published 23 May 2025 in cs.CL and cs.AI

Abstract: LLMs excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller LLMs (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.

Summary

  • The paper presents Agent Distillation, a framework that transfers interactive LLM agent behaviors into small models using retrieval and code tools.
  • The paper employs two novel techniques—first-thought prefix and self-consistent action generation—to improve planning and reduce errors, achieving performance comparable to larger models.
  • The paper demonstrates enhanced generalization on factual and mathematical tasks, underscoring the value of tool-augmented reasoning for efficient on-device AI.

LLMs have demonstrated impressive capabilities in complex reasoning tasks, but their significant computational cost hinders practical deployment. Smaller LLMs (sLMs) are more efficient but struggle to replicate these complex reasoning abilities. While Chain-of-Thought (CoT) distillation, which trains sLMs to mimic LLM reasoning traces, has shown promise, distilled sLMs often fail on tasks requiring precise factual knowledge or calculations not seen during training, leading to hallucinations.

This paper introduces Agent Distillation, a framework designed to transfer not just static reasoning, but the full task-solving behavior of LLM agents into sLMs. This is achieved by distilling interactive reason-act-observe trajectories generated by a teacher LLM agent that uses tools like retrieval and code execution. The goal is to train sLMs to reason through problems and then take actions using these tools, observe the outcomes, and adapt their approach, effectively cloning the teacher's agentic behavior. This approach aims to improve generalization by teaching the sLM how to use tools to find information or perform calculations, rather than requiring it to memorize specific facts or computational steps.
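To make the distillation target concrete, the sketch below shows one plausible way to serialize a reason-act-observe trajectory into a training sequence for the student. The step structure, field names, and the `serialize_trajectory` helper are illustrative assumptions, not the paper's exact format.

```python
# A minimal sketch of turning a teacher agent's reason-act-observe
# trajectory into a supervised fine-tuning example for the student.
# Field names and layout are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class AgentStep:
    thought: str       # free-form reasoning at this step
    action: str        # tool call, e.g. a search query or code snippet
    observation: str   # tool output returned by the environment

def serialize_trajectory(question: str, steps: list[AgentStep], answer: str) -> str:
    """Flatten a trajectory into one training sequence. In practice, the
    fine-tuning loss would typically be applied only to thought/action
    tokens, since observations come from the environment, not the model."""
    parts = [f"Question: {question}"]
    for step in steps:
        parts.append(f"Thought: {step.thought}")
        parts.append(f"Action: {step.action}")
        parts.append(f"Observation: {step.observation}")
    parts.append(f"Final Answer: {answer}")
    return "\n".join(parts)
```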

The authors propose two key methods to enhance agent distillation, particularly for small models:

  1. First-thought prefix: This method addresses the observation that LLM agents, when prompted directly, may exhibit different initial reasoning patterns than under CoT prompting, sometimes degrading performance on tasks they could otherwise solve with CoT. The first reasoning step the teacher generates under a standard CoT prompt is prepended as a prefix to the agent's initial thought, encouraging the teacher agent to produce trajectories that open with structured, reflective planning. This aligns better with the model's instruction-tuned behavior and provides better supervision for the student (see the first sketch after this list).
  2. Self-consistent action generation (SAG): Small distilled agents can struggle to produce valid actions, especially executable code. At inference time, this method samples multiple thought-action sequences for each step using diverse decoding, filters out sequences that cause parsing or execution errors using a lightweight interpreter, and then performs majority voting over the resulting observations to select the action that produces the most consistent outcome, improving robustness (see the second sketch after this list).
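
As a concrete illustration of the first technique, here is a minimal sketch of the first-thought prefix, assuming a generic `teacher_generate` completion function; the prompt templates are placeholders, not the paper's actual prompts.

```python
# Illustrative sketch of the first-thought prefix. `teacher_generate`
# stands in for any LLM completion call; prompt strings are assumptions.

def first_thought_prefix(teacher_generate, question: str) -> str:
    """Run the teacher with a plain CoT prompt and extract its first
    reasoning step."""
    cot_prompt = f"Question: {question}\nLet's think step by step."
    cot_output = teacher_generate(cot_prompt)
    # Keep only the opening reasoning step (here: the first line).
    return cot_output.strip().splitlines()[0]

def agent_prompt_with_prefix(question: str, prefix: str) -> str:
    """Seed the agent's initial thought with the CoT prefix so the
    trajectory opens with structured, reflective planning."""
    return f"Question: {question}\nThought: {prefix}"
```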
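And for the second technique, a minimal sketch of self-consistent action generation, assuming two hypothetical helpers: `sample_step`, which draws one thought-action pair with stochastic decoding, and `run_action`, which executes the action in a sandboxed interpreter and raises on parse or execution errors.

```python
# A minimal sketch of self-consistent action generation at inference time,
# under the assumed helpers described above.

from collections import Counter

def self_consistent_action(sample_step, run_action, context: str, k: int = 8):
    """Sample k candidate actions, keep those that execute cleanly, and
    return a candidate whose observation matches the majority outcome."""
    candidates = []
    for _ in range(k):
        thought, action = sample_step(context)   # diverse decoding
        try:
            observation = run_action(action)     # lightweight interpreter
        except Exception:
            continue                             # drop invalid actions
        candidates.append((thought, action, observation))
    if not candidates:
        return None  # caller may fall back to greedy decoding
    # Majority vote over observations; pick a candidate producing the mode.
    majority_obs, _ = Counter(obs for _, _, obs in candidates).most_common(1)[0]
    return next(c for c in candidates if c[2] == majority_obs)
```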

The proposed Agent Distillation framework was evaluated using Qwen2.5-Instruct models, with the 32B model as the teacher and the 0.5B, 1.5B, 3B, and 7B models as students. Evaluation covered eight reasoning tasks, four factual (HotpotQA, Bamboogle, MuSiQue, 2WikiMultiHopQA) and four mathematical (MATH, GSM-Hard, AIME, OlymMATH), spanning both in-domain and out-of-domain generalization. Performance was compared against CoT distillation and CoT distillation augmented with retrieval-augmented generation (RAG).

The results show that Agent Distillation, especially when combined with the first-thought prefix and self-consistent action generation, consistently improves the performance of small models across tasks compared to standard CoT distillation. Critically, the distilled small agents (0.5B, 1.5B, 3B) achieved performance comparable to or better than larger models (1.5B, 3B, 7B respectively) fine-tuned using CoT distillation. The agent approach is particularly effective on out-of-domain tasks, highlighting its improved generalization. For factual tasks, the agent, with its ability to adaptively retrieve information, outperformed even RAG-enhanced CoT models. For mathematical tasks, tool use for calculations boosted performance on complex problems like AIME and OlymMATH and improved robustness on GSM-Hard.

Analysis revealed that the first-thought prefix helps small agents tackle more complex math problems, steering them towards structured planning, although it can sometimes lead to reliance on internal knowledge rather than retrieval in factual tasks. Self-consistent action generation significantly reduces invalid code generation errors, particularly for smaller models. While larger agents naturally make more retrieval calls, the first-thought prefix was observed to reduce retrieval calls, suggesting a trade-off between structured internal thought and tool utilization. Comparing self-consistent action generation with CoT self-consistency showed that the agent approach with SAG remained superior on challenging math problems like AIME under similar computational budgets. Token count analysis indicated that agents do not necessarily generate more tokens than CoT models; they tend to use more for factual tasks requiring multiple retrievals and fewer for math tasks by offloading computation to code.

The authors acknowledge limitations, including testing only on the Qwen2.5 model family and a single teacher model, not investigating the impact of trajectory count, and focusing solely on retrieval and code tools. The work has positive broader impacts by enabling more accessible and privacy-preserving AI on local devices but also raises concerns about potential misuse for malicious activities, necessitating robust safeguards. The authors suggest future work could explore improved trajectory generation tailored for sLMs and incorporating reinforcement learning for further refinement of tool use post-distillation.

Overall, Agent Distillation presents a practical framework for building capable, tool-using small LLMs, overcoming limitations of traditional reasoning distillation and paving the way for efficient on-device AI agents. (2505.17612)
