LLM-to-SLM Agent Conversion Algorithm
Updated 30 June 2025
- LLM-to-SLM agent conversion is a methodology that transforms LLM-dependent agent systems into efficient, low-cost SLM-based architectures.
- It employs dynamic agent selection, hybrid inference, and modular designs to optimize performance with reduced computational overhead.
- The approach leverages data-driven fine-tuning and self-improving mechanisms to achieve scalable, task-specific deployment in multi-agent systems.
The LLM-to-SLM Agent Conversion Algorithm refers to a suite of methodologies and architectural strategies for transforming agentic systems reliant on LLMs into highly efficient, cost-effective, and often modular systems leveraging small language models (SLMs). This conversion aims to maintain or enhance agentic capabilities—such as collaborative problem solving, strategic reasoning, and tool use—while reducing computational cost, latency, and infrastructure overhead. The topic spans foundational algorithms for agent selection, hybrid inference acceleration, modular system design, resource allocation, and adaptive specialization.
1. Agent Selection and Dynamic Team Optimization
A central challenge in LLM-to-SLM agent conversion is selecting an optimal subset of agents (LLMs, SLMs, or a mix) tailored to the current domain or task. The DyLAN framework introduces the Agent Importance Score (AIS), an unsupervised metric quantifying each agent's contribution to team performance through a multi-step, peer-evaluation process. The team optimization proceeds as follows:
- Propagation: All candidate agents participate in solving sample tasks, with each agent rating the solutions of its predecessors at each time step.
- Aggregation: Incoming peer scores are aggregated recursively, yielding a per-agent contribution per round.
- Selection: The overall AIS for agent $i$ is $I_i = \sum_{t=1}^{T} I_{t,i}$, where $I_{t,i}$ is the agent's contribution at time step $t$. The top-$k$ agents are selected for deployment.
- Dynamic Filtering & Early Stopping: Non-contributory agents are pruned during inference; consensus among active agents (Byzantine consensus) allows for early termination, further reducing computation.
This methodology is model-agnostic—SLMs and LLMs can be considered equally. Empirical results on code generation, reasoning, and subject-specific tasks demonstrate DyLAN's gains of 4–9.7% in accuracy and up to 65% reduction in cost per task, with SLMs preferentially selected wherever effective. The process enables mixed or pure SLM agent teams and is robust to data scarcity, reaching near-optimal selection with as little as 10% of available examples (2310.02170).
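The propagation, aggregation, and selection steps above can be sketched in a few lines of Python. The peer-score structure and the mean-based aggregation below are illustrative simplifications of DyLAN's recursive scheme, not its exact implementation; all names are made up.

```python
# Sketch of DyLAN-style Agent Importance Score (AIS) selection.
# Assumed input: peer_scores[t][i] is the list of ratings that agent i's
# contribution received from its peers at time step t (illustrative).

def agent_importance_scores(peer_scores):
    """Aggregate incoming peer ratings into a per-agent AIS.

    Returns I[i] = sum over t of the mean rating agent i received at t
    (a simple stand-in for the paper's recursive score propagation).
    """
    num_steps = len(peer_scores)
    num_agents = len(peer_scores[0])
    ais = [0.0] * num_agents
    for t in range(num_steps):
        for i in range(num_agents):
            ratings = peer_scores[t][i]
            ais[i] += sum(ratings) / len(ratings)
    return ais

def select_top_k(ais, k):
    """Pick the k agents with the highest importance scores."""
    ranked = sorted(range(len(ais)), key=lambda i: ais[i], reverse=True)
    return ranked[:k]

# Toy example: 2 time steps, 3 candidate agents, 2 peer raters each.
scores = [
    [[0.9, 0.8], [0.2, 0.3], [0.6, 0.5]],  # step 1
    [[0.7, 0.9], [0.1, 0.2], [0.8, 0.6]],  # step 2
]
ais = agent_importance_scores(scores)
team = select_top_k(ais, k=2)  # agents 0 and 2 form the deployed team
```

Pruning non-contributory agents during inference then amounts to re-running the selection with the scores observed so far and dropping agents that fall out of the top-$k$.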
2. Hybrid Model Architectures for Fast Inference
LLM-to-SLM conversion is also realized through hybrid inference mechanisms, with SLMs conditioned on LLM-derived representations. The LLM-to-SLM architecture for fast autoregressive decoding, as proposed in “Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding,” comprises:
- Prompt Encoding: A frozen LLM encodes the prompt via a parallelizable forward pass.
- Representation Adaptation: A lightweight projector MLP adapts LLM features to the SLM input.
- SLM Decoding: The SLM, guided by these representations, handles all token-wise autoregressive decoding.
The system supports two fusion strategies (embedding addition or replacement), with only the SLM fine-tuned while the LLM stays frozen. Empirical results show up to a 4.2× speedup (e.g., 14.8 ms/token for T5 Small vs. 61.5 ms/token for T5 Large) at a performance drop of only 1–2% on translation and summarization tasks; the remaining runtime overhead is mainly attributable to prompt adaptation. The approach generalizes across model families and is compatible with further acceleration techniques, enabling near-LLM performance at SLM-level cost (2402.16844).
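The interface between the three components can be illustrated with a toy, pure-Python sketch: a frozen "LLM" encodes the prompt once, a small projector maps its features to the SLM's embedding width, and the SLM adds the projected vector to each of its token embeddings before decoding. All shapes, weights, and names are illustrative; real systems would use actual model forward passes.

```python
# Toy sketch of the LLM-to-SLM fusion interface (no real models).

def project(features, weights):
    """Linear projector: map LLM features (len d_llm) to SLM width d_slm."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def fuse_add(slm_embeddings, projected):
    """'Addition' fusion: add the projected prompt vector to every SLM
    token embedding (the alternative strategy replaces embeddings)."""
    return [[e + p for e, p in zip(tok, projected)] for tok in slm_embeddings]

llm_features = [1.0, 2.0]                               # frozen LLM prompt encoding, d_llm = 2
proj_weights = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]   # trainable projector, d_slm = 3
slm_embeds = [[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]]         # two SLM token embeddings

# The SLM then autoregressively decodes from these conditioned embeddings.
conditioned = fuse_add(slm_embeds, project(llm_features, proj_weights))
```

The key cost property is visible even in the sketch: the expensive LLM encoding runs once per prompt, while every decoded token touches only SLM-sized computation plus one cheap vector addition.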
3. Modular Architectures and Role Separation
LLM-Agent-UMF establishes a comprehensive, modular reference framework for agent conversion and system design. Agents are decomposed into:
- LLMs: Responsible for complex reasoning and generation.
- Tools: External function providers (databases, APIs).
- Core-Agents: Newly defined central coordinators, sub-typed as:
  - Active core-agents: possess planning, memory, profile, action, and security modules; responsible for orchestration.
  - Passive core-agents: equipped only with action and security modules; perform stateless, directive execution.
Multi-core agent architectures facilitated by UMF (e.g., “one-active-many-passive”) enable easy replacement of LLM-based modules with SLM-powered passive core-agents where tasks are well-bounded. The framework’s classification supports hybrid integration, modular upgrading, and systematic identification and mitigation of security and privacy gaps (2409.11393).
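A minimal sketch of the active/passive role split can make the replacement path concrete. The class and directive names below are hypothetical illustrations of the UMF decomposition, not an API from the paper: the active core-agent keeps state and orchestrates, while passive core-agents execute bounded directives statelessly and are natural candidates for SLM backing.

```python
# Illustrative one-active-many-passive layout (all names are made up).

class PassiveCoreAgent:
    """Stateless executor: action + security modules only."""
    def __init__(self, name, allowed):
        self.name = name
        self.allowed = set(allowed)      # security module: directive allow-list

    def act(self, directive, payload):
        if directive not in self.allowed:
            raise PermissionError(f"{self.name}: directive not allowed")
        # In practice, an SLM or tool call would produce this result.
        return f"{self.name} executed {directive}({payload})"

class ActiveCoreAgent:
    """Orchestrator: planning + memory on top of action/security."""
    def __init__(self, workers):
        self.workers = workers
        self.memory = []                 # memory module: trace of outcomes

    def run(self, plan):
        # The plan would come from the planning module; it is given here.
        for directive, payload in plan:
            worker = next(w for w in self.workers if directive in w.allowed)
            self.memory.append(worker.act(directive, payload))
        return self.memory

slm_workers = [PassiveCoreAgent("slm-sql", {"query"}),
               PassiveCoreAgent("slm-mail", {"send"})]
trace = ActiveCoreAgent(slm_workers).run([("query", "orders"), ("send", "report")])
```

Swapping an LLM-backed worker for an SLM-backed one changes only the executor behind `act`; the orchestration, allow-lists, and memory trace are untouched, which is the modularity argument in miniature.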
4. Resource Allocation and Coordination in Multi-Agent Systems
Optimizing task assignment and maximizing utilization in agent teams are core aspects of conversion. Two coordination paradigms are:
- Orchestrator: A central agent plans and assigns all actions, analogous to a Hungarian algorithm solution for assignment problems; suitable for small systems but scales poorly.
- Planner: A semi-centralized, high-capacity agent generates high-level plans, distributed to SLM "worker agents" for decentralized, parallel execution. This planner-based approach maximizes efficiency (completed orders per dollar of cost), especially when agents have heterogeneous capabilities.
Empirical studies show that access to explicit worker capability increases efficiency, and decentralized execution harnesses team parallelism, supporting larger and more diverse SLM-based systems (2504.02051).
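The planner's use of explicit capability profiles can be sketched as an assignment problem. The greedy matcher below is a simple stand-in for an optimal solver (such as the Hungarian algorithm mentioned above); the worker names, subtasks, and capability scores are invented for illustration.

```python
# Planner sketch: score every (worker, subtask) pair from capability
# profiles, then greedily hand each free worker its best-matched subtask
# so that workers can execute in parallel.

def plan_assignments(subtasks, capabilities):
    """Greedy one-to-one assignment of subtasks to workers by score."""
    pairs = [(capabilities[w].get(t, 0.0), w, t)
             for w in capabilities for t in subtasks]
    pairs.sort(reverse=True)                     # best matches first
    assigned, used_workers, used_tasks = {}, set(), set()
    for score, w, t in pairs:
        if w not in used_workers and t not in used_tasks:
            assigned[t] = w
            used_workers.add(w)
            used_tasks.add(t)
    return assigned

capabilities = {
    "slm-coder":  {"write_patch": 0.9, "summarize": 0.4},
    "slm-writer": {"write_patch": 0.2, "summarize": 0.8},
}
plan = plan_assignments(["write_patch", "summarize"], capabilities)
```

Withholding the capability table (forcing uniform scores) degrades the matching, which mirrors the empirical finding that access to explicit worker capability increases team efficiency.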
5. Self-Improving and Specialized Agents
Self-improvement via agentic scaffolding provides a route for functional LLM-to-SLM conversion. The SICA agent demonstrates:
- Iterative self-editing of its own codebase and tools, guided by a modular architecture and performance-driven utility function:
$U = w_{\text{score}}\, p_{\text{score}} + w_{\text{cost}}\left[1-\min\left(1,\frac{P_{\text{cost}}}{\$10}\right)\right] + w_{\text{time}}\left[1-\min\left(1,\frac{P_{\text{time}}}{300\,\text{s}}\right)\right]$

- Use of sub-agents, tool abstractions, and human-in-the-loop safety for robust, autonomous specialization.
- Performance gains of up to 53% on coding tasks.

This framework demonstrates conversion by continual, domain-driven specialization, as opposed to only training or pruning (2504.15228).

6. General Data-Driven LLM-to-SLM Conversion Algorithms

The systematic process outlined in "Small LLMs are the Future of Agentic AI" comprises:

1. Secure Usage Data Collection: Logging all non-HCI prompts/responses.
2. Data Curation and Filtering: Removal of sensitive content and standardization.
3. Unsupervised Task Clustering: Identification of recurring tasks via embedding-based clustering.
4. SLM Selection: Choice of candidate SLMs based on capability, performance, and cost.
5. SLM Fine-Tuning or Distillation: Task- or cluster-specific adaptation by LoRA, QLoRA, or supervised distillation, with loss:

$\mathcal{L}_{\text{distill}} = \sum_{(p_i, r_i)} \text{CE}\big(P(r_i \mid p_i; \theta_{\text{SLM}}),\, P(r_i \mid p_i; \theta_{\text{LLM}})\big)$

6. Iteration and Maintenance: Continuous refinement and replacement as the agent system evolves.

This loop enables rapid, cost-efficient migration of non-conversational subtasks to SLMs, with case studies indicating that 40–70% of such tasks can be served by SLMs in practice.

7. Specialization and Adaptation in Multi-Agent and Embodied Settings

The LIET framework introduces a paradigmatic approach for specialized SLM adaptation:

- Individual Learning: Each agent learns a local utility function $f(\ell_{o_i}, \ell_a)$, mapping observations and candidate actions to estimated cost (e.g., step count), trained via mean-squared error on interaction datasets:

$\mathcal{L}(\theta) = \mathbb{E}_{(\ell_{o_i},\,\ell_a,\,c^{\text{GT}})}\left[\big(c^{\text{GT}} - f(\ell_{o_i}, \ell_a; \theta)\big)^2\right]$

- Team Evolution: Agents iteratively update a shared communication knowledge list, guiding and refining inter-agent messaging through reflection and observed outcomes.
- Centralized Training → Decentralized Execution: Supports both robust coordination and flexible, environment-specific adaptation across agent populations.

Empirical validation shows strong generalization: utility functions and knowledge protocols trained in small-team or limited-domain settings successfully transfer to new agents and scenarios, confirming the viability of SLM conversion approaches for embodied, real-world collaborative systems (2506.07232).

Summary Table: Representative LLM-to-SLM Agent Conversion Techniques

| Conversion Axis | Key Mechanism | Empirical/Practical Result |
|---|---|---|
| Agent Selection | Unsupervised scoring, dynamic filtering | +4–25% accuracy |
| Hybrid Inference | Prompt encoding via LLM, SLM autoregression | 3×–4× speedup, ≤2% loss |
| Modular Architectures | Core-agent role separation (active/passive) | Systematic, scalable hybrid pipelines |
| Resource Allocation | Planner-based, parallel execution | Higher efficiency, supports heterogeneity |
| Self-Improvement | Automated self-editing, modular sub-agents | 17–53% coding performance improvement |
| Data-Driven Specialization | Logging, clustering, SLM fine-tuning | 40–70% of tasks served by SLMs in practice |
| Adaptive Specialization | Utility learning and shared communication | Transferable, cooperative SLM agents |
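The distillation loss at the heart of the data-driven conversion loop (Section 6) can be made concrete with a toy computation: a cross-entropy between the frozen LLM teacher's and the SLM student's next-token distributions, summed over logged (prompt, response-token) pairs. The tiny hand-written distributions below are illustrative only, and the teacher-first CE convention is one common reading of the formula.

```python
import math

def cross_entropy(p_teacher, p_student):
    """CE(teacher, student) = -sum_v teacher(v) * log student(v)."""
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def distill_loss(pairs):
    """Sum CE over every logged (prompt, response-token) pair."""
    return sum(cross_entropy(teacher, student) for teacher, student in pairs)

# Two logged response tokens; each entry pairs the LLM teacher's
# distribution with the SLM student's, over a 3-word toy vocabulary.
log = [
    ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]),
    ([0.1, 0.8, 0.1], [0.2, 0.7, 0.1]),
]
loss = distill_loss(log)  # gradient of this loss would update the SLM only
```

In a real pipeline the distributions would come from model forward passes over the curated usage log, with LoRA/QLoRA adapters receiving the gradients rather than the full SLM weights.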
LLM-to-SLM agent conversion is realized through a spectrum of algorithmic and architectural strategies, ranging from unsupervised agent selection and hybrid decoding architectures to modular software refactoring and planner-based task routing. Collective empirical evidence demonstrates that a principled move to SLM-powered agents achieves substantial resource savings and operational flexibility, with robust or superior performance on specialized and structured tasks. Methodologies such as DyLAN, LLM-to-SLM hybrid decoding, SICA, LLM-Agent-UMF, planner-based coordination, and LIET provide concrete mechanisms, validated on a diverse set of benchmarks and real-world scenarios. These developments position SLM-centric agentic AI as a scalable, cost-effective paradigm for the future of multi-agent and task-oriented intelligent systems.