
ToolOrchestra: Modular RL Framework

Updated 27 November 2025
  • ToolOrchestra is a reinforcement learning framework that employs a specialized 8B-parameter orchestrator to coordinate heterogeneous tools for complex, multi-step reasoning tasks.
  • It improves accuracy and reduces computational cost and latency by selectively delegating sub-tasks to expert models and modules based on user preferences.
  • The framework uses Group Relative PPO for policy optimization, ensuring robust tool selection and efficient integration of diverse capabilities.

ToolOrchestra is a reinforcement learning framework and agent architecture that leverages a small, specialized orchestrator model to coordinate the use of heterogeneous models and tools for solving complex, multi-step reasoning tasks. The approach is motivated by the observation that monolithic LLMs, despite their capabilities, face both conceptual and computational limitations in tasks requiring deep agentic behavior, such as the “Humanity’s Last Exam” (HLE)—a benchmark of PhD-level, multidisciplinary queries. ToolOrchestra’s key contribution is the use of an 8B-parameter orchestrator model, trained to select and compose diverse tools (including various LLMs and symbolic modules) in a manner that jointly optimizes for correctness, computational efficiency, and user preferences (Su et al., 26 Nov 2025).

1. Motivation and Design Rationale

The ToolOrchestra paradigm addresses the two-pronged challenge of conceptual brittleness and excessive computation in large, monolithic LLMs. Agentic tasks such as those in HLE involve extended chains of search, retrieval, symbolic reasoning, and code execution, leading to high risk of hallucinations and accumulated errors in single-model architectures. Additionally, inference via large models incurs significant API cost and latency.

ToolOrchestra adopts a modular approach, whereby a small orchestrator (“brain”) routes sub-tasks to a set of specialized tools or models, each optimized for different functionalities (e.g., web search, program synthesis, factual retrieval, domain-specific reasoning). This strategy improves end-to-end accuracy by enabling delegation at points of uncertainty, reduces mean computational cost and wall-clock latency by invoking expensive tools only when necessary, and offers controllability by conditioning the orchestration policy on explicit user preferences (privacy, tool choice, or cost constraints) (Su et al., 26 Nov 2025).

2. Agent Architecture and Orchestration Scheme

The orchestrator is instantiated as an 8B-parameter model based on Qwen3-8B and operates in turn-based fashion. At each dialogue turn $k$, it observes the full interaction history $h_k$ and may emit either a chain-of-thought reasoning step or a JSON-formatted tool call, specifying the tool name and structured parameters.

The interface abstracts all tools (including search engines, code interpreters, and generalist/specialist LLMs) by a unique identifier, description, and schema-constrained parameters. The action space $\mathcal{A}$ at each turn is the union of all possible tool invocations and a lexical “final answer” option. The orchestrator only observes the cumulative textual history, relying on the environment to return post-tool observations.

| Component | Description | Example Roles |
| --- | --- | --- |
| Orchestrator | Qwen3-8B (8B-parameter LLM) | Policy selection, reasoning |
| Tool abstraction | Name, NL description, JSON parameter schema | “web-search”, “run-python-code” |
| Action space | Tool invocation or final-answer utterance | Selects tool, passes arguments |

State is modeled as $s_k \in \mathcal{S}$ (unobserved by the orchestrator), observations as $o_k \in \mathcal{O}$, and the history $h_k = (u, o_0, a_0, o_1, \ldots, a_{k-1}, o_k)$ acts as the input for policy decisions (Su et al., 26 Nov 2025).
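
For concreteness, the following sketch illustrates what a tool-registry entry and a schema-conforming tool call could look like; the specific tool fields, parameter names, and schema layout here are illustrative assumptions rather than the exact format used by ToolOrchestra.

```python
# Illustrative tool abstraction: name, natural-language description, and a
# JSON-Schema-style parameter specification (assumed layout, not the paper's exact schema).
web_search_tool = {
    "name": "web-search",
    "description": "Search the web and return the top-ranked result snippets.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "top_k": {"type": "integer", "minimum": 1, "maximum": 10},
        },
        "required": ["query"],
    },
}

# One orchestrator action: a schema-conforming invocation of that tool.
tool_call = {
    "tool": "web-search",
    "arguments": {"query": "multi-hop retrieval benchmarks from Wikipedia", "top_k": 5},
}
```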

3. Reinforcement Learning Formulation

Multi-turn tool use is formalized as an episodic Markov decision process (MDP) $M = (\mathcal{U}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{Z}, r, \rho, \gamma)$, where the orchestrator’s policy $\pi_{\theta}(a_k \mid h_k)$ generates sequences of tool calls/actions conditioned on the history.
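
To make the episodic structure concrete, a minimal rollout loop might look as follows; `policy.act` and `environment.execute` are hypothetical stand-ins for the orchestrator and the tool runtime, not interfaces defined by the paper.

```python
# Minimal rollout sketch of the multi-turn MDP; `policy` and `environment`
# are hypothetical objects standing in for the orchestrator and tool runtime.
def run_episode(policy, environment, user_query, max_turns=50):
    """Roll out one trajectory; the policy conditions only on the textual history h_k."""
    history = [{"role": "user", "content": user_query}]
    for _ in range(max_turns):
        # The action is either a reasoning step, a JSON tool call, or a final answer.
        action = policy.act(history)
        history.append({"role": "assistant", "content": action})
        if action["type"] == "final_answer":
            return history, action["content"]
        if action["type"] == "tool_call":
            observation = environment.execute(action["tool"], action["arguments"])
            history.append({"role": "tool", "content": observation})
    return history, None  # turn budget exhausted without a final answer
```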

Reward for a completed trajectory $\tau$ is computed as a composition of:

  1. Outcome-aware reward: $r_\text{outcome}(\tau) = 1$ if the final answer solves the task (measured against a GPT-5 “judge”); 0 otherwise.
  2. Efficiency-aware rewards: $r_\text{compute}(\tau) = -\$(\tau)$ (monetary cost of model calls) and $r_\text{latency}(\tau) = -\text{Clock}(\tau)$ (wall-clock time).
  3. User-preference-aware reward: allows users to specify a preference vector $\mathbf{P}$ over tools and objectives; the reward aggregates normalized counts and costs over the batch.

The resulting trajectory reward is

$$R(\tau) = \begin{cases} \mathbf{P} \cdot \mathbf{M}_\text{norm}^{\tau}, & \text{if } r_\text{outcome}(\tau) = 1 \\ 0, & \text{otherwise} \end{cases}$$

where $\mathbf{M}_\text{norm}^{\tau}$ is the batch-normalized feature vector (tool-call counts, outcome, compute, latency) (Su et al., 26 Nov 2025).

Policy optimization employs Group Relative PPO (GRPO), which computes trajectory-relative advantages and applies a clipped surrogate objective to balance exploration and exploitation:

$$A(\tau) = \frac{R(\tau) - \mathrm{mean}_{\tau' \in T}\, R(\tau')}{\mathrm{std}_{\tau' \in T}\, R(\tau')}$$

$$\mathcal{L}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\min\big(\rho_\theta(\tau)\, A(\tau),\ \mathrm{clip}(\rho_\theta(\tau),\, 1-\epsilon,\, 1+\epsilon)\, A(\tau)\big)\Big]$$

where $\rho_\theta(\tau) = \frac{\pi_\theta(\tau)}{\pi_{\theta_\mathrm{old}}(\tau)}$ (Su et al., 26 Nov 2025).
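
As an illustration of the reward and update rules above, the following sketch computes a preference-weighted reward, group-relative advantages, and the clipped surrogate loss for one group of rollouts; the function names, feature layout, and NumPy usage are assumptions for exposition rather than the authors' implementation.

```python
import numpy as np

def preference_reward(outcome_correct, metrics_norm, preference):
    """R(tau) = P . M_norm^tau if the final answer is judged correct, else 0."""
    return float(np.dot(preference, metrics_norm)) if outcome_correct else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """A(tau): standardize rewards within the group of rollouts for one task."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped objective over trajectory-level probability ratios."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))  # rho_theta(tau)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The objective is maximized; return its negative mean as a loss to minimize.
    return -np.mean(np.minimum(unclipped, clipped))

# Example: a group of 8 rollouts for one task (cf. the rollout batch size used in training).
outcomes = [1, 0, 1, 1, 0, 1, 0, 0]
metrics = np.random.rand(8, 4)  # toy normalized features: [outcome, tool-call count, compute, latency]
preference = np.array([1.0, -0.1, -0.2, -0.1])  # toy preference vector P
rewards = [preference_reward(o, m, preference) for o, m in zip(outcomes, metrics)]
advantages = group_relative_advantages(rewards)

# Dummy per-trajectory log-probabilities under the new and old policies.
logp_old = np.random.randn(8)
logp_new = logp_old + 0.01 * np.random.randn(8)
loss = clipped_surrogate_loss(logp_new, logp_old, advantages)
```

Because advantages are normalized within each group of rollouts for the same task, no learned value function is required, which is the main simplification GRPO offers over standard PPO.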

4. Training Regimen and Data Generation

ToolOrchestra training leverages ToolScale, generating over 430K synthetic examples across 10 domains (finance, medicine, travel, etc.). The procedure involves:

  • LLM-prompted generation of domain-specific database schemas, entries, and tool-API catalogs.
  • Generation of user intents mapped to concrete tasks, with “golden” tool-call sequences, optionally complexified for higher difficulty.
  • Quality control requiring error-free execution, solvability within pass@8 by a reference agent, and nontrivial (non-zero tool-call) solutions.

Fine-tuning is conducted with rollout batches of 8 trajectories per task, with each episode capped at 50 turns. The learning rate is set to 1e-6, with input lengths up to 24K tokens and output generation up to 8K tokens. Randomization of tool subsets and pricing schedules ensures that no staged curriculum is necessary.

Stability is enforced by filtering out batches with near-zero reward variance ($\sigma < 0.1$), malformed tool calls, or invalid outputs (Su et al., 26 Nov 2025).
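
As a minimal sketch of these filtering criteria (assuming a simple JSON message format not specified in the paper), rollout groups could be screened as follows:

```python
import json
from statistics import pstdev

def is_valid_tool_call(message):
    """Check that a tool-call message parses as JSON with the expected fields (assumed format)."""
    try:
        call = json.loads(message)
    except json.JSONDecodeError:
        return False
    return (isinstance(call, dict)
            and isinstance(call.get("tool"), str)
            and isinstance(call.get("arguments"), dict))

def keep_rollout_group(rewards, tool_call_messages, min_std=0.1):
    """Keep a group only if reward variance is non-trivial and every tool call is well-formed."""
    if pstdev(rewards) < min_std:
        return False
    return all(is_valid_tool_call(m) for m in tool_call_messages)
```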

5. Evaluation, Benchmarking, and Generalization

Evaluation is performed on the following benchmarks:

  • HLE (Humanity’s Last Exam, text-only): PhD-level multidisciplinary reasoning tasks.
  • FRAMES: Over 800 multi-hop factual-retrieval/complex-reasoning questions drawn from Wikipedia.
  • $\tau^2$-Bench: Domain-specialized function calling (telecom, retail, and airline dialogues).

| Tools & Models | HLE (%) | FRAMES (%) | $\tau^2$-Bench (%) | Cost (¢) | Latency (min) |
| --- | --- | --- | --- | --- | --- |
| No tools, GPT-5 | 23.4 | 66.3 | — | 6.2 | 4.1 |
| Basic tools + GPT-5 | 35.1 | 74.0 | 77.7 | 30.2 | 19.8 |
| Basic + specialist LLMs | 21.2 | 57.5 | 62.3 | 17.8 | 13.6 |
| Orchestrator-8B | 37.1 | 76.3 | 80.2 | 9.2 | 8.2 |

On HLE, Orchestrator-8B achieves 37.1% (vs. 35.1% for GPT-5 with basic tools) at roughly 2.5× lower cost; on $\tau^2$-Bench, it achieves 80.2% (vs. 77.7% for GPT-5 with basic tools) at roughly 3× lower cost. Across all cost budgets, the orchestrator achieves a better or equivalent accuracy/cost trade-off, dominating monolithic baselines (Su et al., 26 Nov 2025).

Generalization studies show that the orchestrator adapts to unseen tool descriptions (including Claude Sonnet, GPT-4o, and Codestral-22B) and unseen pricing, retaining superior performance and cost efficiency. Ablation experiments show that removing the efficiency penalty increases accuracy but at more than double the cost; omitting preference rewards degrades controllability and user alignment; and scaling the orchestrator beyond 8B parameters exhibits diminishing returns for orchestration capacity.

6. Practical Implications and Future Research Directions

The ToolOrchestra approach demonstrates the efficacy of modular intelligence: composing a lightweight orchestrator over a landscape of domain experts offers superior performance-to-cost ratios compared to scaling up monolithic LLMs. This result underscores the scalability of tool-augmented reasoning, where new tools or models can be integrated without retraining the entire system.

Several open research directions emerge:

  • Recursive orchestration, where the orchestrator delegates to subordinate orchestrators for complex sub-tasks.
  • Automated discovery and formal specification of new tools at run-time.
  • Human-in-the-loop feedback mechanisms for dynamic cost/latency trade-offs.
  • Theoretical guarantees concerning robustness and failure modes in multi-agent orchestration environments.

In summary, ToolOrchestra operationalizes a framework that balances correctness, computational efficiency, and user-aligned preferences by orchestrating the invocation of diverse expert tools via a compact policy model, achieving state-of-the-art results at a fraction of the resource requirements of large, monolithic LLMs (Su et al., 26 Nov 2025).

References

  • Su et al., 26 November 2025. arXiv:2511.21689.