
Execution-Aware Tool-Jumping

Updated 30 December 2025
  • Execution-aware tool-jumping is a mechanism where LLM agents dynamically select and chain external tool invocations based on real-time feedback and execution traces.
  • It leverages reinforcement learning, formal planning languages, and cost/latency considerations to enhance security, efficiency, and multi-step task performance.
  • The approach integrates sequential decision-making and multi-modal signals, enabling adaptive tool chaining in robotics, code navigation, and dynamic service scheduling.

Execution-aware tool-jumping is the mechanism by which LLM agents or multi-modal policies coordinate, manage, and select external tool invocations during multi-step task execution, using feedback from the environment or execution traces. Rather than statically invoking specific tools or relying solely on prompt context, execution-aware tool-jumping leverages ongoing state, partial outputs, real-time signals, and cost or safety constraints to adaptively choose, chain, or switch between tools with high precision and responsiveness. This concept encompasses security-motivated exploit chains, reinforcement-learning-driven navigation, latency-optimized serving systems, and reasoning-driven cost minimization. The field covers agent robustness, end-to-end RL integration, formal planning languages supporting tool branching and dependencies, and practical LLM/plugin orchestration.

1. Conceptual Foundations and Formal Definitions

Execution-aware tool-jumping generalizes single-step agent actions to sequences where tool selection and chaining are guided by environment states, dynamic goals, or side-channel feedback. In the STAC framework, adversarial tool-jumping enables attacks by constructing sequences $(a_1, \ldots, a_T)$ where each $a_t$ is benign in isolation, but the cumulative sequence manipulates the environment state $s_t$ such that the final action $a_T$ achieves a harmful effect undetectable by superficial prompt analysis (Li et al., 30 Sep 2025). Formally, with policy $\pi$ and deterministic transition $T(s_t, a_t)$, the adversary seeks chains satisfying:

  • $\forall i < T$: $\pi(h_{i-1}, s_{i-1}) \implies a_i$ (each step passes safety checks)
  • $T(s_{i-1}, a_i) = s_i$ (each transition sets up preconditions)
  • $a_T$ (on the full micro-context) effects the malicious goal

Execution traces, history embeddings, and joint reasoning over agent state and tool outcomes underlie policy learning for robust tool-jumping and sequence adaptation.
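
These three conditions can be checked mechanically against a deterministic simulator. Below is a minimal sketch; the helper signatures (`transition`, `passes_safety`, `is_harmful`) are illustrative assumptions, not STAC's published interface.

```python
from typing import Callable, List

State = dict
Action = dict

def is_stac_chain(
    actions: List[Action],
    s0: State,
    transition: Callable[[State, Action], State],    # deterministic T(s, a)
    passes_safety: Callable[[State, Action], bool],  # per-step benignness check
    is_harmful: Callable[[State], bool],             # predicate on the final state
) -> bool:
    """Check the three chain conditions: every step looks benign in
    isolation, each transition sets up preconditions, and only the
    composed sequence reaches the harmful target state."""
    s = s0
    for a in actions:
        if not passes_safety(s, a):  # condition 1: a_i passes safety checks
            return False
        s = transition(s, a)         # condition 2: T(s_{i-1}, a_i) = s_i
    return is_harmful(s)             # condition 3: a_T effects the goal
```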

2. Sequential Tool-Chaining and Security Analysis

STAC demonstrates execution-aware tool-jumping as a multi-turn, closed-loop pipeline for attack generation, verification, and stealth induction (Li et al., 30 Sep 2025):

  • Generator $G$: Synthesizes candidate tool chains using planner-style prompting, targeting specific failure modes (Agent-SafetyBench taxonomy: harmful content, incorrect parameters, blind trust, etc.).
  • Verifier $V$: Executes each chain in a sandbox, refines tool arguments, and ensures state transitions meet subgoal criteria.
  • Prompt Writer $W$: Reverse-engineers benign-seeming prompts that reliably elicit the intended verified sequence under the agent's learned policy.
  • Planner $P$: Interacts adaptively with the real agent, feeding constructed prompts, monitoring state, and recording whether final malicious actions are triggered.
  • Judge $J$: Scores each turn for harmlessness, goal progress, and helpfulness.
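
Read as a closed loop, the five roles compose into a single pass over candidate chains. The sketch below is schematic: the roles are passed in as callables because the paper does not publish a concrete interface for them.

```python
from typing import Callable, List, Tuple

def stac_pipeline(
    task: str,
    generate: Callable[[str], List[list]],        # G: propose candidate tool chains
    verify: Callable[[list], Tuple[bool, list]],  # V: sandbox-execute, refine arguments
    write_prompt: Callable[[list], str],          # W: benign-seeming elicitation prompt
    run_agent: Callable[[str], list],             # P: drive the real agent, collect trace
    judge: Callable[[dict], float],               # J: per-turn harmlessness/progress score
) -> List[tuple]:
    """One pass of the closed attack loop over the five roles (schematic)."""
    results = []
    for chain in generate(task):
        ok, refined = verify(chain)        # keep only chains that verify in the sandbox
        if not ok:
            continue
        prompt = write_prompt(refined)
        trace = run_agent(prompt)          # did the agent trigger the final action?
        results.append((refined, trace, [judge(turn) for turn in trace]))
    return results
```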

Attack Success Rate (ASR) over $N$ verified candidate chains is defined as

$$\mathrm{ASR} = \frac{\left|\{\, C_i : (a_1 \circ \cdots \circ a_T)(s_0) \text{ reaches the harmful target state} \,\}\right|}{N}.$$
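
Reusing the hypothetical `transition` and `is_harmful` helpers from the Section 1 sketch, the metric reduces to a fold over each chain:

```python
def attack_success_rate(chains, s0, transition, is_harmful):
    """ASR = |{C_i : composed chain from s_0 reaches the harmful state}| / N."""
    hits = 0
    for actions in chains:
        s = s0
        for a in actions:         # fold (a_1 ∘ ... ∘ a_T) over the start state
            s = transition(s, a)
        hits += int(is_harmful(s))
    return hits / len(chains) if chains else 0.0
```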

Taxonomic failures (premature execution, omission, misparameterization, constraint violation) occur frequently in tool-jumping sequences, revealing critical vulnerabilities in current agent architectures.

3. Policy Learning and Reinforcement Feedback for Tool-Jumping

Execution-aware tool-jumping requires agents to not only select tools but also decide when to invoke them and how to parameterize calls based on trial-and-error feedback. The TRICE framework instantiates this through a two-stage pipeline (Qiao et al., 2023):

  • Stage I: Behavior cloning, teaching the agent to imitate weakly supervised tool-usage labels (from ChatGPT) for both self-solved and tool-requiring queries.
  • Stage II: Reinforcement Learning with Execution Feedback (RLEF), ranking alternative completions by final post-execution accuracy. The loss combines a ranking hinge (higher for execution-accurate candidates) and a supervised term enforcing syntactic tool-call correctness.

Mathematically, the policy $\pi_\theta$ is optimized with:

$$L_{\mathrm{RLEF}} = \alpha\, C_{\mathrm{rank}} + L_{\mathrm{sft}},$$

where $C_{\mathrm{rank}}$ orders completions by execution outcome and $\alpha$ tunes ranking versus format sanity.
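
A minimal PyTorch sketch of this objective follows, assuming $k$ sampled completions per query with sequence log-likelihoods and binary execution-accuracy labels; the exact hinge formulation and `margin` value in TRICE may differ.

```python
import torch
import torch.nn.functional as F

def rlef_loss(logprobs, exec_correct, sft_logits, sft_targets,
              alpha=1.0, margin=0.1):
    """L_RLEF = alpha * C_rank + L_sft (schematic).

    logprobs:     (k,) sequence log-likelihoods of k sampled completions
    exec_correct: (k,) 1.0 if the completion was execution-accurate, else 0.0
    sft_logits:   (T, V) token logits for the reference (gold-format) completion
    sft_targets:  (T,)   reference token ids enforcing tool-call syntax
    """
    # Ranking hinge: every execution-accurate completion should score
    # at least `margin` higher than every inaccurate one.
    pos = logprobs[exec_correct > 0.5]
    neg = logprobs[exec_correct <= 0.5]
    if len(pos) and len(neg):
        c_rank = F.relu(margin - (pos.unsqueeze(1) - neg.unsqueeze(0))).mean()
    else:
        c_rank = logprobs.new_zeros(())
    # Supervised term keeping the tool-call format syntactically correct.
    l_sft = F.cross_entropy(sft_logits, sft_targets)
    return alpha * c_rank + l_sft
```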

Empirical results show that execution feedback dramatically improves selective tool usage and accuracy while eliminating unnecessary or erroneous calls, supporting both "self-solve" and adaptive tool invocation.

4. Execution-Aware Tool-Jumping in Robotics and Multi-modal Agents

SwitchVLA introduces execution-aware tool-jumping to robotic Vision-Language-Action models by segmenting demonstration trajectories into temporally grounded contact phases and dynamically modulating behavior mode—forward, rollback, or advance—conditioned on execution state and changing instructions (Li et al., 4 Jun 2025). The policy embeds multi-modal signals:

  • $s_t = (o_t, q_t, c^{\mathrm{pre}})$ for observation, proprioception, and contact feedback
  • $I = (l^{\mathrm{pre}}, l^{\mathrm{cur}})$ for instruction context

A transformer-based conditional decoder predicts the next action chunk $A_t$ and behavior mode $b_t$, supporting seamless mid-execution tool switches in response to user intent or environmental changes. Real-time switching and interpolation yield robust multi-tool chaining, with ablations showing success rates up to $96\%$ on mid-phase task switches compared to prior baselines ($8$–$11\%$). This supports both intentional and recovery-driven tool jumps, enabling unified control for long-horizon and multi-step tasks.
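
The mode-conditioned control flow can be pictured as below. This is a schematic only: the actual SwitchVLA decoder is a trained transformer, and the rollback handling shown here is an illustrative placeholder, not the paper's interpolation rule.

```python
from enum import Enum

class Mode(Enum):
    FORWARD = 0   # continue the current contact phase
    ROLLBACK = 1  # undo recent steps before switching tasks
    ADVANCE = 2   # skip ahead when the new instruction is already satisfied

def control_step(policy, obs, proprio, contact, prev_instr, curr_instr):
    """One execution-aware control step (schematic).

    `policy` stands in for the transformer decoder: it consumes the state
    s_t = (o_t, q_t, c_pre) and instruction pair I = (l_pre, l_cur) and
    returns an action chunk A_t plus a behavior mode b_t.
    """
    action_chunk, mode = policy(obs, proprio, contact, prev_instr, curr_instr)
    if mode is Mode.ROLLBACK:
        # Illustrative: retrace the current phase before the tool switch.
        action_chunk = list(reversed(action_chunk))
    return action_chunk, mode
```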

5. RL-Driven, Execution-Structured Tool-Jumping for Code Navigation

RepoNavigator streamlines tool-jumping for repository-level LLM agents by formalizing a single, execution-aware jump operation: at each agent turn $t$, the policy $\pi_\theta$ chooses either a reasoning action or a JSON-formatted tool call to jump to a symbol's definition within the codebase (Zhang et al., 24 Dec 2025). The jump tool invokes precise static analysis (e.g., Pyright's AST parsing, LEGB resolution), returning relevant code context as the observation stream.
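
The single jump action can be pictured as a JSON tool call dispatched to a static-analysis backend. The schema and field names below are assumptions for illustration, not RepoNavigator's published interface.

```python
import json

# Hypothetical JSON form of the single "jump" tool call.
example_call = '{"tool": "jump_to_definition", "symbol": "parse_config", "file": "app/loader.py"}'

def dispatch_jump(raw_call: str, resolve_definition) -> str:
    """Parse a JSON tool call and return the resolved code context.

    `resolve_definition` stands in for the static-analysis backend
    (e.g., Pyright-style AST parsing with LEGB scope resolution).
    """
    call = json.loads(raw_call)
    assert call["tool"] == "jump_to_definition"
    return resolve_definition(call["symbol"], call["file"])
```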

The RL setup optimizes sequence localization by maximizing expected reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\hat{Y}, Y^{*}, \tau)\right],$$

where $R$ includes the DICE score for the ground-truth location and $S(\tau)$ for successful tool calls.
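
As a rough sketch, the reward can be read as a weighted sum of localization overlap and tool-call success; the DICE-over-location-sets reading and the `beta` weight are assumptions, since the paper's exact weighting is not reproduced here.

```python
def reward(pred_locs, gold_locs, tool_calls_ok, n_tool_calls, beta=0.1):
    """Schematic R = DICE(pred, gold) + beta * S(tau).

    pred_locs / gold_locs: sets of (file, line) locations
    tool_calls_ok:         number of jump calls that returned valid context
    """
    inter = len(pred_locs & gold_locs)
    dice = 2 * inter / (len(pred_locs) + len(gold_locs) or 1)
    s_tau = tool_calls_ok / n_tool_calls if n_tool_calls else 0.0
    return dice + beta * s_tau
```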

This execution-structured approach reduces error propagation and search scope, yielding higher success rates and matching real invocation flows compared to multi-tool or retrieval-based agents.

6. Cost and Latency-Aware Tool Planning in LLM Agents

CATP-LLM advances execution-aware tool-jumping via cost-aware planning, using a formal Tool Planning Language (TPL) to describe arbitrary branched (DAG-structured) tool chains and a cost-augmented offline RL algorithm (Decision Transformer-based) to optimize final plan quality (Wu et al., 2024). Each plan $\tau = (t_1, d_1, \ldots, t_n)$ yields task performance $P(\tau)$ and execution cost $C(\tau)$, with policy learning maximizing the expected net utility

$$\max_\pi \; \mathbb{E}_{x,\tau}\left[\, P(\tau) - \lambda\, C(\tau) \,\right]$$

Context embedding incorporates cost vectors, and training penalizes expensive intermediate expansions and rewards high-performance, low-cost completion. CATP-LLM achieves up to $30.2\%$ higher plan performance and $45.8\%$ lower cost than GPT-4 prompting, while guaranteeing $100\%$ valid plans.
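
The objective itself is simple to illustrate. Note that CATP-LLM trains a policy offline rather than enumerating plans at inference time, so the selection loop below (and the `lam` value) is only a sketch of what the learned policy optimizes.

```python
def plan_utility(plan, perf_fn, cost_fn, lam=0.5):
    """Net utility P(tau) - lambda * C(tau) for one TPL plan (schematic)."""
    return perf_fn(plan) - lam * cost_fn(plan)

def select_plan(candidate_plans, perf_fn, cost_fn, lam=0.5):
    """Pick the DAG-structured tool plan with the highest net utility."""
    return max(candidate_plans,
               key=lambda p: plan_utility(p, perf_fn, cost_fn, lam))
```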

7. Efficient Serving and Scheduling for Execution-Aware Tool-Jumping

Conveyor implements execution-aware tool-jumping with partial execution alongside ongoing LLM decoding (Xu et al., 2024). The system overlaps tool execution and GPU-based token generation, reducing end-to-end latency for multi-stage workloads. The central parser/plugin API enables streaming detection of tool calls as soon as enough LLM output is available; the scheduler non-blockingly polls tool processes, injecting outputs into the next LLM decoding pass.
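
The key mechanism is detecting a completed tool call in the token stream before decoding finishes. The sketch below uses an invented `<tool>...</tool>` call syntax purely for illustration; Conveyor's actual parser/plugin interface differs.

```python
import re

# Hypothetical call syntax: fire a tool as soon as its call is fully emitted.
CALL = re.compile(r"<tool>(\w+)\((.*?)\)</tool>")

class StreamingToolParser:
    """Incrementally scan decoded tokens and surface tool calls early (schematic)."""
    def __init__(self):
        self.buf = ""
        self.dispatched = 0
    def feed(self, token: str):
        self.buf += token
        calls = CALL.findall(self.buf)
        new = calls[self.dispatched:]  # only calls not yet launched
        self.dispatched = len(calls)
        return new                     # launch these while decoding continues
```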

Formally, Conveyor's model bounds latency as:

$$L_{\mathrm{new}} = \sum_{i=1}^{n} \max\{g_i, t_i\} + g_{n+1}$$

where $g_i$ are LLM decoding times and $t_i$ tool execution times.
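
A tiny worked comparison against the sequential baseline $L_{\mathrm{old}} = \sum_i (g_i + t_i) + g_{n+1}$ makes the overlap benefit concrete:

```python
def latency_bounds(g, t):
    """Conveyor-style overlapped bound vs. sequential execution.

    g: decoding times g_1..g_{n+1}; t: tool times t_1..t_n (len(g) == len(t) + 1)
    """
    assert len(g) == len(t) + 1
    new = sum(max(gi, ti) for gi, ti in zip(g, t)) + g[-1]  # overlapped
    old = sum(g) + sum(t)                                   # sequential
    return new, old

# Equal decode/tool times give the best overlap: (3.0, 5.0).
print(latency_bounds([1.0, 1.0, 1.0], [1.0, 1.0]))
```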

Empirical speedups reach $38.8\%$ for planning workloads, with overhead under $1\%$ and plugin adaptation requiring minimal code. The system's efficiency depends on matching tool and decode latency; future work will involve richer dependency tracking and predictive scheduling.


Execution-aware tool-jumping constitutes a broad and rapidly evolving methodological substrate for reliable, efficient, and secure LLM agent operation across planning, coding, robotics, and serving. Advances in sequential chain reasoning, joint policy adaptation, cost/performance tradeoff optimization, and system-level scheduling have enabled agents to reason over extended execution traces, adapt to real-time context, and orchestrate multi-tool pipelines with robust safety and task fidelity. Continued research targets scalability, cross-language generalization, dynamic scheduling, and defense against emergent chain vulnerabilities.
