A Plan Reuse Mechanism for LLM-Driven Agent

Published 24 Dec 2025 in cs.MA (arXiv:2512.21309v2)

Abstract: Integrating LLMs into personal assistants, like Xiao Ai and Blue Heart V, effectively enhances their ability to interact with humans, solve complex tasks, and manage IoT devices. Such assistants are also termed LLM-driven agents. Upon receiving user requests, the LLM-driven agent generates plans using an LLM, executes these plans through various tools, and then returns the response to the user. During this process, the latency for generating a plan with an LLM can reach tens of seconds, significantly degrading user experience. Real-world dataset analysis shows that about 30% of the requests received by LLM-driven agents are identical or similar, which allows the reuse of previously generated plans to reduce latency. However, it is difficult to accurately define the similarity between the request texts received by the LLM-driven agent through directly evaluating the original request texts. Moreover, the diverse expressions of natural language and the unstructured format of plan texts make implementing plan reuse challenging. To address these issues, we present and implement a plan reuse mechanism for LLM-driven agents called AgentReuse. AgentReuse leverages the similarities and differences among requests' semantics and uses intent classification to evaluate the similarities between requests and enable the reuse of plans. Experimental results based on a real-world dataset demonstrate that AgentReuse achieves a 93% effective plan reuse rate, an F1 score of 0.9718, and an accuracy of 0.9459 in evaluating request similarities, reducing latency by 93.12% compared with baselines without using the reuse mechanism.

Summary

  • The paper introduces AgentReuse, an intent-and-parameter-aware framework that achieves a 93% plan reuse rate and reduces latency by over 93%.
  • It employs BERT-based intent classification and cosine similarity on deparameterized queries, improving F1 scores by up to 6.8 points compared to baselines.
  • The approach streamlines plan generation in LLM-driven agents, offering robust key parameter extraction and structured plan representation for AIoT and mobile applications.

Plan Reuse for LLM-Driven Agents: The AgentReuse Mechanism

Introduction and Motivation

The integration of LLMs into autonomous agents for personal assistance and AIoT tasks has dramatically expanded the expressive and reasoning capacity of such systems. However, the computational and latency costs of LLM-based plan generation remain a significant bottleneck. Data-driven analysis demonstrates that approximately 30% of requests to LLM-driven agents are semantically identical or similar, rendering them amenable to plan reuse. Naive semantic caching approaches, such as vector-based LLM response caching, are nonetheless insufficient for agent plan reuse: key parameters are entangled with the surrounding request text, and the generated plans lack a structured representation.

Workflow of LLM-Driven Agents

The canonical workflow employed by LLM-driven agents comprises four phases: (1) the agent receives a user request in natural language, (2) the LLM generates a decomposed plan, (3) the agent dispatches sub-tasks to external tools and aggregates results, and (4) the agent produces a user-facing response. This structure exposes the plan generation latency t_p as a dominant factor in end-to-end response time, since auto-regressive plan synthesis must emit typically lengthy output sequences.

Figure 1: High-level workflow in the LLM-driven agent pipeline, isolating the plan generation and execution stages.

Eliminating redundant t_p for similar or identical requests can thus directly improve responsiveness and overall user experience.
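The four-phase loop above can be sketched as a minimal Python skeleton; all function names and stub bodies here are illustrative placeholders, not the paper's implementation.

```python
# Minimal sketch of the canonical four-phase agent loop.
# Every helper below is an illustrative stub, not the paper's code.

def llm_generate_plan(request: str) -> list[str]:
    # Phase 2: in a real agent this is an auto-regressive LLM call and
    # dominates end-to-end latency (t_p).
    return [f"step: resolve '{request}'"]

def dispatch(step: str) -> str:
    # Phase 3: route one sub-task to an external tool and collect its result.
    return f"done({step})"

def handle_request(request: str) -> str:
    plan = llm_generate_plan(request)        # phase 2: plan generation
    results = [dispatch(s) for s in plan]    # phase 3: tool execution
    return "; ".join(results)                # phase 4: user-facing reply

print(handle_request("turn on the lights"))
```

Because phase 2 dominates the latency budget, it is the natural target for caching, which motivates the reuse mechanism described next.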

Challenges in Plan Reuse

Three core challenges must be resolved to operationalize plan reuse in LLM-driven agent architectures:

  1. Similarity Definition: Classical embedding-based similarity on raw request strings often fails to recognize structurally equivalent requests with different parameterizations; e.g., "Book a ticket from Hefei to Beijing for the day after tomorrow" vs. "Book a ticket from Changsha to Shanghai for tomorrow."
  2. Parameter Identification and Replacement: Extracting and slot-filling key parameters (e.g., location, date) is essential, as naive string matching is confounded by natural language variability.
  3. Structured Plan Representation: Unstructured textual plans generated by the LLM lack the explicit structure for reliable parameter injection and component-level reuse.
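To make the first two challenges concrete: after slot values are stripped, the two booking requests from the example collapse to an identical template, even though their raw strings (and hence raw embeddings) differ. A minimal sketch, where the slot lexicons are invented purely for illustration:

```python
# Deparameterization sketch: strip slot values (locations, dates) so that
# structurally equivalent requests map to one template.
# The slot lexicons below are invented for illustration only.

LOCATIONS = ["Hefei", "Beijing", "Changsha", "Shanghai"]
DATES = ["the day after tomorrow", "tomorrow"]

def deparameterize(request: str) -> str:
    for loc in LOCATIONS:
        request = request.replace(loc, "<location>")
    # Replace longest date phrases first so "tomorrow" does not clobber
    # "the day after tomorrow".
    for date in sorted(DATES, key=len, reverse=True):
        request = request.replace(date, "<date>")
    return request

a = deparameterize("Book a ticket from Hefei to Beijing for the day after tomorrow")
b = deparameterize("Book a ticket from Changsha to Shanghai for tomorrow")
assert a == b  # both: "Book a ticket from <location> to <location> for <date>"
print(a)
```

In AgentReuse the slot extraction is performed by a joint intent/slot NLU model rather than lexicon lookup, but the effect is the same: similarity is computed over templates, not raw strings.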

The AgentReuse Mechanism

AgentReuse addresses these challenges through an intent-and-parameter-aware framework, incorporating intent classification, robust parameter identification, and structured plan representation.

Figure 2: Overview of AgentReuse’s operational flow, including intent classification, key parameter identification, and structured plan entry and retrieval.

Key components include:

  • Intent Classification: Requests are clustered into intent categories using BERT-based models, confining retrieval to semantically consistent class subsets and lowering both false positive rate and retrieval cost.
  • Key Parameter Extraction: Joint intent/slot models (e.g., BERT-like NLU) are used to identify and remove key parameters from request queries, producing a normalized template form.
  • Similarity Calculation: Vector representations (m3e-small) are computed on deparameterized queries q_i^-. Cosine similarity search (via FAISS IndexFlatIP with normalized vectors) is executed within the matched intent class; reuse is triggered when similarity exceeds a threshold γ.
  • Structured Plan Representation: Prompting strategies coerce the LLM to emit plans with explicit step boundaries, dependencies, and parameter slots. These annotated plans are subsequently parsed into a structured form amenable to parameter substitution and reliable sub-task execution attribution.

    Figure 3: Structural execution graph extracted from a plan, capturing sub-task dependencies and slot mapping for effective reuse in new queries.
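Putting the components together, the reuse decision can be sketched as below, with numpy standing in for FAISS (inner product over L2-normalized vectors is equivalent to cosine similarity, the IndexFlatIP-with-normalization pattern the paper describes). The embeddings, cached plan format, and slot values are all toy placeholders.

```python
# Reuse-decision sketch: cosine similarity over normalized embeddings of
# deparameterized queries, then slot substitution into a cached plan.
# Vectors, plan format, and slot values are toy placeholders.
import numpy as np

GAMMA = 0.75  # reuse threshold used in the paper's evaluation

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

# Toy embeddings for two deparameterized queries within one intent class;
# in AgentReuse these would come from m3e-small.
cached_vec = normalize(np.array([0.90, 0.10, 0.20]))
new_vec = normalize(np.array([0.88, 0.12, 0.18]))

# Inner product of L2-normalized vectors equals cosine similarity,
# mirroring FAISS IndexFlatIP over normalized embeddings.
similarity = float(cached_vec @ new_vec)

# Cached structured plan with explicit parameter slots (illustrative format).
cached_plan = [{"tool": "ticket_booking",
                "args": {"from": "<location_1>", "to": "<location_2>",
                         "date": "<date>"}}]

def instantiate(plan, slots):
    # Substitute the new request's extracted parameters into the slots.
    return [{"tool": step["tool"],
             "args": {k: slots.get(v, v) for k, v in step["args"].items()}}
            for step in plan]

if similarity >= GAMMA:
    plan = instantiate(cached_plan,
                       {"<location_1>": "Changsha", "<location_2>": "Shanghai",
                        "<date>": "tomorrow"})
else:
    plan = None  # fall back to full LLM plan generation

print(round(similarity, 4), plan)
```

Restricting the search to one intent class (as the real system does via BERT-based classification before this step) both shrinks the candidate set and prevents cross-intent false positives that raw text similarity would admit.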

Experimental Results

Ablation studies and empirical evaluations on the SMP dialog dataset (2,664 requests, 23 intents) demonstrate strong performance advantages of AgentReuse over prior semantic response caching approaches (GPTCache, MeanCache) and relevant ablations (OneIntent, WithArgs).

  • Plan Reuse Accuracy: AgentReuse achieves an effective plan reuse rate of 93%.
  • Semantic Similarity Definition: F1 score of 0.9718 and accuracy of 0.9459 at γ = 0.75, respectively 6.8 and 13.06 points higher than the strongest competing baseline.
  • Latency Reduction: End-to-end system latency is reduced by 93.12% compared to agents without reuse, and by 60.61% compared to GPTCache-based solutions.

Figure 4: F1 score as a function of the reuse threshold γ, highlighting consistent superiority of AgentReuse across the precision-recall trade-off.

Further, performance is robust to threshold selection. Extraction of key parameters prior to similarity calculation is pivotal; the WithArgs variant, which omits this, exhibits substantially reduced recall and F1.

Figure 5: Comparative analysis of similarity scores between AgentReuse (with argument removal) and WithArgs (no argument removal), confirming the importance of parameter delexicalization.

Systemic and Resource Considerations

AgentReuse is designed for tractable system overhead in the deployment context of AIoT and mobile agents:

  • VRAM/Memory: The largest model (bert-base-chinese) introduces ~100 MB VRAM pressure; vector/cached plan storage per request is sub-MB.
  • Processing Latency: Reuse decision and plan adaptation add ~23 ms/request, which is negligible compared to LLM plan generation.
  • Baseline Integration: Most commercial assistants already incur the cost of intent classification; the incremental overhead is minimal.

    Figure 6: Latency breakdown per request for different reuse/caching methods, emphasizing that AgentReuse’s decision-making is dominated by classification and vector search steps.

Theoretical and Practical Implications

The approach demonstrates that the key to effective plan reuse in LLM-driven agents is fine-grained control over intent and argument structure, rather than naive text similarity. Caching at the plan granularity, coupled with structured abstraction, bridges the gap between conventional syntax-based cache/memoization and semantically robust agent reuse.

Crucially, the findings challenge common assumptions in LLM response caching: for agent pipelines, the correct reuse target is the plan, not the response, due to the prevalence of parametric variation and the downstream integration of real-time, personalized, or environment-coupled execution.

The method is extensible: multi-intent parsing, explicit causality modeling for compound requests, and richer plan graph abstractions can further improve real-world applicability.

Outlook and Future Directions

Future work includes enhancing multi-intent classification and slot-filling for more ambitious AIoT scenarios (e.g., simultaneous multi-device orchestration), leveraging execution tracing in serverless/containerized environments for richer plan instrumentation, and exploring dynamic adaptation of similarity thresholds and plan abstraction levels as user and environment context evolves.

Adoption of plan-level caching has practical benefits for cost amortization (in compute/budget-constrained settings) and theoretical value for the study of compositional generalization and agent tool-use abstraction.

Conclusion

AgentReuse establishes a comprehensive and empirically validated framework for plan reuse in LLM-driven agents, effectively addressing ambiguity in request similarity, robust key parameter handling, and structured plan representation. The results demonstrate that orchestrating plan reuse at the semantic and structural level yields substantial gains in latency, accuracy, and practical deployability for agent-based AI systems, with applicability to both IoT scenarios and broader domains where LLM-driven agents are operational (2512.21309).
