Tulip Agent Architecture
- Tulip Agent is an architecture that decouples tool metadata from LLM prompts using a vector store-backed CRUD system.
- It leverages semantic search and recursive task decomposition to efficiently retrieve and execute the most relevant tools.
- Empirical studies demonstrate 2–3× cost reductions and high accuracy across mathematical and robotics applications.
Tulip Agent is an architecture for autonomous LLM–based agents that enables scalable, dynamic, and efficient use of large tool libraries through Create, Read, Update, and Delete (CRUD) access. Distinct from prior agent tool-use paradigms, Tulip decouples the representation and retrieval of tools from the LLM's prompt context, reducing inference costs, overcoming context window limitations, and permitting runtime tool set evolution. Tulip leverages a vector store–backed tool library with semantic search and recursive task decomposition, demonstrating robust performance across mathematical and robotics domains (Ocker et al., 31 Jul 2024).
1. Design Goals and Motivations
Tulip Agent was developed to address three principal shortcomings of tool-augmented LLM agents: (1) elevated inference costs from encoding all tool descriptions in prompts; (2) LLMs' limited ability to select appropriate tools when presented with large tool sets in-context; and (3) static, upfront tool selection that prevents runtime extension or modification. The architectural objectives are:
- Decoupling tool descriptions from context: Maintaining an external, non-parametric vector store for tool metadata avoids prompt bloat and context-window saturation.
- Semantic tool retrieval at scale: Enabling sublinear search over arbitrarily large tool sets using vector similarity search (with recursive decomposition) reduces the combinatorial burden of tool selection.
- On-the-fly tool adaptation: Allowing the agent direct CRUD access to its own tool library enables dynamic extension, refinement, or curation of available capabilities.
Contrasted with previous state-of-the-art “function-calling” or “Tool API” approaches, Tulip does not encode all tool descriptions in the system prompt. Instead, it retrieves only the most relevant tools by embedding similarity, and supports tool library mutation through agent-invokable meta-tools (Ocker et al., 31 Jul 2024).
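The core retrieval principle can be illustrated with a toy example in pure Python; the 3-d vectors below stand in for real embedding-model output and are purely illustrative:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy tool embeddings; a real system would embed each tool's docstring
# with an embedding model rather than hand-pick vectors.
tool_embeddings = {
    "add": [0.9, 0.1, 0.0],
    "multiply": [0.8, 0.3, 0.1],
    "pour_into": [0.0, 0.2, 0.9],
}

def retrieve(query_embedding, top_k=2, tau=0.5):
    """Return up to top_k tool names whose similarity to the query exceeds tau."""
    scored = [(name, cosine_sim(query_embedding, e))
              for name, e in tool_embeddings.items()]
    scored = [(n, s) for n, s in scored if s >= tau]
    scored.sort(key=lambda ns: ns[1], reverse=True)
    return [n for n, _ in scored[:top_k]]

print(retrieve([0.85, 0.2, 0.05]))  # arithmetic-like query → ['add', 'multiply']
```

Only the returned tool names (and their descriptions) would be placed into the LLM prompt, regardless of how many tools the library holds.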
2. Formal CRUD Operations and Tool Representation
Let $T = \{t_1, \dots, t_N\}$ denote the (dynamic) set of tool descriptors. Each tool $t_i$ is defined by:
- $\mathrm{id}_i$: unique identifier,
- $d_i$: natural-language description,
- $e_i$: embedding vector,
- $c_i$: executable Python implementation.
The vector store supports:
- Retrieve($q$; $k$, $\tau$): Given a subtask description $q$, returns the top-$k$ tools $T^* \subseteq T$ such that $\mathrm{sim}(e_q, e_t) \geq \tau$, where $\mathrm{sim}$ is cosine similarity or negative squared $L_2$ distance. Complexity is $O(\log N)$ with HNSW indexing.
- Insert($t$): Embeds $d$ and inserts $t$ into $T$ and the index. Amortized $O(\log N)$ complexity.
- Delete($\mathrm{id}$): Removes $t$ from $T$ and the index. $O(\log N)$ complexity.
- Update($\mathrm{id}$, $d'$): Updates description/code, re-embeds, deletes the old instance, and inserts the new version. $O(\log N)$ plus re-analysis cost.
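A minimal in-memory sketch of these four operations (the class and the toy keyword-count embedding are illustrative, not the reference implementation; a real deployment backs this with a vector store and HNSW index):

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ToolLibrary:
    """In-memory sketch of the CRUD tool library. embed_fn maps a
    description string to a vector. Retrieval here is brute-force O(N);
    an HNSW index brings it to O(log N)."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.tools = {}  # tool id -> (description, embedding, callable)

    def insert(self, tool_id, description, fn):          # Create
        self.tools[tool_id] = (description, self.embed_fn(description), fn)

    def retrieve(self, query, top_k=3, tau=0.0):         # Read
        e_q = self.embed_fn(query)
        hits = [(tid, _cosine(e_q, e)) for tid, (_, e, _f) in self.tools.items()]
        hits = sorted((h for h in hits if h[1] >= tau), key=lambda h: -h[1])
        return [tid for tid, _ in hits[:top_k]]

    def update(self, tool_id, description, fn):          # Update
        self.delete(tool_id)                             # drop old instance
        self.insert(tool_id, description, fn)            # re-embed and re-insert

    def delete(self, tool_id):                           # Delete
        self.tools.pop(tool_id, None)

# Toy embedding: keyword counts stand in for a real embedding model.
embed = lambda s: [s.lower().count("add"), s.lower().count("pour")]

lib = ToolLibrary(embed)
lib.insert("add", "add two numbers", lambda a, b: a + b)
lib.insert("pour_into", "pour liquid from source into target",
           lambda src, tgt: f"poured {src} into {tgt}")
print(lib.retrieve("add 3 and 4", top_k=1))  # → ['add']
```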
A cost model is introduced:
- $C_{\text{LLM}} = c_{\text{in}} \cdot n_{\text{in}} + c_{\text{out}} \cdot n_{\text{out}}$ for $n_{\text{in}}$ input tokens and $n_{\text{out}}$ output tokens,
- $C_{\text{embed}} = c_e \cdot n_e$ (per embedding call),
- Total inference cost is $C = C_{\text{LLM}} + C_{\text{embed}}$ (Ocker et al., 31 Jul 2024).
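A small worked example of this additive cost model; the per-token prices below are placeholders chosen for the arithmetic, not actual API pricing:

```python
# Assumed per-token prices in USD (illustrative only).
c_in, c_out, c_embed = 0.5e-6, 1.5e-6, 0.1e-6

def query_cost(n_in, n_out, n_embed_tokens):
    """Additive cost model: LLM input + LLM output + embedding tokens."""
    return c_in * n_in + c_out * n_out + c_embed * n_embed_tokens

# Baseline: all 100 tool descriptions (~5,000 tokens) in every prompt.
baseline = query_cost(5_000, 200, 0)
# Tulip: ~300 retrieved-tool tokens plus 3 small embedding calls.
tulip = query_cost(300, 200, 3 * 20)

print(f"baseline {baseline:.6f} USD vs tulip {tulip:.6f} USD")
```

Under these (assumed) prices, the embedding calls are negligible next to the prompt-token savings, which is the mechanism behind the reported cost reductions.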
3. Recursive Tool Search and Agent Execution Workflow
Processing proceeds as follows:
- Initialization: Tool modules are imported, introspected for function metadata (names, docstrings, parameters), and embedded. The vector store and lookup table are constructed.
- User Query Handling: The user’s request is decomposed via a chain-of-thought (CoT) prompt into subtasks $s_1, \dots, s_m$.
- Recursive Tool Search: For each subtask, the agent embeds its description and retrieves top- candidates. If no match exceeds the similarity threshold and recursion depth allows, the subtask is further decomposed and the search recurses.
- Action Generation and Execution: The LLM receives only the selected tool descriptions (“context tools”) and proposed actions; it parses and executes calls. Feedback loops permit additional tool selection or further decomposition if required.
The high-level pseudocode (abstracted):
```python
def QUERY(user_query):
    subtasks = LLM_decompose(user_query)

    def SEARCH_TOOLS(task_desc, depth=0):
        e_q = EMBED(task_desc)
        candidates = VectorStore.retrieve(e_q, top_k, tau)
        if not candidates and depth < MAX_DEPTH:
            finer = LLM_decompose_single(task_desc)
            return [SEARCH_TOOLS(s, depth + 1) for s in finer]
        return candidates

    tools_for = {tsk: SEARCH_TOOLS(tsk) for tsk in subtasks}
    actions = LLM_generate_calls(subtasks, tools_for)
    for action in actions:
        result = Lookup[action.name](**action.params)
```
Relevant key equations:
- Tool retrieval: $T^* = M_s(P, T) = \{\, t : \mathrm{sim}(e_q, e_t) \geq \tau \,\}_{\text{top-}k}$
- Task decomposition: $\{s_1, \dots, s_m\} = M_d(P)$ (Ocker et al., 31 Jul 2024).
4. Cost Analysis and Scaling
Consider $N$ tools (e.g., $N = 100$), each with $\ell \approx 50$ tokens of description, $k$ retrieved candidates per subtask, and $m$ subtasks per query.
- Baseline (prompting with all tools): Context size $N \cdot \ell$ (5,000 tokens for $N = 100$), yielding a high $C_{\text{LLM}}$ per query.
- Tulip Agent: One-time embedding cost for all $N$ tools (precomputed), per-query retrieval for $m = 3$ subtasks (3 embedding calls), and only $k \cdot \ell$ tool tokens per subtask included in prompts (250–300 tokens in total), sharply reducing $C_{\text{LLM}}$.
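The scaling difference can be checked with quick arithmetic; the token counts below follow the rough figures above and are otherwise illustrative:

```python
ell = 50     # tokens per tool description (assumed average)
m, k = 3, 2  # subtasks per query, retrieved tools per subtask

def baseline_prompt_tokens(n_tools):
    """Every tool description goes into the prompt: grows linearly."""
    return n_tools * ell

def tulip_prompt_tokens(n_tools):
    """Only retrieved tools go into the prompt: constant in library size."""
    return m * k * ell

for n in (100, 1_000, 10_000):
    print(n, baseline_prompt_tokens(n), tulip_prompt_tokens(n))
```

The baseline's prompt grows with the library (5,000 tokens at 100 tools, 500,000 at 10,000), while Tulip's stays at roughly 300 tokens per query.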
Empirical findings on the mathematics benchmark (100 tools, “Hard” tasks):
| Agent | Correctness | Cost (USD) |
|---|---|---|
| BaseAgent | 0.50 | 0.001 |
| CotToolAgent | 0.51 | 0.012 |
| CotTulipAgent | 0.46 | 0.008 |
| PrimedCotTulipAgent | 0.55 | 0.004 |
These results indicate a $2$–$3\times$ reduction in per-task cost for Tulip variants over classic tool-prompt baselines, with preserved or improved correctness for non-trivial tasks (Ocker et al., 31 Jul 2024).
5. Empirical Studies and Benchmarks
Three ablation types are reported across $60$ math tasks, stratified by tool count: Easy (1 tool), Medium (2–3 tools), Hard ($\geq 4$ tools). Metrics include correctness, precision/recall for tool calls, and dollar cost.
- LLM Model Ablation: On Hard tasks, correctness for gpt-3.5-turbo-0125 can dip to $0.46$–$0.55$ for Tulip, whereas gpt-4 variants maintain $0.96$–$1.00$.
- Embedding Model Ablation: Differences between OpenAI’s ada-002 and text-embedding-3-* have minimal effect on advanced CoT-enabled Tulip agents, but matter for MinimalTulipAgent.
- Retrieval Size Ablation: For PrimedCotTulipAgent, increasing $k$ above $5$ yields no further accuracy gains, confirming that decompositional planning enhances retrieval efficiency.
These studies support the claims that semantic search with CoT-based decomposition delivers high tool selection precision and recall at reduced computational cost, and that tool-based decomposition is essential for complex multi-step problems (Ocker et al., 31 Jul 2024).
6. Application to Embodied Robotics
Tulip Agent generalizes beyond mathematical task-solving to embodied AI. In a simulated tabletop robotics environment, CotTulipAgent is used as the planner, given high-level instructions such as “Hand the glass_blue over to Felix” and “Pour cola into Daniel’s glass, then hand it to him.” Tools in the robot’s library, e.g., pour_into(source, target), hand_over(object, human), are retrieved and sequenced according to decomposed subtasks.
CotTulipAgent decomposes each instruction, retrieves relevant tools at each stage, executes the sequence, and achieves the desired robot behavior without additional prompt engineering. This suggests the architecture is robust to domain transfer and supports both symbolic and embodied tasks (Ocker et al., 31 Jul 2024).
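The second instruction, for example, decomposes into a two-step tool sequence. A minimal dispatcher sketch (the tool bodies are stand-ins; real implementations would drive the simulated manipulator):

```python
# Hypothetical stand-ins for the robot's tool library.
def pour_into(source, target):
    return f"poured {source} into {target}"

def hand_over(object, human):
    return f"handed {object} to {human}"

LOOKUP = {"pour_into": pour_into, "hand_over": hand_over}

# Decomposed plan for "Pour cola into Daniel's glass, then hand it to him";
# the agent would produce this via CoT decomposition plus tool retrieval.
plan = [
    ("pour_into", {"source": "cola", "target": "glass_daniel"}),
    ("hand_over", {"object": "glass_daniel", "human": "Daniel"}),
]

results = [LOOKUP[name](**params) for name, params in plan]
print(results)
```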
7. Implementation Details
Tulip Agent is implemented in Python 3.10, with ChromaDB + HNSW providing vector-store infrastructure, OpenAI LLM APIs (gpt-3.5-turbo-0125, gpt-4-turbo) for language reasoning, and text-embedding-ada-002/text-embedding-3-* for semantic tool embedding. Function introspection uses Python AST and Sphinx-style docstrings, and importlib enables runtime dynamic loading.
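The introspection step can be sketched with the standard `inspect` module (the paper's implementation parses the AST and Sphinx-style docstrings; the tool function here is hypothetical):

```python
import inspect

def weigh(mass_kg: float, gravity: float = 9.81) -> float:
    """Return the weight in newtons for a given mass."""
    return mass_kg * gravity

def tool_metadata(fn):
    """Extract the metadata the tool library needs: name, docstring,
    and parameter names with annotations."""
    sig = inspect.signature(fn)
    return {
        "name": fn.__name__,
        "doc": inspect.getdoc(fn),
        "params": {p: str(s.annotation) for p, s in sig.parameters.items()},
    }

meta = tool_metadata(weigh)
print(meta["name"], list(meta["params"]))  # → weigh ['mass_kg', 'gravity']
```

The description embedded into the vector store would typically concatenate the name, docstring, and parameter list.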
Runtime tool extension involves the following workflow:
- CreateTool: The agent invokes code-generation LLMs to create Python function stubs with docstrings and type hints.
- The generated code is syntax-checked and sandbox-executed.
- Successful artifacts are written to disk, introspected for name/doc, embedded, and registered in both vector store and lookup mapping.
- Updating tools is similar, but the prompt is seeded with prior code and edit instructions; deletion removes the tool from the database and disk as appropriate.
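The CreateTool path (syntax-check, execute, register) can be sketched as follows; the sandbox here is only a fresh namespace, whereas a real system would isolate execution far more strictly, and all names are illustrative:

```python
import ast

def register_generated_tool(source_code, library):
    """Sketch of CreateTool: syntax-check the generated code, run it in a
    scratch namespace, then register the resulting function. In the full
    system the function would also be written to disk, embedded, and
    inserted into the vector store."""
    ast.parse(source_code)        # raises SyntaxError if malformed
    namespace = {}
    exec(source_code, namespace)  # minimal "sandbox": isolated namespace only
    fn = next(v for k, v in namespace.items()
              if callable(v) and not k.startswith("__"))
    library[fn.__name__] = fn     # stand-in for vector-store registration
    return fn.__name__

library = {}
generated = '''
def square(x: float) -> float:
    """Return x squared."""
    return x * x
'''
name = register_generated_tool(generated, library)
print(name, library[name](4))  # → square 16
```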
This infrastructure makes Tulip Agent the first open-source LLM-agent framework offering extensible, semantic-searchable tool libraries with native runtime CRUD operations, recursive task decomposition, and demonstrated scalability to hundreds or thousands of tools with substantial cost efficiency and consistently high task accuracy (Ocker et al., 31 Jul 2024).