
Tulip Agent Architecture

Updated 28 November 2025
  • Tulip Agent is an architecture that decouples tool metadata from LLM prompts using a vector store-backed CRUD system.
  • It leverages semantic search and recursive task decomposition to efficiently retrieve and execute the most relevant tools.
  • Empirical studies demonstrate 2–3× cost reductions and high accuracy across mathematical and robotics applications.

Tulip Agent is an architecture for autonomous LLM–based agents that enables scalable, dynamic, and efficient use of large tool libraries through Create, Read, Update, and Delete (CRUD) access. Distinct from prior agent tool-use paradigms, Tulip decouples the representation and retrieval of tools from the LLM's prompt context, reducing inference costs, overcoming context window limitations, and permitting runtime tool set evolution. Tulip leverages a vector store–backed tool library with semantic search and recursive task decomposition, demonstrating robust performance across mathematical and robotics domains (Ocker et al., 31 Jul 2024).

1. Design Goals and Motivations

Tulip Agent was developed to address three principal shortcomings of tool-augmented LLM agents: (1) elevated inference costs from encoding all tool descriptions in prompts; (2) LLMs' limited ability to select appropriate tools when presented with large tool sets in-context; and (3) static, upfront tool selection that prevents runtime extension or modification. The architectural objectives are:

  • Decoupling tool descriptions from context: Maintaining an external, non-parametric vector store for tool metadata avoids prompt bloat and context-window saturation.
  • Semantic tool retrieval at scale: Enabling sublinear search over arbitrarily large tool sets using vector similarity search (with recursive decomposition) reduces the combinatorial burden of tool selection.
  • On-the-fly tool adaptation: Allowing the agent direct CRUD access to its own tool library enables dynamic extension, refinement, or curation of available capabilities.

Contrasted with previous state-of-the-art “function-calling” or “Tool API” approaches, Tulip does not encode all $N$ tool descriptions in the system prompt. Instead, it retrieves only the $k \ll N$ most relevant tools by embedding similarity, and supports tool library mutation through agent-invokable meta-tools (Ocker et al., 31 Jul 2024).

2. Formal CRUD Operations and Tool Representation

Let $T$ denote the (dynamic) set of tool descriptors. Each tool $t \in T$ is defined by:

  • $\mathrm{id}(t)$: unique identifier,
  • $\mathrm{desc}(t)$: natural-language description,
  • $e_t = \mathrm{EMBED}(\mathrm{desc}(t))$: embedding vector,
  • $\mathrm{impl}(t)$: executable Python implementation.
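
Expressed as a data structure, a tool descriptor might look like the following sketch (the field names are illustrative, not the framework's actual schema):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    id: str                       # id(t): unique identifier
    desc: str                     # desc(t): natural-language description
    embedding: list[float]        # e_t = EMBED(desc(t))
    impl: Callable[..., object]   # impl(t): executable Python implementation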

The vector store supports four operations (a code sketch follows the list):

  • Retrieve($q$; $k$, $\tau$): Given a subtask description $q$, returns the top-$k$ tools $t$ such that $\mathrm{sim}(e_q, e_t) \geq \tau$, where $\mathrm{sim}$ is cosine similarity or negative squared $L_2$ distance. Complexity is $O(\log |T|)$ with HNSW indexing.
  • Insert($t_\text{new}$): Embeds $\mathrm{desc}(t_\text{new})$ and inserts $t_\text{new}$ into $T$ and the index. Amortized $O(\log |T|)$ complexity.
  • Delete($\mathrm{id}(t)$): Removes $t$ from $T$ and the index. $O(\log |T|)$ complexity.
  • Update($\mathrm{id}$, $\Delta\mathrm{desc}$): Updates the description/code, re-embeds, deletes the old instance, and inserts the new version. $O(\log |T|)$ plus re-analysis cost.
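
The following is a minimal, dependency-free sketch of these four operations; brute-force cosine similarity stands in for the HNSW index and a toy n-gram hash stands in for the real embedding model, so nothing here is the project's actual API:

import math

def embed(text: str) -> list[float]:
    """Toy embedding: hash character trigrams into a unit-norm vector."""
    vec = [0.0] * 64
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sim(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # cosine, since inputs are unit-norm

class ToolLibrary:
    def __init__(self):
        self.tools = {}  # id -> (description, embedding, implementation)

    def insert(self, tool_id, desc, impl):
        self.tools[tool_id] = (desc, embed(desc), impl)

    def retrieve(self, query, k=5, tau=0.2):
        e_q = embed(query)
        scored = sorted(((sim(e_q, e_t), tid)
                         for tid, (_, e_t, _) in self.tools.items()), reverse=True)
        return [tid for score, tid in scored[:k] if score >= tau]

    def delete(self, tool_id):
        self.tools.pop(tool_id, None)

    def update(self, tool_id, new_desc=None, new_impl=None):
        desc, _, impl = self.tools[tool_id]
        self.insert(tool_id, new_desc or desc, new_impl or impl)  # re-embed and re-index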

A cost model is introduced:

  • $C_\text{LLM} = \alpha_\text{in} \cdot \#\text{input tokens} + \alpha_\text{out} \cdot \#\text{output tokens}$,
  • $C_\text{emb}(N) = \beta \cdot N$, with $\beta$ the cost per embedding call,
  • total inference cost is $C_\text{LLM} + C_\text{emb}$ (Ocker et al., 31 Jul 2024).
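
A toy instantiation of this model (the per-token and per-call rates below are made-up placeholders, not actual provider pricing):

# Hypothetical rates in USD; real values depend on the model provider.
alpha_in, alpha_out, beta = 0.5e-6, 1.5e-6, 1.0e-5

def c_llm(n_in: int, n_out: int) -> float:
    return alpha_in * n_in + alpha_out * n_out   # C_LLM

def c_emb(n_calls: int) -> float:
    return beta * n_calls                        # C_emb(N) = beta * N embedding calls

# One LLM call with 5,000 prompt / 300 completion tokens plus 3 embedding calls:
print(f"${c_llm(5000, 300) + c_emb(3):.6f}")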

3. Recursive Tool Search and Agent Execution Workflow

Processing proceeds as follows:

  1. Initialization: Tool modules are imported, introspected for function metadata (names, docstrings, parameters), and embedded. The vector store and lookup table are constructed.
  2. User Query Handling: The user’s request is decomposed via a chain-of-thought (CoT) prompt into subtasks $P = M_\text{td}(q)$.
  3. Recursive Tool Search: For each subtask, the agent embeds its description and retrieves the top-$k$ candidates. If no match exceeds the similarity threshold and the recursion depth allows, the subtask is further decomposed and the search recurses.
  4. Action Generation and Execution: The LLM receives only the selected tool descriptions (“context tools”), proposes actions, and the agent parses and executes the corresponding calls. Feedback loops permit additional tool selection or further decomposition if required.

The high-level pseudocode (abstracted):

def QUERY(user_query):
    # Decompose the user's request into subtasks via a CoT prompt.
    subtasks = LLM_decompose(user_query)

    def SEARCH_TOOLS(task_desc, depth=0):
        # Embed the subtask description and query the vector store.
        e_q = EMBED(task_desc)
        candidates = VectorStore.retrieve(e_q, top_k, tau)
        if not candidates and depth < MAX_DEPTH:
            # No tool clears the similarity threshold: decompose further and recurse.
            finer = LLM_decompose_single(task_desc)
            return [t for sub in finer for t in SEARCH_TOOLS(sub, depth + 1)]
        return candidates

    tools_for = {task: SEARCH_TOOLS(task) for task in subtasks}
    # Only the retrieved "context tools" are exposed to the LLM.
    actions = LLM_generate_calls(subtasks, tools_for)
    return [Lookup[action.name](**action.params) for action in actions]

Key equations:

  • Tool retrieval: $T^* = M_s(P, T) = \{\, t : \mathrm{sim}(e_q, e_t) \geq \tau \,\}_{\text{top-}k}$
  • Task decomposition: $P = M_\text{td}(q)$ (Ocker et al., 31 Jul 2024).

4. Cost Analysis and Scaling

Consider $N$ tools (e.g., $N = 100$), each with $L \approx 50$ tokens of description, $k$ retrieved candidates per subtask, and $M \approx 3$ subtasks per query.

  • Baseline (prompting with all tools): Context size is $L \cdot N$ ($\sim$5,000 tokens for $N = 100$), incurring a high $C_\text{LLM}$ per query.
  • Tulip Agent: The embedding cost for $N$ tools is paid once (precomputed); each query requires retrieval for $M$ subtasks ($\sim$3 embedding calls) and includes only $k \cdot L$ tool tokens per subtask in prompts ($\sim$250–300 tokens), sharply reducing $C_\text{LLM}$.
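
A back-of-the-envelope check of these figures, assuming $k = 5$ retrieved tools per subtask (consistent with the retrieval-size ablation in Section 5):

N, L, M, k = 100, 50, 3, 5        # tools, tokens per description, subtasks per query, tools per subtask

baseline_prompt_tokens = N * L    # all descriptions in-context: 5000 tokens per query
tulip_prompt_tokens = k * L       # retrieved descriptions only: 250 tokens per subtask
embedding_calls = M               # one embedding per subtask: 3 calls per query

print(baseline_prompt_tokens, tulip_prompt_tokens, embedding_calls)  # 5000 250 3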

Empirical findings on the mathematics benchmark (100 tools, “Hard” tasks):

Agent                  Correctness   Cost (USD)
BaseAgent              0.50          0.001
CotToolAgent           0.51          0.012
CotTulipAgent          0.46          0.008
PrimedCotTulipAgent    0.55          0.004

These results indicate a $2$–$3\times$ reduction in per-task cost for Tulip variants over the classic all-tools-in-prompt baseline, with preserved or improved correctness for non-trivial tasks (Ocker et al., 31 Jul 2024).

5. Empirical Studies and Benchmarks

Three ablations are reported across 60 math tasks, stratified by tool count: Easy (1 tool), Medium (2–3 tools), and Hard ($\geq 4$ tools). Metrics include correctness, precision/recall of tool calls, and dollar cost.

  • LLM Model Ablation: On Hard tasks, correctness for gpt-3.5-turbo-0125 can dip to 0.46–0.55 for Tulip variants, whereas gpt-4 variants maintain 0.96–1.00.
  • Embedding Model Ablation: Differences between OpenAI’s ada-002 and text-embedding-3-* have minimal effect on advanced CoT-enabled Tulip agents, but matter for MinimalTulipAgent.
  • Retrieval Size Ablation: For PrimedCotTulipAgent, increasing $k$ above 5 yields no further accuracy gains, confirming that decompositional planning enhances retrieval efficiency.

These studies support the claims that semantic search with CoT-based decomposition delivers high tool-selection precision and recall at reduced computational cost, and that recursive task decomposition is essential for complex multi-step problems (Ocker et al., 31 Jul 2024).

6. Application to Embodied Robotics

Tulip Agent generalizes beyond mathematical task-solving to embodied AI. In a simulated tabletop robotics environment, CotTulipAgent is used as the planner, given high-level instructions such as “Hand the glass_blue over to Felix” and “Pour cola into Daniel’s glass, then hand it to him.” Tools in the robot’s library, e.g., pour_into(source, target), hand_over(object, human), are retrieved and sequenced according to decomposed subtasks.

CotTulipAgent decomposes each instruction, retrieves relevant tools at each stage, executes the sequence, and achieves the desired robot behavior without additional prompt engineering. This suggests the architecture is robust to domain transfer and supports both symbolic and embodied tasks (Ocker et al., 31 Jul 2024).
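
A plausible plan for the second instruction, using the tool names above with stub implementations; the object identifiers and stand-in bodies are illustrative, not taken from the paper:

def pour_into(source: str, target: str) -> None:
    print(f"pouring from {source} into {target}")   # stand-in for the robot skill

def hand_over(object: str, human: str) -> None:
    print(f"handing {object} over to {human}")      # stand-in for the robot skill

Lookup = {"pour_into": pour_into, "hand_over": hand_over}

# Decomposed plan for "Pour cola into Daniel's glass, then hand it to him."
plan = [
    ("pour_into", {"source": "cola_bottle", "target": "glass_daniel"}),
    ("hand_over", {"object": "glass_daniel", "human": "Daniel"}),
]

for name, params in plan:
    Lookup[name](**params)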

7. Implementation Details

Tulip Agent is implemented in Python 3.10, with ChromaDB + HNSW providing vector-store infrastructure, OpenAI LLM APIs (gpt-3.5-turbo-0125, gpt-4-turbo) for language reasoning, and text-embedding-ada-002/text-embedding-3-* for semantic tool embedding. Function introspection uses Python AST and Sphinx-style docstrings, and importlib enables runtime dynamic loading.

Runtime tool extension involves the following workflow (sketched in code after the list):

  1. CreateTool: The agent invokes code-generation LLMs to create Python function stubs with docstrings and type hints.
  2. The generated code is syntax-checked and sandbox-executed.
  3. Successful artifacts are written to disk, introspected for name/doc, embedded, and registered in both vector store and lookup mapping.
  4. Updating tools is similar, but the prompt is seeded with prior code and edit instructions; deletion removes the tool from the database and disk as appropriate.
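
The creation path might be sketched as follows; generate_code, the tools/ directory, and the vector_store/lookup registries are illustrative stand-ins rather than the framework's actual API:

import ast
import importlib.util
import pathlib

def create_tool(task_description, generate_code, vector_store, lookup):
    # 1. Ask a code-generation LLM for a documented, type-hinted function.
    source = generate_code(task_description)

    # 2. Syntax-check; a production system would also sandbox-execute it.
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))

    # 3. Persist the artifact and load it dynamically via importlib.
    path = pathlib.Path("tools") / f"{func.name}.py"
    path.parent.mkdir(exist_ok=True)
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(func.name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # 4. Introspect the name/docstring, then register the tool in both the
    #    vector store (which embeds the description) and the lookup mapping.
    doc = ast.get_docstring(func) or task_description
    impl = getattr(module, func.name)
    vector_store.insert(func.name, doc, impl)
    lookup[func.name] = impl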

This infrastructure makes Tulip Agent the first open-source LLM-agent framework offering extensible, semantic-searchable tool libraries with native runtime CRUD operations, recursive task decomposition, and demonstrated scalability to hundreds or thousands of tools with substantial cost efficiency and consistently high task accuracy (Ocker et al., 31 Jul 2024).

References

  • Ocker et al. Tulip Agent: Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries. arXiv, 31 Jul 2024.
