
Tulip Agent Architecture

Updated 28 November 2025
  • Tulip Agent is an architecture that decouples tool metadata from LLM prompts using a vector store-backed CRUD system.
  • It leverages semantic search and recursive task decomposition to efficiently retrieve and execute the most relevant tools.
  • Empirical studies demonstrate 2–3× cost reductions and high accuracy across mathematical and robotics applications.

Tulip Agent is an architecture for autonomous LLM–based agents that enables scalable, dynamic, and efficient use of large tool libraries through Create, Read, Update, and Delete (CRUD) access. Distinct from prior agent tool-use paradigms, Tulip decouples the representation and retrieval of tools from the LLM's prompt context, reducing inference costs, overcoming context window limitations, and permitting runtime tool set evolution. Tulip leverages a vector store–backed tool library with semantic search and recursive task decomposition, demonstrating robust performance across mathematical and robotics domains (Ocker et al., 31 Jul 2024).

1. Design Goals and Motivations

Tulip Agent was developed to address three principal shortcomings of tool-augmented LLM agents: (1) elevated inference costs from encoding all tool descriptions in prompts; (2) LLMs' limited ability to select appropriate tools when presented with large tool sets in-context; and (3) static, upfront tool selection that prevents runtime extension or modification. The architectural objectives are:

  • Decoupling tool descriptions from context: Maintaining an external, non-parametric vector store for tool metadata avoids prompt bloat and context-window saturation.
  • Semantic tool retrieval at scale: Enabling sublinear search over arbitrarily large tool sets using vector similarity search (with recursive decomposition) reduces the combinatorial burden of tool selection.
  • On-the-fly tool adaptation: Allowing the agent direct CRUD access to its own tool library enables dynamic extension, refinement, or curation of available capabilities.

Contrasted with previous state-of-the-art “function-calling” or “Tool API” approaches, Tulip does not encode all $N$ tool descriptions in the system prompt. Instead, it retrieves only the $k \ll N$ most relevant tools by embedding similarity, and supports tool library mutation through agent-invokable meta-tools (Ocker et al., 31 Jul 2024).

2. Formal CRUD Operations and Tool Representation

Let $T$ denote the (dynamic) set of tool descriptors. Each tool $t \in T$ is defined by:

  • $\mathrm{id}(t)$: unique identifier,
  • $\mathrm{desc}(t)$: natural-language description,
  • $e_t = \mathrm{EMBED}(\mathrm{desc}(t))$: embedding vector,
  • $\mathrm{impl}(t)$: executable Python implementation.
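
Expressed as a data structure, a tool descriptor might look like the following sketch (the field names are illustrative, not the framework's actual schema):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    id: str                       # id(t): unique identifier
    desc: str                     # desc(t): natural-language description
    embedding: list[float]        # e_t = EMBED(desc(t))
    impl: Callable[..., object]   # impl(t): executable Python implementation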

The vector store supports four operations (a code sketch follows the list):

  • Retrieve($q$; $k$, $\tau$): Given a subtask description $q$, returns the top-$k$ tools $t$ such that $\mathrm{sim}(e_q, e_t) \geq \tau$, where $\mathrm{sim}$ is cosine similarity or negative squared $L_2$ distance. Complexity is $O(\log |T|)$ with HNSW indexing.
  • Insert($t_\text{new}$): Embeds $\mathrm{desc}(t_\text{new})$ and inserts $t_\text{new}$ into $T$ and the index. Amortized $O(\log |T|)$ complexity.
  • Delete($\mathrm{id}(t)$): Removes $t$ from $T$ and the index. $O(\log |T|)$ complexity.
  • Update($\mathrm{id}$, $\Delta\mathrm{desc}$): Updates the description/code, re-embeds, deletes the old instance, and inserts the new version. $O(\log |T|)$ plus re-analysis cost.
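
The following is a minimal, dependency-free sketch of these four operations; brute-force cosine similarity stands in for the HNSW index and a toy n-gram hash stands in for the real embedding model, so nothing here is the project's actual API:

import math

def embed(text: str) -> list[float]:
    """Toy embedding: hash character trigrams into a unit-norm vector."""
    vec = [0.0] * 64
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 64] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sim(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # cosine, since inputs are unit-norm

class ToolLibrary:
    def __init__(self):
        self.tools = {}  # id -> (description, embedding, implementation)

    def insert(self, tool_id, desc, impl):
        self.tools[tool_id] = (desc, embed(desc), impl)

    def retrieve(self, query, k=5, tau=0.2):
        e_q = embed(query)
        scored = sorted(((sim(e_q, e_t), tid)
                         for tid, (_, e_t, _) in self.tools.items()), reverse=True)
        return [tid for score, tid in scored[:k] if score >= tau]

    def delete(self, tool_id):
        self.tools.pop(tool_id, None)

    def update(self, tool_id, new_desc=None, new_impl=None):
        desc, _, impl = self.tools[tool_id]
        self.insert(tool_id, new_desc or desc, new_impl or impl)  # re-embed and re-index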

A cost model is introduced:

  • $C_\text{LLM} = \alpha_\text{in} \cdot \#\text{input tokens} + \alpha_\text{out} \cdot \#\text{output tokens}$,
  • $C_\text{emb}(N) = \beta \cdot N$, with $\beta$ the cost per embedding call,
  • total inference cost is $C_\text{LLM} + C_\text{emb}$ (Ocker et al., 31 Jul 2024).
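
A toy instantiation of this model (the per-token and per-call rates below are made-up placeholders, not actual provider pricing):

# Hypothetical rates in USD; real values depend on the model provider.
alpha_in, alpha_out, beta = 0.5e-6, 1.5e-6, 1.0e-5

def c_llm(n_in: int, n_out: int) -> float:
    return alpha_in * n_in + alpha_out * n_out   # C_LLM

def c_emb(n_calls: int) -> float:
    return beta * n_calls                        # C_emb(N) = beta * N embedding calls

# One LLM call with 5,000 prompt / 300 completion tokens plus 3 embedding calls:
print(f"${c_llm(5000, 300) + c_emb(3):.6f}")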

3. Recursive Tool Search and Agent Execution Workflow

Processing proceeds as follows:

  1. Initialization: Tool modules are imported, introspected for function metadata (names, docstrings, parameters), and embedded. The vector store and lookup table are constructed.
  2. User Query Handling: The user’s request is decomposed via a chain-of-thought (CoT) prompt into subtasks $P = M_\text{td}(q)$.
  3. Recursive Tool Search: For each subtask, the agent embeds its description and retrieves the top-$k$ candidates. If no match exceeds the similarity threshold and the recursion depth allows, the subtask is further decomposed and the search recurses.
  4. Action Generation and Execution: The LLM receives only the selected tool descriptions (“context tools”), proposes actions, and the agent parses and executes the corresponding calls. Feedback loops permit additional tool selection or further decomposition if required.

The high-level pseudocode (abstracted):

def QUERY(user_query):
    # Decompose the user's request into subtasks via a CoT prompt.
    subtasks = LLM_decompose(user_query)

    def SEARCH_TOOLS(task_desc, depth=0):
        # Embed the subtask description and query the vector store.
        e_q = EMBED(task_desc)
        candidates = VectorStore.retrieve(e_q, top_k, tau)
        if not candidates and depth < MAX_DEPTH:
            # No tool clears the similarity threshold: decompose further and recurse.
            finer = LLM_decompose_single(task_desc)
            return [t for sub in finer for t in SEARCH_TOOLS(sub, depth + 1)]
        return candidates

    tools_for = {task: SEARCH_TOOLS(task) for task in subtasks}
    # Only the retrieved "context tools" are exposed to the LLM.
    actions = LLM_generate_calls(subtasks, tools_for)
    return [Lookup[action.name](**action.params) for action in actions]

Key equations:

  • Tool retrieval: $T^* = M_s(P, T) = \{\, t : \mathrm{sim}(e_q, e_t) \geq \tau \,\}_{\text{top-}k}$
  • Task decomposition: $P = M_\text{td}(q)$ (Ocker et al., 31 Jul 2024).

4. Cost Analysis and Scaling

Consider $N$ tools (e.g., $N = 100$), each with $L \approx 50$ tokens of description, $k$ retrieved candidates per subtask, and $M \approx 3$ subtasks per query.

  • Baseline (prompting with all tools): Context size is $L \cdot N$ ($\sim$5,000 tokens for $N = 100$), incurring a high $C_\text{LLM}$ per query.
  • Tulip Agent: The embedding cost for $N$ tools is paid once (precomputed); each query requires retrieval for $M$ subtasks ($\sim$3 embedding calls) and includes only $k \cdot L$ tool tokens per subtask in prompts ($\sim$250–300 tokens), sharply reducing $C_\text{LLM}$.
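
A back-of-the-envelope check of these figures, assuming $k = 5$ retrieved tools per subtask (consistent with the retrieval-size ablation in Section 5):

N, L, M, k = 100, 50, 3, 5        # tools, tokens per description, subtasks per query, tools per subtask

baseline_prompt_tokens = N * L    # all descriptions in-context: 5000 tokens per query
tulip_prompt_tokens = k * L       # retrieved descriptions only: 250 tokens per subtask
embedding_calls = M               # one embedding per subtask: 3 calls per query

print(baseline_prompt_tokens, tulip_prompt_tokens, embedding_calls)  # 5000 250 3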

Empirical findings on the mathematics benchmark (100 tools, “Hard” tasks):

Agent                  Correctness   Cost (USD)
BaseAgent              0.50          0.001
CotToolAgent           0.51          0.012
CotTulipAgent          0.46          0.008
PrimedCotTulipAgent    0.55          0.004

These results indicate a $2$–$3\times$ reduction in per-task cost for Tulip variants over the classic all-tools-in-prompt baseline, with preserved or improved correctness for non-trivial tasks (Ocker et al., 31 Jul 2024).

5. Empirical Studies and Benchmarks

Three ablations are reported across 60 math tasks, stratified by tool count: Easy (1 tool), Medium (2–3 tools), and Hard ($\geq 4$ tools). Metrics include correctness, precision/recall of tool calls, and dollar cost.

  • LLM Model Ablation: On Hard tasks, correctness for gpt-3.5-turbo-0125 can dip to 0.46–0.55 for Tulip variants, whereas gpt-4 variants maintain 0.96–1.00.
  • Embedding Model Ablation: Differences between OpenAI’s ada-002 and text-embedding-3-* have minimal effect on advanced CoT-enabled Tulip agents, but matter for MinimalTulipAgent.
  • Retrieval Size Ablation: For PrimedCotTulipAgent, increasing $k$ above 5 yields no further accuracy gains, confirming that decompositional planning enhances retrieval efficiency.

These studies support the claims that semantic search with CoT-based decomposition delivers high tool-selection precision and recall at reduced computational cost, and that recursive task decomposition is essential for complex multi-step problems (Ocker et al., 31 Jul 2024).

6. Application to Embodied Robotics

Tulip Agent generalizes beyond mathematical task-solving to embodied AI. In a simulated tabletop robotics environment, CotTulipAgent is used as the planner, given high-level instructions such as “Hand the glass_blue over to Felix” and “Pour cola into Daniel’s glass, then hand it to him.” Tools in the robot’s library, e.g., pour_into(source, target), hand_over(object, human), are retrieved and sequenced according to decomposed subtasks.

CotTulipAgent decomposes each instruction, retrieves relevant tools at each stage, executes the sequence, and achieves the desired robot behavior without additional prompt engineering. This suggests the architecture is robust to domain transfer and supports both symbolic and embodied tasks (Ocker et al., 31 Jul 2024).
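
A plausible plan for the second instruction, using the tool names above with stub implementations; the object identifiers and stand-in bodies are illustrative, not taken from the paper:

def pour_into(source: str, target: str) -> None:
    print(f"pouring from {source} into {target}")   # stand-in for the robot skill

def hand_over(object: str, human: str) -> None:
    print(f"handing {object} over to {human}")      # stand-in for the robot skill

Lookup = {"pour_into": pour_into, "hand_over": hand_over}

# Decomposed plan for "Pour cola into Daniel's glass, then hand it to him."
plan = [
    ("pour_into", {"source": "cola_bottle", "target": "glass_daniel"}),
    ("hand_over", {"object": "glass_daniel", "human": "Daniel"}),
]

for name, params in plan:
    Lookup[name](**params)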

7. Implementation Details

Tulip Agent is implemented in Python 3.10, with ChromaDB + HNSW providing vector-store infrastructure, OpenAI LLM APIs (gpt-3.5-turbo-0125, gpt-4-turbo) for language reasoning, and text-embedding-ada-002/text-embedding-3-* for semantic tool embedding. Function introspection uses Python AST and Sphinx-style docstrings, and importlib enables runtime dynamic loading.

Runtime tool extension involves the following workflow (sketched in code after the list):

  1. CreateTool: The agent invokes code-generation LLMs to create Python function stubs with docstrings and type hints.
  2. The generated code is syntax-checked and sandbox-executed.
  3. Successful artifacts are written to disk, introspected for name/doc, embedded, and registered in both vector store and lookup mapping.
  4. Updating tools is similar, but the prompt is seeded with prior code and edit instructions; deletion removes the tool from the database and disk as appropriate.
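
The creation path might be sketched as follows; generate_code, the tools/ directory, and the vector_store/lookup registries are illustrative stand-ins rather than the framework's actual API:

import ast
import importlib.util
import pathlib

def create_tool(task_description, generate_code, vector_store, lookup):
    # 1. Ask a code-generation LLM for a documented, type-hinted function.
    source = generate_code(task_description)

    # 2. Syntax-check; a production system would also sandbox-execute it.
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))

    # 3. Persist the artifact and load it dynamically via importlib.
    path = pathlib.Path("tools") / f"{func.name}.py"
    path.parent.mkdir(exist_ok=True)
    path.write_text(source)
    spec = importlib.util.spec_from_file_location(func.name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # 4. Introspect the name/docstring, then register the tool in both the
    #    vector store (which embeds the description) and the lookup mapping.
    doc = ast.get_docstring(func) or task_description
    impl = getattr(module, func.name)
    vector_store.insert(func.name, doc, impl)
    lookup[func.name] = impl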

This infrastructure makes Tulip Agent the first open-source LLM-agent framework offering extensible, semantic-searchable tool libraries with native runtime CRUD operations, recursive task decomposition, and demonstrated scalability to hundreds or thousands of tools with substantial cost efficiency and consistently high task accuracy (Ocker et al., 31 Jul 2024).

References

  • Ocker et al. Tulip Agent: Enabling LLM-Based Agents to Solve Tasks Using Large Tool Libraries. arXiv, 31 Jul 2024.
