BTGenBot-2: Open-Source Behavior Tree Generator

Updated 9 February 2026

BTGenBot-2 is an open-source 1B-parameter model that converts natural language task descriptions and robot action primitives into executable XML behavior trees.
It provides a standardized benchmark suite with 52 tasks for robotic navigation and tabletop manipulation, ensuring reproducible and fair evaluation in high-fidelity simulations.
The model employs zero-shot and one-shot prompting with XML syntax validation and error-recovery protocols. With error recovery it reaches a reported 90.38% zero-shot success rate, at roughly 11 s of inference per tree.

BTGenBot-2 is a 1B-parameter open-source small LLM designed to generate executable behavior trees (BTs) from natural language task descriptions and a list of robot action primitives. It outputs XML directly, so the generated trees can be deployed on resource-constrained robots without a conversion step. Developed to address the limitations of closed-source and computationally intensive LLM task planners in robotics, BTGenBot-2 also introduces a standardized benchmark suite for LLM-based BT generation. The benchmark supports reproducible, plug-and-play evaluation of any LLM's ability to produce operational BTs for robotic navigation and manipulation tasks within a high-fidelity simulated environment (Izzo et al., 2 Feb 2026).

1. Benchmark Composition and Task Structure

BTGenBot-2’s benchmark encompasses 52 distinct tasks divided between navigation (32 tasks: 12 easy, 10 medium, 10 hard) and tabletop manipulation (20 tasks: 6 easy, 8 medium, 6 hard), enabling comprehensive coverage over a range of robot capabilities and task difficulties. Tasks are grouped into three difficulty tiers: easy (18), medium (18), and hard (16), providing granularity for performance stratification.
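The per-domain and per-tier counts quoted above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the dictionary layout and function names are illustrative and are not part of the benchmark.

```python
# Task counts per domain and difficulty tier, copied from the text above.
TASKS = {
    "navigation": {"easy": 12, "medium": 10, "hard": 10},
    "manipulation": {"easy": 6, "medium": 8, "hard": 6},
}

def domain_totals(tasks):
    """Sum the tiers inside each domain."""
    return {domain: sum(tiers.values()) for domain, tiers in tasks.items()}

def tier_totals(tasks):
    """Sum each difficulty tier across domains."""
    totals = {}
    for tiers in tasks.values():
        for tier, count in tiers.items():
            totals[tier] = totals.get(tier, 0) + count
    return totals

print(domain_totals(TASKS))
print(tier_totals(TASKS))
print(sum(domain_totals(TASKS).values()))
```

The totals agree with the stated 32 navigation tasks, 20 manipulation tasks, and the 18/18/16 split across tiers, for 52 tasks overall.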

The simulation environment for all tasks is NVIDIA Isaac Sim, chosen for its high-fidelity physics modeling (covering both wheeled and manipulator robots), native ROS 2 and BehaviorTree.CPP integration, and widespread adoption within robotic research. This ensures environmental consistency and reproducibility.

Behavior trees are specified using an XML format compatible out-of-the-box with ROS 2 BehaviorTree.CPP. Key XML schema elements include:

- Root: <BehaviorTree ID="...">
- Control-flow internal nodes: <Sequence>, <Fallback>, <Parallel>, <Decorator>
- Leaf nodes: <Action name="..."/>, <Condition name="..."/>
- Action parameters via XML attributes, e.g. <Action name="GoTo" target="wp1"/>

XML schema validity is enforced during BT generation via a Python XML parser and a YAML action/condition specification, ensuring that only allowed primitives are invoked.
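A minimal sketch of such a check is shown below, using only Python's standard `xml.etree.ElementTree` parser. The `ALLOWED` dictionary is a hypothetical stand-in for the parsed YAML specification, and the primitive `IsBatteryOk` is invented for illustration; the actual specification format used by the benchmark is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the parsed YAML action/condition specification:
# each primitive maps to the set of attribute names it accepts.
ALLOWED = {
    "actions": {"GoTo": {"target"}, "Pick": {"object"}},
    "conditions": {"IsBatteryOk": set()},
}
CONTROL_NODES = {"Sequence", "Fallback", "Parallel", "Decorator"}

def validate_bt(xml_text, allowed=ALLOWED):
    """Return a list of problems; an empty list means the tree passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"syntax error: {exc}"]
    if root.tag != "BehaviorTree" or "ID" not in root.attrib:
        return ["root must be <BehaviorTree ID=...>"]
    problems = []
    for node in root.iter():
        if node is root or node.tag in CONTROL_NODES:
            continue
        if node.tag == "Action":
            spec = allowed["actions"]
        elif node.tag == "Condition":
            spec = allowed["conditions"]
        else:
            problems.append(f"unknown element <{node.tag}>")
            continue
        name = node.get("name")
        if name not in spec:
            problems.append(f"{node.tag} '{name}' is not an allowed primitive")
            continue
        extra = set(node.attrib) - {"name"} - spec[name]
        if extra:
            problems.append(f"{node.tag} '{name}' has unexpected attributes {sorted(extra)}")
    return problems
```

Returning a list of problems rather than a boolean makes it easy to report why a candidate was rejected, which can be useful when a retry is requested.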

2. Evaluation Protocols and Execution Workflow

Evaluation within the benchmark is structured into three stages: model prompting, inference-time validation, and runtime execution.

A. Prompting Modes

- Zero-shot: The LLM receives a brief natural-language task description and a bullet list of available robot action primitives with parameter signatures.
- One-shot: Adds a single demonstration (an input prompt plus a reference XML BT) to the context.

Prompts follow a canonical structure:

Instruction: You are a BT generator.
Task: <natural language description>
Actions:
  - MoveTo(location)
  - Pick(object)
  - Place(object, location)
  ...
Output: a valid BehaviorTree in XML.
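The canonical structure above can be assembled programmatically. The sketch below is one possible way to do so; the function name `build_prompt` and the way the one-shot demonstration is placed before the query are assumptions, since the exact concatenation format is not specified.

```python
def build_prompt(task, actions, demo=None):
    """Assemble a zero-shot prompt, or a one-shot prompt when demo is given.

    demo is an optional (demo_task, demo_actions, demo_xml) tuple.
    """
    def block(task_text, action_list):
        lines = [
            "Instruction: You are a BT generator.",
            f"Task: {task_text}",
            "Actions:",
        ]
        lines += [f"  - {action}" for action in action_list]
        lines.append("Output: a valid BehaviorTree in XML.")
        return "\n".join(lines)

    parts = []
    if demo is not None:
        # Assumed layout: the demonstration prompt followed by its reference BT.
        demo_task, demo_actions, demo_xml = demo
        parts.append(block(demo_task, demo_actions) + "\n" + demo_xml)
    parts.append(block(task, actions))
    return "\n\n".join(parts)

print(build_prompt("Pick the cup and place it on the shelf.",
                   ["MoveTo(location)", "Pick(object)", "Place(object, location)"]))
```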

B. Inference-Time Validation

The 4-bit-quantized BTGenBot-2 model runs on an NVIDIA GTX 1080 (8 GB).
Each candidate BT undergoes:
1. XML syntax validation.
2. Verification that only allowed primitives from the prompt are used.
On validation failure, decoding is retried (usually resolved within 1–2 attempts).
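The validate-and-retry loop can be sketched as follows. The `generate` callable stands in for a call to the model, and the cap of three attempts is an assumption chosen for illustration; neither is taken from the original pipeline.

```python
import xml.etree.ElementTree as ET

def is_valid(xml_text, allowed_primitives):
    """Check syntax, then check that every leaf names an allowed primitive."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    leaves = [node for node in root.iter() if len(node) == 0]
    # A leaf is identified by its name attribute, or by its tag if none is set.
    return all(leaf.get("name", leaf.tag) in allowed_primitives for leaf in leaves)

def generate_with_retry(generate, prompt, allowed_primitives, max_attempts=3):
    """Call generate(prompt) until a candidate validates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(prompt)
        if is_valid(candidate, allowed_primitives):
            return candidate, attempt
    return None, max_attempts
```

Returning the attempt count alongside the tree makes it straightforward to log how often retries are needed.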

C. Runtime Execution and Recovery

Generated BTs are executed in Isaac Sim under the BehaviorTree.CPP framework.
Each leaf’s status (Success/Failure/Running) is logged by a C++ runtime validator.
Fallback and Retry nodes are allowed to trigger on local failure.
If unrecoverable, a subtree regeneration query is issued to the LLM; the revised subtree is validated and re-inserted for continued execution.
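The subtree regeneration step can be illustrated in Python with the standard library. The runtime validator described above is written in C++, so this is a language-neutral sketch of the idea rather than the actual implementation; `regenerate` and `validate` are hypothetical callables standing in for the LLM query and the inference-time checks.

```python
import xml.etree.ElementTree as ET

def replace_subtree(root, failed, new_xml):
    """Swap the failed node for a freshly parsed subtree, keeping its position."""
    replacement = ET.fromstring(new_xml)
    for parent in root.iter():
        for index, child in enumerate(list(parent)):
            if child is failed:
                parent.remove(child)
                parent.insert(index, replacement)
                return replacement
    raise ValueError("failed node is not a descendant of root")

def recover(root, failed, regenerate, validate):
    """Ask for a new subtree, validate it, and re-insert it on success."""
    request = ET.tostring(failed, encoding="unicode")
    new_xml = regenerate(request)
    if not validate(new_xml):
        return False
    replace_subtree(root, failed, new_xml)
    return True
```

Only the failed subtree is replaced, so the rest of the tree, including any siblings that already ran, is left untouched.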

3. Metrics and Model Ranking

Functional Metrics

Success Rate (SR):

$SR = \frac{\text{Number of Tasks Completed}}{\text{Total Number of Tasks}} \times 100\%$

Pass@k (P@k):

Fraction of tasks for which at least one of the top k generated BTs executes correctly.

Inference Time:

$t_\text{avg} = \frac{1}{N}\sum_{i=1}^{N} t_i$

(Average seconds per BT, where $t_i$ is the inference time of the $i$-th of $N$ generated BTs.)
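The three functional metrics can be computed as sketched below. The function names and the input layout (one boolean per task, or one list of per-attempt booleans per task) are assumptions made for illustration.

```python
def success_rate(outcomes):
    """Percentage of tasks completed; outcomes holds one boolean per task."""
    return 100.0 * sum(outcomes) / len(outcomes)

def pass_at_k(attempts_per_task, k):
    """Fraction of tasks with at least one success among the first k BTs."""
    hits = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return hits / len(attempts_per_task)

def average_time(times):
    """Mean inference time in seconds per BT."""
    return sum(times) / len(times)

# Illustrative data only, not results from the benchmark.
outcomes = [True] * 8 + [False] * 2
attempts = [[False, True, False], [False, False, False],
            [True, False, False], [False, False, True]]
print(success_rate(outcomes))
print(pass_at_k(attempts, 1), pass_at_k(attempts, 3))
```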

Non-Functional Metrics

- Action Coherency: 0/1 per task for exclusive use of allowed primitives.
- XML Syntax Validity: 0/1 via the schema-aware parser.
- Semantic Correctness: Binary, majority vote of three expert annotators.
- Model Footprint: 1B parameters; ≈4 GB runtime memory under quantization.
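The binary per-task metrics can be aggregated as in the following sketch. The record layout and the choice to report each metric as a mean over tasks are assumptions; only the 0/1 scoring and the three-annotator majority vote come from the description above.

```python
def majority_vote(votes):
    """Return 1 if more than half of the binary votes are 1, else 0."""
    return int(2 * sum(votes) > len(votes))

def non_functional_scores(records):
    """Average each 0/1 metric over tasks; records is a list of dicts."""
    n = len(records)
    return {
        "action_coherency": sum(r["coherent"] for r in records) / n,
        "xml_validity": sum(r["valid_xml"] for r in records) / n,
        "semantic_correctness": sum(majority_vote(r["votes"]) for r in records) / n,
    }

# Illustrative records only, not benchmark data.
records = [
    {"coherent": 1, "valid_xml": 1, "votes": [1, 1, 0]},
    {"coherent": 0, "valid_xml": 1, "votes": [0, 0, 1]},
]
print(non_functional_scores(records))
```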

Composite Model Ranking

Models are primarily ranked by success rate. Ties are resolved using Pass@3, then by average inference time.
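This ranking rule amounts to a lexicographic sort, sketched below. The `Result` class and the placeholder model names and numbers are illustrative only and do not correspond to reported results.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    success_rate: float  # percent, higher is better
    pass_at_3: float     # fraction, higher is better
    avg_time_s: float    # seconds per BT, lower is better

def rank(results):
    """Sort by success rate, then Pass@3, then average inference time."""
    return sorted(results, key=lambda r: (-r.success_rate, -r.pass_at_3, r.avg_time_s))

# Placeholder entries to exercise the tie-breaks.
results = [
    Result("model_a", 80.0, 0.90, 12.0),
    Result("model_b", 80.0, 0.95, 20.0),
    Result("model_c", 85.0, 0.85, 30.0),
    Result("model_d", 80.0, 0.90, 9.0),
]
print([r.model for r in rank(results)])
```

Negating the first two keys lets a single ascending sort handle metrics where higher is better alongside the one where lower is better.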

4. Empirical Results, Reproducibility, and Comparative Insights

All model evaluations are performed under identical conditions, with strict prompt format, task set, and open benchmarking in Isaac Sim. No model-specific prompt engineering is applied in zero-shot mode, isolating model-specific generative capability.

Key Findings

Zero-shot Success Rate:
- BTGenBot-2 (no error recovery): 84.61%
- GPT-5 (“Thinking mode”): 71.15%
- Claude Opus 4.1: 65.38%
With Error Recovery:
- BTGenBot-2-ER achieves 90.38% zero-shot SR.
One-shot Success Rate:
- BTGenBot-2-ER: 98.07%
- Proprietary LLMs: ≈85–88%
Inference Speed:
- BTGenBot-2 requires ≈11 s/BT, up to 16× faster than its predecessor (7B params).

This methodological rigor enables direct, reproducible, and fair head-to-head comparisons between open-source and proprietary LLMs (Izzo et al., 2 Feb 2026).

5. Representative Task and Output Examples

Task Specification → Generated Behavior Tree

A typical zero-shot prompt:

Task: Navigate to waypoints α, β, then return home, avoiding known hazards.
Actions:
  - MoveTo(point)
  - CheckObstacle()
  - EmergencyStop()
Output: XML BehaviorTree.

Abridged model output:

<BehaviorTree ID="NavSequence">
  <Sequence>
    <Fallback>
      <Sequence>
        <CheckObstacle/>
        <MoveTo point="α"/>
      </Sequence>
      <EmergencyStop/>
    </Fallback>
    <Fallback>
      <Sequence>
        <CheckObstacle/>
        <MoveTo point="β"/>
      </Sequence>
      <EmergencyStop/>
    </Fallback>
    <MoveTo point="home"/>
  </Sequence>
</BehaviorTree>
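To see how the Sequence and Fallback nodes in such a tree behave, the following sketch ticks a shortened version of the tree with stub leaves. It is a simplified interpreter, not BehaviorTree.CPP: the Running status is omitted, and the leaf semantics (CheckObstacle succeeds when the path is clear, EmergencyStop reports Failure) are assumptions made for the demonstration.

```python
import xml.etree.ElementTree as ET

XML = """<BehaviorTree ID="Nav">
  <Sequence>
    <Fallback>
      <Sequence><CheckObstacle/><MoveTo point="a"/></Sequence>
      <EmergencyStop/>
    </Fallback>
    <MoveTo point="home"/>
  </Sequence>
</BehaviorTree>"""

def tick(node, leaves):
    """Return True (Success) or False (Failure) for a node."""
    if node.tag in ("BehaviorTree", "Sequence"):
        # Succeeds only if every child succeeds; stops at the first failure.
        return all(tick(child, leaves) for child in node)
    if node.tag == "Fallback":
        # Succeeds as soon as one child succeeds; later children are skipped.
        return any(tick(child, leaves) for child in node)
    # Leaf: look up the stub by tag and pass XML attributes as arguments.
    return leaves[node.tag](**node.attrib)

def run(xml_text, path_clear):
    """Execute the tree with stub leaves and return (status, call log)."""
    log = []

    def record(name, status):
        log.append(name)
        return status

    leaves = {
        "CheckObstacle": lambda: record("CheckObstacle", path_clear),
        "MoveTo": lambda point: record(f"MoveTo({point})", True),
        "EmergencyStop": lambda: record("EmergencyStop", False),
    }
    return tick(ET.fromstring(xml_text), leaves), log

print(run(XML, path_clear=True))
print(run(XML, path_clear=False))
```

With a clear path the robot visits the waypoint and returns home. With an obstacle, the Fallback reaches EmergencyStop and the outer Sequence stops before the final MoveTo.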

Metric Calculation Example

For n=10 medium tasks, 8 successful BTs yield SR(medium) = 80%. If inference times are [7.2, 8.0, 6.8, ...] s, $t_{\text{avg}} \approx 7.5$ s.

6. Limitations, Extensions, and Future Directions

Current coverage is restricted to navigation and tabletop manipulation tasks within simulated environments. Potential extensions include:

- Broadening to mobile manipulation, aerial robotics, and multi-robot scenarios.
- Deploying on real hardware via ROS 1/ROS 2 hardware-in-the-loop.
- Introducing perception, vision, dynamic obstacles, and temporally variable tasks for increased realism and challenge.

The main limitation is the exclusive reliance on simulation, which may not capture all the complexities of real-world deployment.

7. Significance and Broader Context

The BTGenBot-2 benchmark is presented as the first public, ROS 2-compatible standardized suite for LLM-driven BT generation in robotics, addressing the previously unmet need for open, reproducible, and plug-and-play evaluation in this domain. The uniform task set, XML schema, simulation platform, and dual-layer (functional and non-functional) metrics replace ad-hoc comparison regimes, enabling the research community to benchmark, fine-tune, or further develop LLMs for robotic control (Izzo et al., 2 Feb 2026). Its methodology aligns with broader trends in LLM benchmarking, emphasizing open infrastructure, reproducibility, and comprehensive measurement of functional and non-functional properties.

References

Izzo et al. (2026). BTGenBot-2: Efficient Behavior Tree Generation with Small Language Models.
