
GTools Dataset: Instruction-Tuning for Graph Reasoning

Updated 14 December 2025
  • GTools is a large-scale instruction-tuning dataset designed for decomposed graph reasoning, breaking tasks into graph extraction, tool name identification, and parameter extraction subtasks.
  • It employs algorithmically labeled examples using canonical graph algorithms like DFS and Dijkstra to ensure ground-truth correctness and self-consistency.
  • The dataset covers twenty canonical graph reasoning tasks on various graph types and sizes, supporting robust training and evaluation of LLM-based graph reasoning.

GTools is a large-scale instruction-tuning dataset built specifically for decomposed graph reasoning with LLMs. Designed in conjunction with the GraphTool-Instruction framework, GTools systematically addresses the extraction and manipulation of graph structure from natural-language prompts by breaking each task into graph extraction, tool name identification, and tool parameter extraction subtasks. Each example is algorithmically labeled using exact solutions (e.g., DFS, Dijkstra), ensuring ground-truth correctness and "self-consistency" of all decomposition steps. GTools covers twenty canonical graph reasoning tasks on both directed/undirected and weighted/unweighted graphs, and provides training and evaluation splits in two graph-size regimes.

1. Motivation and Design Goals

GTools was created in response to persistent deficits in prior LLM-based graph reasoning benchmarks, which either relied on text-only prompting (Text-Instruction) or ad hoc tool-calling schemas (Tool-Instruction). Both earlier paradigms exhibited poor reliability in (a) faithfully extracting arbitrary graph structure from natural-language input, and (b) reliably choosing and parameterizing the correct graph algorithm “tool.” By introducing a decomposed subtask protocol—graph extraction, tool name identification, and tool-parameter extraction—GTools enables the systematic evaluation and training of LLMs on graph reasoning tasks (Wang et al., 11 Dec 2024).

2. Task Taxonomy and Coverage

GTools encompasses twenty canonical graph reasoning tasks, organized by graph type and the complexity of required subtask decomposition.

  • Basic Graph‐Analysis (BGA) tasks: Require only graph structure and tool selection.
  • Parametric Graph-Query (PGQ) tasks: Involve graph structure, tool selection, plus extra parameters.

The following table catalogs the coverage:

| Task Class | Graph Type | Example Tasks |
|---|---|---|
| BGA | Undirected, unweighted | Cycle Detection, Edge Count |
| BGA | Undirected, weighted | Maximum Triangle Sum |
| BGA | Directed, unweighted | Topological Sort, Node Count |
| PGQ | Undirected, unweighted | Degree Count, Path Existence |
| PGQ | Undirected, weighted | Shortest Path, Maximum Flow |
| PGQ | Directed, weighted | Shortest Path, Maximum Flow |

Additional distinctions:

  • WL-Graph (“Within-Limit”): Graphs of up to 40 nodes and ≤300 edges.
  • EL-Graph (“Exceeds-Limit”): Graphs with 41–100 nodes and ≤1,000 edges.

3. Data Format, Storage, and Splits

Graph data in GTools is represented using a concrete schema reflecting practical tool invocation.

  • WL-Graphs: Graphs are inlined as Python/NetworkX-style edge lists, with weights stored as integer attributes:

    ```python
    edges = [(1, 2), (1, 3), (2, 3)]
    weights = {(1, 2): 5, (2, 3): 2}
    ```
  • EL-Graphs: Provided as external files with file paths supplied in prompts, e.g., /mnt/data/graph_072.csv.
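The inline WL-Graph schema above maps directly onto a NetworkX graph. A minimal sketch, assuming NetworkX is the target tool library (as the schema's naming suggests):

```python
# Build a weighted NetworkX graph from the inline WL-Graph schema above.
# Field names (edges, weights) follow the example in the text.
import networkx as nx

edges = [(1, 2), (1, 3), (2, 3)]
weights = {(1, 2): 5, (2, 3): 2}

G = nx.Graph()
G.add_edges_from(edges)
for (u, v), w in weights.items():
    G[u][v]["weight"] = w  # edges absent from `weights` carry no weight attribute
```

EL-Graphs would instead be loaded from the external file path given in the prompt.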

Dataset partitioning:

  • Training set: 40,000 examples (2,000 per task).
  • Held-out test set: 10,000 examples (500 per task).
  • No validation set: Original experiments tuned LLMs only on the full training split.

4. Graph Generation Protocol

Graphs are synthesized using a random graph generator with task-specific constraints:

  • WL-Graph regime: Sample node count $N \in [2, 40]$; add random edges until $|E| \leq 300$.
  • EL-Graph regime: $N \in [41, 100]$, $|E| \leq 1{,}000$.
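The WL-Graph regime can be sketched as follows; `sample_wl_graph` is a hypothetical helper, since the paper's exact generator is not specified here:

```python
# Sketch of the WL-Graph sampling regime: N in [2, 40], |E| <= 300,
# simple undirected graphs with no self-loops.
import random

def sample_wl_graph(seed=None):
    rng = random.Random(seed)
    n = rng.randint(2, 40)  # node count N in [2, 40]
    # All candidate undirected edges (u < v), shuffled for random selection.
    possible = [(u, v) for u in range(n) for v in range(u + 1, n)]
    rng.shuffle(possible)
    m = rng.randint(1, min(len(possible), 300))  # enforce |E| <= 300
    return n, possible[:m]

n, edges = sample_wl_graph(seed=0)
```

The EL-Graph regime would use the same scheme with $N \in [41, 100]$ and an edge cap of 1,000.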

Statistical averages:

$$\bar N_{\text{WL}} \approx 21.3, \quad \bar E_{\text{WL}} \approx 147.2$$

$$\bar N_{\text{EL}} \approx 68.4, \quad \bar E_{\text{EL}} \approx 534.8$$

Label sanity is enforced by:

  1. True/false label balance in Boolean tasks.
  2. Ensuring unique topological orderings.
  3. Five prompt paraphrases per task, reducing prompting bias.

5. Annotation Pipeline and Ground-Truth Consistency

Annotation leverages a multi-stage pipeline:

A. Automated Labeling: Ground-truth answers generated by canonical graph algorithms run on each sampled graph and parameter set.

B. Instructional Prompting: Llama3-8B, equipped with GraphTool-Instruction, produces outputs for graph extraction, tool name, and parameter extraction subtasks.

C. Filtering: Only examples with complete subtask consistency (matching parsed graph structure, correct tool name, extracted parameter set, and tool answer) are retained:

$$M(\bar g^{(G)}, \bar g^{(N)}, \bar g^{(P)}, \hat g) = [\bar g^{(G)} = g^{*(G)}] \land [\bar g^{(N)} = g^{*(N)}] \land [\bar g^{(P)} = g^{*(P)}] \land [\hat g = g^*]$$

D. Alpaca-Format Triples: Each retained example is serialized as three Alpaca-style (instruction, input, output) triples:

$$(I^{(G)}, x^{(G)}, y^{(G)}),\quad (I^{(N)}, x^{(N)}, y^{(N)}),\quad (I^{(P)}, x^{(P)}, y^{(P)})$$

Self-consistency across all 40,000 training instances is achieved via this protocol.
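The filtering predicate amounts to a conjunction of exact matches across the decomposed outputs. A minimal sketch, with hypothetical field names standing in for $g^{(G)}, g^{(N)}, g^{(P)}, \hat g$:

```python
# Sketch of the consistency filter M(...): an example survives only if
# every decomposed prediction exactly matches its algorithmic ground truth.
def keep_example(pred: dict, gold: dict) -> bool:
    """pred/gold hold 'graph', 'tool', 'params', 'answer' (assumed keys)."""
    return all(pred[k] == gold[k] for k in ("graph", "tool", "params", "answer"))
```

Any single mismatch (e.g., a wrong tool name) discards the whole example, which is what yields full self-consistency over the retained set.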

6. Evaluation Metrics

GTools specifies four canonical accuracies for benchmarking LLM graph reasoning:

  1. Graph-Extraction Accuracy: $\mathrm{Acc}_{G} = \frac{\#\{\text{extracted edge list exactly equals ground truth}\}}{\text{total examples}}$
  2. Tool-Name Accuracy: $\mathrm{Acc}_{N} = \frac{\#\{\text{correct tool calls}\}}{\text{total examples}}$
  3. Parameter-Extraction Accuracy: $\mathrm{Acc}_{P} = \frac{\#\{\text{exactly correct parameter sets}\}}{\text{total examples}}$
  4. Final-Answer Accuracy: $\mathrm{Acc}_{\text{task}} = \frac{\#\{\text{correct final answers}\}}{\text{total examples}}$

Boolean tasks additionally permit F1 scoring:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN}, \quad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
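All four accuracies are exact-match rates over the same example set, so they share one computation. A sketch, again with assumed field names:

```python
# Sketch: compute Acc_G, Acc_N, Acc_P, and Acc_task as exact-match rates.
# Each example holds a 'pred' dict and a 'gold' dict with assumed keys.
def accuracies(examples: list) -> dict:
    n = len(examples)
    fields = ("graph", "tool", "params", "answer")
    return {f: sum(ex["pred"][f] == ex["gold"][f] for ex in examples) / n
            for f in fields}
```

Exact matching is deliberately strict: a single missing edge or mis-ordered parameter counts the whole subtask as wrong.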

7. Representative Example Instances

Sample instance, Cycle Detection (Undirected, unweighted, BGA):

  • Prompt:

    ```text
    Graph: [(1,2),(2,3),(3,1),(4,5)]
    Task: "Does this graph contain a cycle?"
    ```

  • Extraction output $y^{(G)}$: [(1,2),(2,3),(3,1),(4,5)]
  • Tool name $y^{(N)}$: "has_cycle"
  • Parameters: (none)
  • Final answer: “True”

Sample instance, Shortest Path (Directed, weighted, PGQ):

  • Prompt:

    ```text
    Graph: [(A->B, w=5),(B->C, w=2),(A->C, w=20)]
    Task: "Find the shortest distance from A to C."
    ```

  • Extraction output $y^{(G)}$: edges=[(A,B),(B,C),(A,C)], weights={(A,B):5,(B,C):2,(A,C):20}
  • Tool name $y^{(N)}$: "dijkstra_shortest_path"
  • Parameters: source=A, target=C
  • Final answer: “7”
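Both worked examples can be reproduced with NetworkX standing in for the dataset's graph tools (a sketch; the dataset's actual tool implementations are not shown here):

```python
# Reproduce both example instances with NetworkX calls.
import networkx as nx

# Cycle Detection (BGA): triangle 1-2-3 yields a cycle; 4-5 does not.
G1 = nx.Graph([(1, 2), (2, 3), (3, 1), (4, 5)])
has_cycle = len(nx.cycle_basis(G1)) > 0  # True

# Shortest Path (PGQ): A->B->C costs 5 + 2 = 7, cheaper than A->C at 20.
G2 = nx.DiGraph()
G2.add_weighted_edges_from([("A", "B", 5), ("B", "C", 2), ("A", "C", 20)])
dist = nx.dijkstra_path_length(G2, "A", "C")  # 7
```

This mirrors the decomposition: the extracted edge list builds the graph, the tool name selects the algorithm, and the extracted parameters (source, target) feed the call.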

8. Limitations and Prospective Improvements

Current boundaries:

  • Only synthetic, randomly generated graphs; no real-world topologies included.
  • Exclusively static graphs; no support for dynamic or temporal structures.
  • Edge and node attributes limited to single integer weights/capacities.
  • Maximum graph size constrained to 100 nodes and 1,000 edges.

Suggested extensions include integrating larger graphs (via file streaming), multi-dimensional features, support for dynamic/temporal graphs (e.g., LLM4DyG), and real-world benchmarks drawn from domain-specific graph sources. A plausible implication is that future versions could incorporate more sophisticated comparison metrics such as graph edit distance and more extensive train/val/test splits on real datasets.

9. Summary and Relevance

GTools is the first instruction-tuning dataset for LLMs centered specifically on decomposed graph reasoning subtasks. Its systematic construction, ground-truthing via exact algorithms, and rigorous filtering ensure high fidelity and self-consistency, supporting empirical advances in LLM-based graph reasoning—such as observed improvements of over 30% in GraphForge (Llama3-8B) compared to Tool-Instruction enhanced GPT-3.5-turbo and strong relative performance to GPT-4o (Wang et al., 11 Dec 2024).
