
GraphForge: Llama3-8B for Graph Reasoning

Updated 11 December 2025
  • GraphForge is a fine-tuned large language model for graph reasoning that leverages a three-stage instruction decomposition—graph extraction, tool name identification, and parameter extraction—to ensure precise task execution.
  • It employs low-rank adaptation (LoRA) on the Llama3-8B-Instruct backbone, efficiently updating a minimal set of parameters while maintaining competitive performance on diverse graph tasks.
  • The model outperforms previous paradigms by achieving up to 99.52% accuracy on benchmarks, demonstrating robust generalization on both in-domain and out-of-distribution graph reasoning challenges.

GraphForge is a fine-tuned LLM for graph reasoning, based on the Llama3-8B-Instruct foundation, and developed using the GraphTool-Instruction methodology and the GTools dataset. Through a three-stage instruction decomposition—graph extraction, tool name identification, and tool parameter extraction—GraphForge achieves state-of-the-art performance across a suite of complex graph reasoning tasks, surpassing prior text-instruction and tool-instruction paradigms within the same model capacity tier. Leveraging low-rank adaptation (LoRA) for efficient parameter updating, GraphForge matches or exceeds the results of much larger proprietary models on both in-domain and out-of-distribution benchmarks (Wang et al., 11 Dec 2024).

1. Model Architecture and Adaptation Protocol

GraphForge is initialized with the Llama3-8B-Instruct backbone. Rather than modifying the architecture, LoRA is used to perform parameter-efficient fine-tuning, updating the model’s weights via low-rank adapters. The update operation, as formalized in the paper (Eq. 17), is

$$
h_i = W x_i + \frac{\alpha}{r} B A x_i
$$

where $W$ is the frozen pretrained weight matrix of the layer, $A \in \mathbb{R}^{r \times n}$ and $B \in \mathbb{R}^{m \times r}$ are the adapter matrices with $r \ll n, m$, and $\alpha$ is a scaling hyperparameter. Only $A$ and $B$ are updated during fine-tuning, confining the adaptation to a small rank-$r$ subspace.
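As a concrete illustration, the following is a minimal sketch of this update rule in PyTorch; the layer dimensions, rank, initialization, and scaling are illustrative assumptions rather than the paper's reported configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update (alpha / r) * B A x."""

    def __init__(self, n_in: int, n_out: int, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = nn.Linear(n_in, n_out, bias=False)
        self.base.weight.requires_grad_(False)               # W stays frozen
        self.A = nn.Parameter(torch.randn(r, n_in) * 0.01)   # A in R^{r x n}
        self.B = nn.Parameter(torch.zeros(n_out, r))         # B in R^{m x r}, zero-initialized
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W x + (alpha / r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(n_in=4096, n_out=4096)
h = layer(torch.randn(2, 4096))   # only A and B receive gradients during fine-tuning
```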

GraphForge incorporates GraphTool-Instruction at inference. Each input is decomposed into the sequence of: (1) graph extraction, (2) tool name identification, and (3) tool parameter extraction. Each subtask produces intermediate outputs, which are parsed via regular expressions and ultimately drive an API call to an external graph library such as NetworkX, yielding the final solution.
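The sketch below illustrates this inference loop under stated assumptions: `ask_model` stands in for a call to the fine-tuned backbone, the prompts and the tool registry are illustrative placeholders rather than the paper's exact prompts, and graph parsing is deferred to an `extract_graph` helper sketched in Section 2.

```python
import re
import networkx as nx

# Illustrative subset of the predefined tool registry, mapped to NetworkX calls.
TOOLS = {
    "shortest_path": lambda G, p: nx.shortest_path(G, p["source"], p["target"], weight="weight"),
    "edge_count":    lambda G, p: G.number_of_edges(),
}

def solve(task_text, ask_model, extract_graph):
    """Run the three GraphTool-Instruction stages and dispatch the final call to NetworkX.
    `ask_model` wraps the fine-tuned LLM; `extract_graph` is sketched in Section 2."""
    # Stage 1: graph extraction
    G = extract_graph(ask_model(f"Extract the graph from this task.\n{task_text}"))

    # Stage 2: tool name identification from the predefined tool set
    tool_name = ask_model(f"Pick exactly one tool from {sorted(TOOLS)}.\n{task_text}").strip()

    # Stage 3: tool parameter extraction (only for parameterized tasks)
    params = {}
    if tool_name == "shortest_path":
        raw = ask_model(f"Answer as 'source=<node> target=<node>'.\n{task_text}")
        m = re.search(r"source=(\d+)\s+target=(\d+)", raw)
        params = {"source": int(m.group(1)), "target": int(m.group(2))}

    # The final answer comes from the external graph library, not from free-form generation.
    return TOOLS[tool_name](G, params)
```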

2. Instruction Decomposition: The GraphTool-Instruction Methodology

GraphTool-Instruction decomposes graph reasoning tasks into three subtasks, each associated with a specialized prompt and post-processing step:

  • Graph Extraction ($\mathcal{G}$): Extracts the graph structure $(V, E)$, with optional edge attributes, from natural language or file references. The model uses prompt variants according to graph size. For small graphs (≤4096 tokens), explicit edge lists (NetworkX format) are requested; for larger graphs, file paths are extracted instead (see the extraction sketch after this list).
  • Tool Name Identification ($\mathcal{N}$): Selects, from a predefined set, the correct API/function according to the task. Prompts enumerate available tools and their descriptions; model outputs are parsed for the selected tool name.
  • Tool Parameter Extraction ($\mathcal{P}$): For parameterized tasks (e.g., shortest path, maximum flow), retrieves and formats the necessary tool parameters via a two-step prompt utilizing the tool’s parameter template, extracting source, target, etc.

Each intermediate output is post-processed via a regular expression to (i) accurately reconstruct the graph, (ii) ensure robust tool selection, and (iii) supply well-typed parameters for execution.
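To make the size-dependent graph extraction concrete, here is a hedged sketch of that branch; the regular expressions and the file-path convention are illustrative assumptions, not the exact patterns used in the paper.

```python
import re
import networkx as nx

def extract_graph(model_output: str, directed: bool = False):
    """Rebuild the graph from the graph-extraction stage: an explicit edge list for
    small graphs, or a file path for large graphs (illustrative patterns)."""
    G = nx.DiGraph() if directed else nx.Graph()

    # Large-graph branch: the response names an edge-list file instead of listing edges.
    path_match = re.search(r"[\w./-]+\.(?:txt|edgelist)", model_output)
    if path_match:
        return nx.read_edgelist(path_match.group(0), create_using=G, nodetype=int)

    # Small-graph branch: parse explicit "(u, v)" or "(u, v, weight)" tuples.
    for u, v, w in re.findall(r"\((\d+),\s*(\d+)(?:,\s*([\d.]+))?\)", model_output):
        G.add_edge(int(u), int(v), weight=float(w) if w else 1.0)
    return G
```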

3. GTools Dataset: Task Diversity and Data Construction

GTools is a procedurally generated dataset designed explicitly for instruction tuning in graph reasoning contexts. Its main features include:

| Feature | Description |
| --- | --- |
| Task coverage | 20 graph reasoning tasks (11 core types × directed/undirected variants) |
| Example count | 2,000 per task (total 40,000) |
| Graph types and size | WL-Graph: $\lvert V \rvert \in [2, 40]$, $\lvert E \rvert \leq 300$; EL-Graph: $\lvert V \rvert \in [41, 100]$, $\lvert E \rvert \leq 1000$ |
| Annotation | Alpaca-style triples $(\mathcal{I}^{(s)}, x_i^{(s)}, y_i^{(s)})$ for $s \in \{\mathcal{G}, \mathcal{N}, \mathcal{P}\}$ |

A key property of GTools construction is that instruction-response pairs are retained only when the tool's output exactly matches the gold-standard answer, ensuring annotation reliability.
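The sketch below illustrates that construction principle for a single task; the random-graph generator, the instruction template, and the Alpaca-style field names are assumptions made for illustration, not the paper's generation pipeline.

```python
import random
import re
import networkx as nx

def make_shortest_path_example(seed, n_nodes=10, p=0.3):
    """Build one candidate instruction-response pair; keep it only if re-running the
    tool on the graph reconstructed from the task text reproduces the gold answer."""
    rng = random.Random(seed)
    G = nx.gnp_random_graph(n_nodes, p, seed=seed)
    source, target = rng.sample(list(G.nodes), 2)
    if not nx.has_path(G, source, target):
        return None
    gold = nx.shortest_path(G, source, target)

    # Serialize the graph exactly as the instruction text will present it.
    edge_text = ", ".join(f"({u}, {v})" for u, v in G.edges)
    task = f"Edges: {edge_text}. Find the shortest path from {source} to {target}."

    # Exact-match filter: rebuild the graph from the text and re-run the tool.
    H = nx.Graph()
    H.add_edges_from((int(u), int(v)) for u, v in re.findall(r"\((\d+), (\d+)\)", task))
    if nx.shortest_path(H, source, target) != gold:
        return None

    return {"instruction": "Solve the task with the shortest_path tool.",
            "input": task, "output": str(gold)}

examples = [ex for ex in (make_shortest_path_example(i) for i in range(2000)) if ex]
```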

4. Fine-Tuning, Hyperparameters, and Infrastructure

GraphForge is fine-tuned using LoRA solely on $A$ and $B$, with a cross-entropy loss over token outputs (Eqs. 18–19). Key training hyperparameters are listed below (a configuration sketch follows the list):

  • Learning rate: $1 \times 10^{-5}$, with 10% warmup
  • Batch size: 4
  • Scheduler: cosine decay
  • Epochs: 3
  • Hardware: Fine-tuning on a single NVIDIA A800 (80 GB); inference across 16× Tesla T4 GPUs for open-source models
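A hedged configuration sketch with these hyperparameters, assuming the Hugging Face peft and transformers libraries; the LoRA rank, alpha, and target modules are illustrative, as the section does not report them.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Low-rank adapters on the attention projections; only the A and B matrices are trainable.
lora_config = LoraConfig(
    r=16,                       # illustrative rank, not reported in this section
    lora_alpha=32,              # illustrative scaling alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Hyperparameters reported above: lr 1e-5, 10% warmup, batch size 4, cosine decay, 3 epochs.
training_args = TrainingArguments(
    output_dir="graphforge-lora",
    learning_rate=1e-5,
    warmup_ratio=0.1,
    per_device_train_batch_size=4,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
)
```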

Training graphs are procedurally generated to span a diversity of connectivities and class labels; up to five task description templates are used for robustness. Regular expression post-processing is employed during both data construction and inference.

5. Empirical Evaluation and Results

GraphForge demonstrates superior results across several benchmarks:

  • WL-Graph Tasks (Table 4): GraphForge achieves 98.4% average accuracy, compared to the best text-instruction baseline (GraphWiz) at 46.2%.
  • Tool-Instruction Baselines (Table 5): For WL-Graphs, GraphForge outperforms GPT-3.5-turbo-FC by more than 30% and matches GPT-4o-FC (98.8% vs. 98.4% for WL; 99.0% for EL).
  • Generalization (Table 6): On the NLGraph benchmark, including out-of-domain (OOD) tasks such as Bipartite Matching and GNN state updates, GraphForge attains 99.52% accuracy, matching or exceeding closed-source models.
  • Per-task Performance: All tasks—including computationally intensive ones such as Maximum Flow—exceed 97% accuracy.

Ablation studies indicate that both Graph-Instruction (GI) and Parameter-Instruction (PI) are crucial. Removing GI decreases accuracy on Cycle Detection from 99% to 86.2% and on Triangle Detection from 97.8% to 66.8%. Removing PI in parametric tasks degrades Path from 98.8% to 89.0% and Shortest Path from 98.2% to 92.2%. Removing both GI and PI results in the most pronounced performance drops (Path: 51.2%, Shortest Path: 33.6%). Tool-name identification is found to be a near-solved subproblem, while large-graph extraction remains the bottleneck.

6. Analysis of Errors, Generalization, and Usability

GraphForge’s error spectrum is dominated by graph mismatches on very large WL-Graphs. The introduction of PI nearly eliminates parameter extraction errors. Syntax errors and tool misidentification are rare. High generalization to OOD problems suggests robustness of the instruction decomposition even on tasks not explicitly included in training.

The GraphTool-Instruction methodology is model-agnostic and can serve as a plug-and-play prompting strategy across open-source LLMs, providing substantial improvements without weight modification.
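As an illustration of that plug-and-play property, the same three-stage prompts can be routed through any instruction-tuned chat model; the sketch below assumes the Hugging Face transformers text-generation pipeline, and the model names are examples only.

```python
from transformers import pipeline

def make_asker(model_name: str):
    """Wrap an arbitrary open-source chat model as the ask_model callable used by
    the GraphTool-Instruction stages (no weight modification needed)."""
    generator = pipeline("text-generation", model=model_name)

    def ask_model(prompt: str) -> str:
        messages = [{"role": "user", "content": prompt}]
        out = generator(messages, max_new_tokens=256)
        # The pipeline returns the full chat; the last message is the model's reply.
        return out[0]["generated_text"][-1]["content"]

    return ask_model

# Swap backbones without touching the prompting logic.
ask_llama = make_asker("meta-llama/Meta-Llama-3-8B-Instruct")
ask_glm = make_asker("THUDM/glm-4-9b-chat")
```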

7. Outlook and Future Directions

Key implications from GraphForge's empirical results include:

  • Instruction decomposition—separating extraction, tool selection, and parameterization—enables small LLMs (<13B) to match or exceed performance of much larger or closed-source competitors.
  • The method supports integration into diverse open-source LLMs (e.g., GLM4-9B, Llama3-8B, Llama3.1-8B) in a prompt-based, stateless manner.
  • Extensions proposed include: expanding supported graph APIs (e.g., knowledge-graph or recommender modules), improving graph understanding (GU) in extremely large graph settings (possibly via chunked or hybrid retrieval strategies), and integrating iterative “Graph-of-Thought” processes for complex, multi-step reasoning.

The demonstrated advance is that carefully crafted instruction decomposition, coupled with targeted LoRA adaptation and a broad, high-fidelity tool-instruction dataset, bridges the gap between open, cost-effective LLMs and proprietary solutions on graph reasoning tasks (Wang et al., 11 Dec 2024).
