GRBench Dataset Overview
- GRBench is a graph reasoning benchmark dataset that systematically evaluates LLMs using text-attributed, interconnected graphs.
- It comprises 1,740 question–answer pairs from diverse domains such as academia, e-commerce, literature, healthcare, and legal records.
- The dataset facilitates transparent, multi-hop reasoning evaluation while mitigating LLM hallucinations with detailed graph traversal protocols.
GRBench is a manually constructed Graph Reasoning Benchmark dataset developed for the systematic evaluation of graph-based reasoning capabilities in LLMs and retrieval frameworks. Distinct from traditional retrieval benchmarks that assume knowledge is stored in independent text units, GRBench targets scenarios where information is structured as interconnected, text-attributed graphs, such as bibliographic networks, product catalogues, biomedical ontologies, and multi-entity legal records. This design enables robust assessment of both the factual grounding and the reasoning transparency of models that interface with structured knowledge bases.
1. Dataset Composition and Schema
GRBench comprises a collection of 1,740 question–answer pairs spanning ten real-world graphs across five domains: academia (covering six scientific fields), e-commerce (Amazon product graphs), literature (Goodreads), healthcare (disease and biomedical entity graphs), and legal (case-related document graphs) (Jin et al., 10 Apr 2024, Amayuelas et al., 18 Feb 2025, Kashmira et al., 11 Jul 2025). Each graph is a text-attributed graph, where nodes contain rich feature information (such as author, title, brand, or disease attributes), and edges encode explicit inter-entity relations (e.g., “written-by”, “cited-by”, “also-viewed”). The graphs possess heterogeneous node and edge types, and each is accompanied by a schema, facilitating multi-hop traversal and the retrieval of relational context.
Questions in GRBench are constructed using carefully designed templates, stratifying difficulty into three tiers:
- Easy: solvable by a single-hop lookup (e.g., “Who are the authors of {paper}?”).
- Medium: requiring multi-hop reasoning (e.g., “Who is the closest collaborator with {author}?”).
- Hard: demanding inductive or contextual reasoning not answerable by direct lookup (e.g., item recommendation based on graph context).
The dataset supports detailed model evaluation by providing the gold answer (ground truth), node and edge details, and, in some cases, reasoning traces derived from the graph's structure.
| Domain | Example Node Types | Relation Types | Number of Graphs | Description |
|---|---|---|---|---|
| Academic | Paper, Author, Venue | written-by, cited-by | 6 | Six subject-specific paper graphs |
| E-Commerce | Item, Brand | also-viewed, bought-together | 1 | Amazon product co-view network |
| Literature | Book, Author, Publisher, Series | written-by, part-of-series | 1 | Goodreads dataset |
| Healthcare | Disease, Gene, Drug | causes, treated-by | 1 | Biomedical/disease knowledge graph |
| Legal | Case, Opinion, Docket | cites, involved-in | 1 | Case and legal document network |
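To make the schema concrete, the following minimal Python sketch shows one way a text-attributed node and a GRBench-style question-answer record could be represented. The field names, the toy academic graph, and the example record are illustrative assumptions and do not necessarily match the released file format.

```python
# Minimal sketch of a text-attributed graph node and a GRBench-style QA record.
# Field names (e.g., "features", "neighbors") are illustrative assumptions and
# may not match the benchmark's actual on-disk format.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str                                   # e.g., "paper", "author", "item"
    features: dict = field(default_factory=dict)     # text attributes: title, brand, ...
    neighbors: dict = field(default_factory=dict)    # relation type -> list of node ids

# A toy academic graph: one paper written by one author.
graph = {
    "p1": Node("p1", "paper",
               features={"title": "Graph Chain-of-Thought", "year": "2024"},
               neighbors={"written-by": ["a1"]}),
    "a1": Node("a1", "author",
               features={"name": "Alice Example"},
               neighbors={"writes": ["p1"]}),
}

# A GRBench-style question-answer pair, stratified by difficulty tier.
qa_example = {
    "question": "Who are the authors of Graph Chain-of-Thought?",
    "answer": "Alice Example",
    "difficulty": "easy",          # easy / medium / hard
    "domain": "academic",
}
```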
2. Benchmarking Graph-Augmented Reasoning
GRBench was designed to facilitate research on augmenting LLMs with graph reasoning abilities, explicitly addressing limitations of conventional retrieval-augmentation approaches that ignore the relational structure of real-world knowledge. Models using GRBench must reason not just over isolated texts, but by traversing edges and synthesizing attributes within a heterogeneous network.
This is particularly significant for studying LLM hallucinations in knowledge-intensive tasks: while text-only retrieval may prompt plausible but unfactual answers, the explicit requirement to traverse a graph or knowledge base anchors each reasoning step in observable, structured data, providing both reliability and transparency (Jin et al., 10 Apr 2024, Amayuelas et al., 18 Feb 2025).
Applications evaluated using GRBench include academic literature and citation analysis, case law tracing, e-commerce recommendation, and structured biomedical QA, all requiring models to retrieve, filter, and combine information over multi-hop, multi-type graphs.
3. Graph-CoT and Sequential Agent Reasoning Frameworks
The reference implementation, Graph Chain-of-Thought (Graph-CoT), exemplifies model interaction with GRBench (Jin et al., 10 Apr 2024). The framework decomposes question answering into an iterative loop, with three sub-steps per iteration:
- LLM Reasoning: The model reflects on what additional graph information or node is required for the current reasoning step.
- LLM–Graph Interaction: Based on the preceding reasoning step, the model generates a function call to the graph, employing a small set of predefined APIs (a minimal sketch of such an interface follows this list):
  - `RetrieveNode(keyword)`
  - `NodeFeature(NodeID, FeatureName)`
  - `NodeDegree(NodeID, NeighborType)`
  - `NeighbourCheck(NodeID, NeighborType)`
- Graph Execution: The call is executed, and the retrieved knowledge is fed back into the LLM for further reasoning.
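A minimal Python sketch of the four graph-interaction functions over a toy dictionary-based graph is given below. The keyword-matching retrieval and the graph representation are simplifying assumptions; the released Graph-CoT implementation uses its own indexing and retrieval logic.

```python
# Sketch of the four Graph-CoT graph-interaction APIs over a toy graph.
# The exact-match keyword retrieval is a stand-in for the real retriever.

# node_id -> {"features": {...}, "neighbors": {relation -> [node_ids]}}
GRAPH = {
    "p1": {"features": {"title": "Graph Chain-of-Thought", "year": "2024"},
           "neighbors": {"written-by": ["a1"]}},
    "a1": {"features": {"name": "Alice Example"},
           "neighbors": {"writes": ["p1"]}},
}

def RetrieveNode(keyword: str) -> str | None:
    """Return the id of the first node whose features mention the keyword."""
    for node_id, node in GRAPH.items():
        if any(keyword.lower() in str(v).lower() for v in node["features"].values()):
            return node_id
    return None

def NodeFeature(node_id: str, feature_name: str) -> str | None:
    """Return one textual attribute of a node."""
    return GRAPH[node_id]["features"].get(feature_name)

def NodeDegree(node_id: str, neighbor_type: str) -> int:
    """Return how many neighbors of a given relation type a node has."""
    return len(GRAPH[node_id]["neighbors"].get(neighbor_type, []))

def NeighbourCheck(node_id: str, neighbor_type: str) -> list[str]:
    """Return the ids of neighbors reached via a given relation type."""
    return GRAPH[node_id]["neighbors"].get(neighbor_type, [])

# One reasoning iteration: locate the paper, then follow the written-by edge.
paper = RetrieveNode("Graph Chain-of-Thought")
authors = NeighbourCheck(paper, "written-by")
print([NodeFeature(a, "name") for a in authors])   # -> ['Alice Example']
```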
This agent-like protocol continues until a conclusive answer is formulated, producing a transparent reasoning trace. Compared to baseline methods—such as standard LLMs without external retrieval, or retrieval-augmented LLMs with linearized text or sub-graphs—Graph-CoT demonstrates consistent improvements in both exact match (EM) and GPT-4–based scores across domains.
Recent research further extends this setup with more sophisticated strategies:
- Chain-of-Thought (CoT): Sequential, step-by-step reasoning where each intermediate state is grounded through KG actions.
- Tree-of-Thought (ToT): Parallel exploration of multiple reasoning paths via breadth-first expansion and candidate state evaluation.
- Graph-of-Thought (GoT): Merging and aggregating multiple reasoning paths through an aggregation transformation that captures broader relational structures (Amayuelas et al., 18 Feb 2025). A minimal sketch of the breadth-first (ToT-style) expansion follows this list.
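The sketch below illustrates Tree-of-Thought-style breadth-first expansion over graph-grounded reasoning states. The proposal, scoring, and action-execution functions are hypothetical placeholders for LLM and graph calls, not part of any released implementation.

```python
# Illustrative sketch of Tree-of-Thought-style breadth-first expansion over
# graph-grounded reasoning states. `propose_actions`, `score_state`, and
# `apply_action` are hypothetical placeholders for LLM and graph calls.

def propose_actions(state: dict) -> list:
    """Stand-in for an LLM call proposing candidate graph actions."""
    return state.get("candidate_actions", [])

def score_state(state: dict) -> float:
    """Stand-in for an LLM-based evaluation of how promising a state is."""
    return state.get("score", 0.0)

def apply_action(state: dict, action) -> dict:
    """Stand-in for executing one graph action and returning the new state."""
    return {"history": state.get("history", []) + [action],
            "score": 0.0,
            "candidate_actions": []}

def tree_of_thought_search(initial_state: dict, depth: int = 3, beam: int = 2) -> dict:
    """Expand states level by level, keeping the `beam` best states per level."""
    frontier = [initial_state]
    for _ in range(depth):
        expanded = [apply_action(s, a) for s in frontier for a in propose_actions(s)]
        if not expanded:
            break
        frontier = sorted(expanded, key=score_state, reverse=True)[:beam]
    return max(frontier, key=score_state)

best = tree_of_thought_search({"candidate_actions": ["RetrieveNode('x')"], "score": 0.1})
```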
GraphRunner introduces a further evolution, operating in three sequential stages: planning, verification, and execution. This scheme generates a holistic, multi-hop plan before execution and verifies it for consistency, thereby suppressing the reasoning and hallucination errors typical of single-hop, tightly coupled LLM-step frameworks (Kashmira et al., 11 Jul 2025).
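A schematic sketch of such a plan-verify-execute pipeline is shown below. The plan format, the schema check, and all function names are illustrative assumptions rather than GraphRunner's actual interfaces.

```python
# Schematic sketch of a plan-verify-execute pipeline in the spirit of
# GraphRunner. The plan format, schema check, and function names are
# illustrative assumptions, not GraphRunner's actual interfaces.

def plan_traversal(question: str, schema: dict) -> list[dict]:
    """Stand-in for an LLM call that drafts a holistic multi-hop traversal plan."""
    return [{"op": "RetrieveNode", "args": [question]},
            {"op": "NeighbourCheck", "args": ["<node>", "written-by"]}]

def verify_plan(plan: list[dict], schema: dict) -> bool:
    """Check every planned hop against the graph schema before execution,
    rejecting relation types that do not exist (a structural guard against
    hallucinated traversals)."""
    valid_relations = set(schema.get("relations", []))
    return all(step["args"][1] in valid_relations
               for step in plan if step["op"] == "NeighbourCheck")

def execute_plan(plan: list[dict], graph: dict) -> str:
    """Stand-in for running the verified plan over the graph and composing an answer."""
    return " -> ".join(step["op"] for step in plan)

def answer(question: str, graph: dict, schema: dict) -> str:
    plan = plan_traversal(question, schema)
    if not verify_plan(plan, schema):
        # On verification failure, re-plan instead of executing an unsupported traversal.
        plan = plan_traversal(question, schema)
    return execute_plan(plan, graph)

print(answer("Who wrote paper X?", graph={}, schema={"relations": ["written-by"]}))
```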
4. Evaluation Protocols and Metrics
GRBench supports a rigorous evaluation protocol. Models are assessed along several dimensions (a minimal computation sketch follows this list):
- Accuracy: Measured by exact match (EM), ROUGE-L (R-L; longest-common-subsequence overlap with the ground truth), and GPT4Score (model-based judgement of factual alignment with graph-derived answers).
- Efficiency: In tasks such as those evaluated by GraphRunner, efficiency is quantified by response generation time and token-based inference cost, computed from per-million-token pricing as $\text{Cost} = \$30 \times (\text{Input Tokens}/1\text{M}) + r_{\text{out}} \times (\text{Output Tokens}/1\text{M})$, where $r_{\text{out}}$ is the corresponding per-million output-token rate.
- Reasoning Quality: Analysis of reasoning traces identifies errors such as hallucinations (unsupported statements) and step-wise reasoning failures (e.g., early or redundant traversal).
- Context Management: The multi-hop, high-fanout graphs ensure evaluation of context window handling, as context size restrictions can directly impact performance in text-based LLMs.
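The sketch below shows how the accuracy and cost metrics above can be computed. The exact-match normalization, the whitespace tokenization for ROUGE-L, and the per-million-token rates are simplifying assumptions, not the exact settings used in the cited evaluations.

```python
# Sketch of the accuracy and cost metrics described above. The per-token
# rates used here are placeholders, not the values used in the cited papers.

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the gold answer, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def rouge_l_f1(prediction: str, gold: str) -> float:
    """ROUGE-L F1 via longest common subsequence over whitespace tokens."""
    p, g = prediction.split(), gold.split()
    # Dynamic-programming LCS length.
    dp = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i, tp in enumerate(p, 1):
        for j, tg in enumerate(g, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if tp == tg else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[len(p)][len(g)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)

def inference_cost(input_tokens: int, output_tokens: int,
                   rate_in: float = 30.0, rate_out: float = 60.0) -> float:
    """Dollar cost from per-million-token rates (rates here are placeholders)."""
    return rate_in * input_tokens / 1e6 + rate_out * output_tokens / 1e6

print(exact_match("Alice Example", "alice example"))           # 1.0
print(round(rouge_l_f1("written by Alice", "Alice wrote it"), 3))
print(inference_cost(input_tokens=12_000, output_tokens=800))  # 0.408
```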
Related evaluations of graph-grounded agents, such as object navigation in embodied AI, use metrics including path length, success rate, success weighted by path length (SPL), and task-specific scores such as excluded candidate rate (ECR) and satisfied condition rate (SCR) (Wang et al., 15 Jul 2024).
5. Empirical Findings and Research Impact
Benchmarks using GRBench reveal several key trends:
- Graph-CoT and planning-verification frameworks significantly outperform text-only or single-hop graph retrieval baselines. For instance, performance improvements of 10–50% in GPT4Score and reductions in inference cost by 3.0–12.9× are reported for GraphRunner (Kashmira et al., 11 Jul 2025).
- The chain-of-thought protocol, when interleaved with graph API calls, allows both greater factual correctness and interpretability; Tree-of-Thought expansion provides additional performance gains by exploring multiple reasoning paths (Amayuelas et al., 18 Feb 2025).
- Models’ ability to manage graph traversal—such as correctly selecting between node attributes and edge relationships—correlates strongly with accuracy on “medium” and “hard” question subsets.
- Explicit verification stages, as in GraphRunner, are particularly effective at hallucination mitigation, as verifying the holistic traversal plan against the graph’s structure prevents unsupported statements from reaching execution.
GRBench’s challenging, heterogeneous structure supports detailed error analysis, making it possible to diagnose whether failures are attributable to reasoning, context truncation, or inadequate graph exploration strategies.
6. Applications and Extensions
GRBench plays a central role in the study of graph-augmented LLMs and retrieval architectures. Identified application domains include:
- Academic literature analysis: Enabling question answering and expert finding over citation and collaboration networks.
- Legal reasoning systems: Supporting traceable argumentation in precedent-based networks.
- E-commerce recommendation: Facilitating context-aware product recommendations using item co-view and brand relations.
- Healthcare and biomedical research: Integrating disease, gene, and drug relationships in grounded biomedical QA.
Its domain diversity and extensible API interfaces support the construction of new tasks. The framework is further extensible to advanced graph reasoning paradigms (e.g., tree- and graph-of-thought protocols, hybrid symbolic–neural integration), and directly informs work on mitigating hallucinations in retrieval-augmented LLMs.
7. Availability and Community Adoption
GRBench is released as an open, reproducible benchmark. The dataset, along with reference code and documentation for the Graph-CoT framework, is accessible at https://github.com/PeterGriffinJin/Graph-CoT. It serves as the primary comparison point in several recent studies on graph-augmented language modeling and retrieval frameworks (Jin et al., 10 Apr 2024, Amayuelas et al., 18 Feb 2025, Kashmira et al., 11 Jul 2025). The transparent design, careful question annotation, and multi-domain coverage have led to its rapid adoption as a community standard for evaluating graph-based reasoning in both LLM and retrieval agent contexts.