- The paper demonstrates that current graph-language benchmarks inadequately assess multimodal reasoning, as unimodal approaches often match GLM performance.
- The paper introduces CLeGR, a synthetic benchmark with over 1,000 graphs and 54,000 questions designed to enforce true joint reasoning over graph structure and textual semantics.
- The empirical analysis, supported by a high Pearson correlation (r=0.9643) between linear-probe and full GLM accuracy and by CKA representation analysis, shows that complex graph encoders offer limited advantages over text-only baselines.
Rethinking Evaluation Paradigms for Graph-LLMs: Insights from CLeGR
Introduction
The integration of graph-structured data with natural language processing has led to the emergence of Graph-LLMs (GLMs), which aim to combine the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of LLMs. Despite rapid progress in model architectures, the evaluation of GLMs has largely relied on repurposed node-level classification datasets, which may not adequately assess multimodal reasoning. This paper systematically analyzes the limitations of current benchmarks and introduces the CLeGR (Compositional Language-Graph Reasoning) benchmark to rigorously evaluate the joint reasoning capabilities of GLMs.
Figure 1: Current graph-language benchmarks are insufficient for evaluating multimodal reasoning; unimodal approaches suffice for strong performance.
Limitations of Existing Benchmarks
The paper demonstrates that current graph-language benchmarks are fundamentally insufficient for evaluating multimodal reasoning. Through extensive experiments on six widely-used Text-Attributed Graph (TAG) datasets (Cora, CiteSeer, Computers, Photo, History, Arxiv), the authors show that strong performance can be achieved using unimodal information alone. Specifically, linear probing on graph tokens matches GLM performance on structurally-sufficient datasets, while soft-prompted LLMs using only text attributes achieve comparable results on semantically-sufficient datasets.
Figure 2: Linear probe accuracy closely matches full GLM performance on structurally-sufficient datasets, indicating the graph encoder captures all task-relevant information.
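This linear-probing protocol is simple to reproduce in outline: freeze a trained graph encoder, extract the graph-token embeddings it would pass to the LLM, and fit a linear classifier on them. The sketch below assumes a PyG-style graph object and a generic frozen encoder; it illustrates the idea rather than the paper's implementation.

```python
# Minimal linear-probe sketch (illustrative; the paper's exact pipeline may differ).
# `frozen_encoder` is assumed to be a trained graph encoder; `graph` is a
# PyG-style data object with node features `x` and an `edge_index` tensor.
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_graph_tokens(frozen_encoder, graph):
    """Run the frozen encoder and return per-node embeddings (graph tokens)."""
    frozen_encoder.eval()
    return frozen_encoder(graph.x, graph.edge_index).cpu().numpy()

def linear_probe_accuracy(frozen_encoder, graph, labels, train_idx, test_idx):
    """Fit a linear classifier on frozen graph tokens and report test accuracy."""
    z = extract_graph_tokens(frozen_encoder, graph)
    clf = LogisticRegression(max_iter=2000)          # linear head only
    clf.fit(z[train_idx], labels[train_idx])
    return clf.score(z[test_idx], labels[test_idx])  # compare against full GLM accuracy
```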
This finding is supported by a high Pearson correlation (r=0.9643) between linear probe and GLM accuracy, suggesting that the LLM component in GLMs often acts as an expensive decoder head rather than contributing to multimodal integration. The analysis categorizes datasets into semantically-sufficient (where text alone suffices) and structurally-sufficient (where graph structure dominates), revealing a lack of benchmarks that require true integration of both modalities.
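Once per-dataset accuracies are collected, the correlation check itself takes only a few lines; the accuracy values below are illustrative placeholders, not the reported numbers.

```python
# Correlate linear-probe accuracy with full GLM accuracy across datasets.
# The values below are illustrative placeholders, not reported results.
from scipy.stats import pearsonr

probe_acc = [0.80, 0.72, 0.88, 0.91, 0.84, 0.70]  # one entry per TAG dataset
glm_acc   = [0.82, 0.73, 0.89, 0.92, 0.85, 0.71]

r, p_value = pearsonr(probe_acc, glm_acc)
print(f"Pearson r = {r:.4f} (p = {p_value:.3g})")  # the paper reports r = 0.9643
```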
The CLeGR Benchmark: Design and Motivation
To address the evaluation gap, the paper introduces CLeGR, a synthetic benchmark explicitly constructed to require joint reasoning over graph structure and textual semantics. CLeGR comprises over 1,000 diverse graphs and 54,000 questions, spanning factual recall (CLeGR-Facts) and compositional reasoning (CLeGR-Reasoning) tasks. The benchmark is designed to preclude unimodal solutions by enforcing structural dependency, semantic grounding, and compositional complexity.
Figure 3: CLeGR evaluation framework and benchmark structure, covering factual and compositional reasoning tasks across multiple reasoning types and scopes.
CLeGR-Reasoning systematically covers filtering, aggregation, path reasoning, and topology tasks, each requiring multi-step inference that blends property lookup with logical graph traversal. The synthetic nature of the graphs eliminates pre-training confounds, ensuring that models cannot rely on memorized knowledge.
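To make the task design concrete, the toy generator below builds a small subway-style graph and poses a question that mixes a textual filter (accessibility) with a structural constraint (hop distance), in the spirit of CLeGR's filtering and path tasks. The attributes, question template, and generation logic are hypothetical and do not reflect the benchmark's actual generator.

```python
# Toy generator for a compositional graph-QA item (hypothetical; not CLeGR's code).
# Answering requires both textual attributes (accessibility) and structure (hops).
import networkx as nx

def build_toy_subway():
    g = nx.Graph()
    stations = {
        "Central": {"line": "red",  "accessible": True},
        "Harbor":  {"line": "red",  "accessible": False},
        "Museum":  {"line": "blue", "accessible": True},
        "Airport": {"line": "blue", "accessible": True},
    }
    g.add_nodes_from(stations.items())
    g.add_edges_from([("Central", "Harbor"), ("Harbor", "Museum"), ("Museum", "Airport")])
    return g

def make_question(g, source="Central", max_hops=2):
    """Filter by a node attribute, then constrain the answer by graph distance."""
    within = nx.single_source_shortest_path_length(g, source, cutoff=max_hops)
    answer = sorted(n for n, d in within.items()
                    if n != source and g.nodes[n]["accessible"])
    question = (f"Which accessible stations are reachable from {source} "
                f"within {max_hops} stops?")
    return question, answer

q, a = make_question(build_toy_subway())
print(q)  # Which accessible stations are reachable from Central within 2 stops?
print(a)  # ['Museum'] -- Harbor is closer but not accessible
```

Answering such a question from text alone requires the structural neighborhood, while the structure alone cannot resolve the accessibility filter, which is exactly the dependency the benchmark aims to enforce.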
Empirical Evaluation of GLMs on CLeGR
The evaluation of representative GLM architectures (TEA-GLM, GraphToken, G-Retriever) and soft-prompted LLM baselines on CLeGR reveals several critical findings: GLM performance remains close to that of soft-prompted LLM baselines on both factual and compositional tasks, and zero-shot transfer across domains provides no additional benefit (Figure 5).
Figure 5: Zero-shot generalization from subway to computer network domains shows no transfer benefit for GLMs over soft-prompted approaches.
These results challenge the architectural necessity of incorporating graph structure into LLMs for multimodal reasoning, as current GLMs revert to unimodal textual processing even when provided with explicit structural information.
Representation Analysis via CKA
To further investigate the underlying cause of this performance parity, the paper employs Centered Kernel Alignment (CKA) to measure representational overlap between GLMs and soft-prompted LLMs. On semantically-sufficient datasets and on CLeGR tasks, the two models maintain high CKA across all layers, indicating near-identical internal representations. Only on structurally-sufficient datasets do the representations diverge in the middle layers, consistent with the weaker performance of text-only soft-prompted baselines on those datasets.
Figure 6: CKA analysis shows strong alignment of representations when performance is similar; divergence occurs only in structurally-sufficient datasets.
This suggests that GLMs learn distinct representations only when the dataset is structurally-sufficient and the LLM's semantic reasoning is underutilized.
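For reference, linear CKA between two activation matrices can be computed directly from their centered features; the sketch below uses the standard linear-CKA formulation and is not tied to the paper's exact implementation.

```python
# Linear CKA between two activation matrices X (n x d1) and Y (n x d2),
# e.g. layer-wise hidden states of a GLM and a soft-prompted LLM on the same inputs.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Feature-space linear CKA with column centering."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, ord="fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, ord="fro") *
                          np.linalg.norm(y.T @ y, ord="fro")))

# Usage: compute per layer and look for divergence in the middle layers.
rng = np.random.default_rng(0)
print(linear_cka(rng.normal(size=(128, 64)), rng.normal(size=(128, 64))))  # placeholder activations
```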
Implementation and Experimental Considerations
The paper provides detailed implementation protocols for GLMs and baselines, including:
- Model architectures: TEA-GLM, GraphToken (with GraphSAGE or GAT encoders), G-Retriever, and soft-prompted LLMs (Llama3-8B, Phi3-3.5B, Phi4-14B).
- Training setup: Consistent hardware (NVIDIA A100 80GB GPUs), batch sizes, and learning rates across models, with greedy decoding at inference.
- Evaluation metrics: Overall accuracy, F1-score, MCC, MAE, RMSE, and set-based precision/recall for different answer types (a minimal computation sketch follows this list).
- Prompt engineering: Structured prompts for both node classification and graph QA tasks, with explicit output format suffixes.
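These metrics are standard; the sketch below shows one way to compute them with scikit-learn, plus a simple set-based precision/recall for list-valued answers. The predictions are hypothetical placeholders, not model outputs from the paper.

```python
# Illustrative metric computation for graph-QA outputs (hypothetical predictions).
import numpy as np
from sklearn.metrics import (f1_score, matthews_corrcoef,
                             mean_absolute_error, mean_squared_error)

# Categorical answers (e.g., yes/no or label questions)
y_true_cls = ["red", "blue", "red", "blue"]
y_pred_cls = ["red", "blue", "blue", "blue"]
print("F1 (macro):", f1_score(y_true_cls, y_pred_cls, average="macro"))
print("MCC:", matthews_corrcoef(y_true_cls, y_pred_cls))

# Numeric answers (e.g., counts from aggregation questions)
y_true_num, y_pred_num = np.array([3, 5, 2]), np.array([3, 4, 2])
print("MAE:", mean_absolute_error(y_true_num, y_pred_num))
print("RMSE:", np.sqrt(mean_squared_error(y_true_num, y_pred_num)))

# Set-valued answers (e.g., lists of node names)
def set_precision_recall(pred: set, gold: set):
    tp = len(pred & gold)
    return (tp / len(pred) if pred else 0.0,
            tp / len(gold) if gold else 0.0)

print(set_precision_recall({"Central", "Museum"}, {"Museum", "Airport"}))  # (0.5, 0.5)
```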
The CLeGR dataset is publicly available, enabling reproducibility and further research.
Implications and Future Directions
The findings have significant implications for the development and evaluation of GLMs:
- Benchmark design: There is a critical need for benchmarks that require genuine multimodal integration, as current datasets are insufficient.
- Model architecture: The results question the utility of complex graph encoders in GLMs, suggesting that architectural innovation should focus on mechanisms that enforce cross-modal interaction.
- Generalization claims: The lack of zero-shot transfer benefits undermines claims of superior generalization for GLMs, highlighting the need for more robust evaluation protocols.
- Representation learning: Future work should explore methods that explicitly align and fuse graph and language representations, potentially leveraging cross-modal attention or joint training objectives (a generic sketch of one such fusion mechanism follows this list).
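As one illustration of what such a mechanism could look like, the sketch below lets text hidden states attend over graph-token embeddings via cross-attention with a residual connection. It is a generic design sketch under assumed dimensions, not an architecture proposed or evaluated in the paper.

```python
# Generic cross-modal fusion sketch: text hidden states attend over graph tokens.
# Dimensions and placement inside a GLM are assumptions, not the paper's design.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_states: torch.Tensor, graph_tokens: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, seq_len, d_model); graph_tokens: (batch, n_nodes, d_model)
        fused, _ = self.attn(query=text_states, key=graph_tokens, value=graph_tokens)
        return self.norm(text_states + fused)  # residual keeps the original text semantics

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 16, 768), torch.randn(2, 10, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```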
The CLeGR benchmark provides a foundation for advancing research in explicit multimodal reasoning involving graph structure and language.
Conclusion
This paper presents a rigorous analysis of the limitations of current graph-language benchmarks and introduces the CLeGR benchmark to evaluate multimodal reasoning. The empirical results demonstrate that existing GLMs do not effectively integrate graph and language modalities, as unimodal baselines suffice for strong performance. The paper calls for a paradigm shift in both benchmark design and model architecture, emphasizing the need for explicit multimodal integration to realize the full potential of graph-LLMs.