Large Language Models on Graphs
- Large language models on graphs are methods that integrate transformer-based text understanding with graph-structured data to enhance embeddings and predictions.
- They employ innovative techniques such as graph linearization, adapter fusion, and retrieval-augmented generation to address challenges like context limits and structural gaps.
- They improve practical tasks including node classification, link prediction, and dynamic analysis by bridging natural language processing with graph computation.
LLMs on graphs refer to the integration, adaptation, and application of neural LLMs—especially transformer-based architectures pretrained on vast corpora of natural language—directly to graph-structured data. This endeavor merges the text-understanding capacity and world knowledge of LLMs with the relational, non-Euclidean structure of graph data. Major research directions include: (i) developing graph representations that LLMs can process, (ii) augmenting LLMs with explicit graph-computation capabilities, (iii) interfacing LLMs and graph neural networks (GNNs), and (iv) addressing fundamental challenges in graph learning such as data incompleteness, imbalance, domain heterogeneity, and dynamics by leveraging LLM-driven approaches. The resulting models enable enhanced graph embedding, link prediction, node classification, cross-modal reasoning, and professional analysis pipelines on both text-attributed and pure-structure graphs.
1. Architectural Paradigms for LLMs on Graphs
There are four dominant classes of architectures for deploying LLMs on graph tasks (Ren et al., 2024, Jin et al., 2023):
1. GNNs as Prefix: A GNN encodes the graph structure into embeddings, which are then provided as soft prompts or prefix tokens to the LLM. The LLM attends over both the structural tokens and any node or edge text to perform the final prediction. Notably, parameter-efficient schemes such as GPEFT use a trainable GNN prompt encoder and LoRA/PEFT-style adapters atop a frozen LLM, enabling efficient fine-tuning at scale (Zhu et al., 2024).
2. LLMs as Prefix: LLMs process node or edge text to produce embeddings or pseudo-labels, which are then used to initialize or supervise the GNN. This strategy injects world knowledge into graph learning, and enables downstream zero-shot or few-shot task adaptation.
3. LLM–Graph Integration: LLM and GNN modules are trained jointly via architecture fusion (e.g., cross-modal attention, interleaving transformer/GNN layers) or cross-modal contrastive/pseudo-label alignment. Examples include chained adapters, fusion modules, agentic tool use as in GraphChain (Wei et al., 1 Nov 2025), and deep contrastive pipelines.
4. LLMs-Only/Pure Sequence Methods: The graph—either in whole or as subgraph neighborhoods—is serialized into a linear token sequence for the LLM to process (graph linearization); predictions are then made via prompt-driven or fine-tuned language modeling (Xypolopoulos et al., 2024). This enables end-to-end graph reasoning with no auxiliary graph module, at the cost of severe input-length constraints for large graphs.
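The LLMs-only paradigm can be illustrated with a minimal linearization sketch. This is a toy example, not any paper's exact procedure: relabeling nodes by descending degree stands in for the centrality/degeneracy-based orderings used in practice, and edges are emitted as (src, dst, weight) triplets for the prompt.

```python
# Toy graph linearization: relabel nodes by descending degree
# (a simple proxy for centrality-based ordering), then serialize
# edges as "(src,dst,weight)" triplets the LLM can consume.

def linearize(edges):
    """edges: list of (u, v, w) tuples; returns one prompt string."""
    # Compute degrees to order nodes by structural importance.
    degree = {}
    for u, v, _ in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Relabel: the most connected node becomes 0, the next 1, ...
    order = sorted(degree, key=lambda n: (-degree[n], n))
    relabel = {node: i for i, node in enumerate(order)}
    # Sort by the new labels so structurally close edges are adjacent tokens.
    triplets = sorted((relabel[u], relabel[v], w) for u, v, w in edges)
    return " ".join(f"({u},{v},{w})" for u, v, w in triplets)

# Hub node "b" (highest degree) receives label 0.
prompt = linearize([("a", "b", 1), ("b", "c", 2), ("b", "d", 1)])
```

The resulting string can be embedded directly in a prediction prompt; the ordering choice matters because it controls which edges land near each other within the LLM's attention window.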
A systematic summary of strengths and limitations for these paradigms appears in Table 1 (adapted from Ren et al., 2024):
| Framework | Structural Bias | Scalability | Textual Modeling |
|---|---|---|---|
| GNNs as Prefix | strong | moderate | direct LLM integration |
| LLMs as Prefix | moderate | high | strong world knowledge |
| Integration | strong | moderate | bi-directional knowledge |
| LLMs-Only | weak | low | pure text/serialization |
2. Encoding Graph Structure for LLM Consumption
Three primary strategies enable graph structure to be ingested by LLMs:
A. Graph Linearization: Graphs are mapped to token sequences, typically as edge lists (with centrality/degeneracy-based ordering and node relabeling to maximize local dependency and global alignment), triplet structures, or natural language descriptions (Xypolopoulos et al., 2024, Kyaw et al., 22 Mar 2026, Ouyang et al., 2024). Graph projection as (src, dst, weight) triplets suffices for a broad range of tasks, as in GUNDAM, which achieves superior results over GPT-4 using only simple serializations plus chain-of-thought (CoT) training (Ouyang et al., 2024).
B. Textualization of Local Structure: In text-attributed graphs, textual information from nodes and their h-hop neighborhoods is verbalized via hierarchical or soft-prompt compression (HiCom (Zhang et al., 2024)), neighborhood attribute summaries, or neighborhood sampling (two-stage in LPNL (Bi et al., 2024)), producing prompts whose length adheres to LLM context constraints.
C. Adapter and Fusion Techniques: GNN-derived features are mapped to the LLM’s embedding space via lightweight adapters, allowing attention and fusion at the model layer level (Zhu et al., 2024, Ren et al., 2024).
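The adapter idea can be sketched as a learned linear projection from the GNN embedding space into the LLM token-embedding space, one "soft token" per node. The weights below are placeholders for trained parameters; the dimensions and prefix construction are illustrative assumptions, not a specific system's design.

```python
# Minimal adapter sketch: project GNN node embeddings (dim d_g) into
# the LLM embedding space (dim d_l) so they can serve as prefix tokens.
# W (d_l x d_g) and b (d_l) stand in for trained adapter parameters.

def adapter(gnn_embedding, W, b):
    """Project a d_g-dim GNN vector to a d_l-dim LLM-space vector."""
    return [sum(W[i][j] * x for j, x in enumerate(gnn_embedding)) + b[i]
            for i in range(len(b))]

def build_prefix(node_embeddings, W, b):
    # One projected soft token per node; the frozen LLM attends over
    # these alongside the ordinary text tokens.
    return [adapter(e, W, b) for e in node_embeddings]

# Toy example: project 2-d GNN vectors into a 3-d "LLM" space.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
prefix = build_prefix([[1.0, 2.0], [3.0, 4.0]], W, b)
```

In PEFT-style setups such as GPEFT, only parameters like W and b (plus low-rank LLM adapters) are updated, while the billion-parameter LLM stays frozen.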
3. Model Training and Inference Workflows
A wide range of training, fine-tuning, and inference strategies have been developed for LLMs on graph data:
- Parameter-Efficient Fine-Tuning (PEFT): Methods such as LoRA or prefix tuning train only a few percent of LLM parameters, combined with a small GNN prompt encoder. This approach allows “frozen” billion-parameter LLMs to produce high-quality graph embeddings with small additional cost, as demonstrated in GPEFT for link prediction and node retrieval (Zhu et al., 2024).
- Hierarchical Compression and Soft-Prompting: For dense, text-rich graphs, hierarchical schemes recursively compress neighbor texts into fixed-length vectors using learnable prompts and LLM forward passes, circumventing the quadratic attention cost of direct neighborhood concatenation (Zhang et al., 2024).
- Retrieval-Augmented Generation (RAG) and In-Context Learning: Graph-guided retrieval injects relevant node/neighbor text or labels into LLM prompts, greatly improving in-context prediction accuracy over vanilla few-shot or zero-shot RAG that uses text retrieval alone (Li et al., 19 Feb 2025). FEWSHOTRAG, which contextualizes a node with (text, label) pairs from its neighborhood, yields accuracy competitive with standard GNNs in homophilic settings.
- Chained Tool Use and Reasoning: LLMs can be orchestrated to dynamically chain together graph-analysis tools (e.g., NetworkX functions) under policy optimization to enable scalable multi-step reasoning over massive graphs, as in GraphChain (Wei et al., 1 Nov 2025). This approach leverages RL to plan sequences of tool calls, incorporates structure-aware adapters for domain transfer, and achieves superior scalability and task success compared to prompt-only methods.
- Graph Reasoning via CoT (Chain-of-Thought) Annotation: Generating stepwise reasoning paths with algorithmic correctness (as in GUNDAM (Ouyang et al., 2024)) and tuning models on these chains empirically improves graph-based logical reasoning.
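The CoT-annotation strategy above relies on generating reasoning chains that are algorithmically correct by construction. A minimal sketch, under the assumption that a BFS trace verbalized step by step serves as the training chain (the exact chain format in GUNDAM differs):

```python
# Sketch of algorithmically correct CoT annotation: a BFS answers a
# connectivity query, and each visited step is verbalized into a
# reasoning chain that can be used as supervised training text.

from collections import deque

def connectivity_cot(adj, src, dst):
    """adj: dict node -> list of neighbors. Returns (chain_text, connected)."""
    steps, seen, queue = [], {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            steps.append(f"Reached {dst}: connected.")
            return "\n".join(steps), True
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                steps.append(f"From {node} we can reach {nxt}.")
    steps.append(f"No path reaches {dst}: not connected.")
    return "\n".join(steps), False

chain, connected = connectivity_cot({"a": ["b"], "b": ["c"], "c": []}, "a", "c")
```

Because the chain is derived from an exact algorithm rather than sampled from a model, every training example is guaranteed to reason correctly, which is the property the tuning exploits.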
4. Application Domains and Empirical Performance
LLMs have been evaluated on a spectrum of graph tasks, including:
- Node and Edge Classification: Enhanced node embeddings from LLMs (often via SBERT/E5 or LoRA-tuned LLMs) matched or exceeded standard GNN baselines on text-rich datasets under both low- and high-label regimes (Chen et al., 2023, Ren et al., 2024). HiCom outperformed all prior methods on Amazon and MAG (Zhang et al., 2024).
- Link Prediction: GPEFT and LPNL frameworks yield strong improvements (up to +30% Hit@1 over HGT) on large-scale author disambiguation/link prediction tasks, using sampling-controlled prompts and self-supervised fine-tuning (Zhu et al., 2024, Bi et al., 2024).
- Graph Reasoning: On standard logical and combinatorial graph tasks (connectivity, cycle, flow), pure LLMs (GPT-4) outperform open-source alternatives, but lag behind specialized, alignment-tuned architectures like GUNDAM (Ouyang et al., 2024).
- Large-Scale Graph Analysis: GraphChain achieved >80% accuracy on graphs up to 200K nodes, exceeding prior tool-invocation and prompt-only baselines by >20 points (Wei et al., 1 Nov 2025).
- Adversarial Robustness: LLM feature pipelines enhance resilience to adversarial structural or textual perturbations relative to shallow models or GNNs with bag-of-words features, due to richer class separation and reduced attack-induced drift (Guo et al., 2024).
Empirical performance depends on context length, graph density, and the homophily of the given graph. Graph-guided in-context learning (e.g., FEWSHOTRAG, LABELRAG) closes much of the gap between LLM-only and GNN baselines in homophilic settings, while failures persist in highly heterophilic or structurally complex graphs (discussed below).
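A graph-guided few-shot prompt in the spirit of FEWSHOTRAG can be sketched as follows; the instruction wording and neighbor format are illustrative assumptions, not the paper's exact template.

```python
# Sketch of graph-guided in-context prompting: the target node's text
# is contextualized with (text, label) pairs retrieved from its own
# neighborhood, rather than from generic text retrieval.

def few_shot_graph_prompt(target_text, neighbors, k=3):
    """neighbors: list of (text, label) pairs from the target's neighborhood."""
    lines = ["Classify the node given its labeled graph neighbors."]
    for text, label in neighbors[:k]:
        lines.append(f"Neighbor: {text} -> Label: {label}")
    lines.append(f"Node: {target_text} -> Label:")
    return "\n".join(lines)

p = few_shot_graph_prompt(
    "Graph attention networks for citation data",
    [("Semi-supervised node classification", "GraphML"),
     ("Neural machine translation", "NLP")])
```

Under homophily, the retrieved neighbor labels are strongly predictive of the target label, which is why this simple contextualization recovers much of a GNN's advantage.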
5. Addressing Fundamental Graph Challenges with LLMs
A comprehensive survey identifies how LLMs address or mitigate four core obstacles in practical graph learning (Li et al., 24 May 2025):
- Incompleteness: LLM-driven imputation (generating missing attributes via prompts, predicting edges through natural-language queries, or completing edges with chain-of-thought reasoning) improves accuracy by 5–10% over variational GNNs, especially in few-shot setups.
- Imbalance: LLMs enable semantic augmentation for minority classes (synthetic node/sample generation), contextual debiasing (zero-shot prompts), and knowledge-injection (retrieval-augmented prompts), achieving +8% AUC in severely imbalanced networks.
- Cross-Domain Heterogeneity: LLMs serve as textual/semantic bridges in multi-modal graphs, aligning disparate node/edge spaces by restructuring multimodal or cross-lingual graph attributes into unified tokens, fusion modules, or structure-to-text descriptors (Kyaw et al., 22 Mar 2026).
- Dynamic Instability: LLMs parse evolving graphs via temporal prompt chains, generate future knowledge graph triples, and induce temporal prediction rules via in-context and retrieval-augmented pipelines, yielding significant performance gains over static GNNs (e.g., +20% link prediction accuracy) in dynamic settings.
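The prompt-based attribute imputation surveyed above can be sketched as a prompt builder; the format is purely illustrative, and the actual LLM call is omitted.

```python
# Sketch of prompt-based imputation for an incomplete graph: a missing
# node attribute is requested from an LLM given the known attributes of
# the node's neighbors. The prompt template here is an assumption.

def imputation_prompt(node, neighbor_categories):
    """neighbor_categories: dict mapping neighbor id -> known category."""
    lines = [f"Node {node} has a missing 'category' attribute.",
             "Its graph neighbors have the following known categories:"]
    for nbr in sorted(neighbor_categories):
        lines.append(f"- Neighbor {nbr}: {neighbor_categories[nbr]}")
    lines.append("Based on these neighbors, state the most likely "
                 "category for the node.")
    return "\n".join(lines)

q = imputation_prompt("v7", {"v1": "physics", "v3": "physics"})
```

The LLM's answer would then be written back as the imputed attribute, optionally gated by a confidence check before it is used for downstream training.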
6. Current Limitations, Scalability, and Open Challenges
Despite transformative progress, significant challenges remain:
- Context Window and Input Size Constraints: All methods relying on full-graph serialization or neighborhood expansion are hard-limited by the LLM's context window and suffer from quadratic attention scaling; strategies such as hierarchical compression (HiCom), prompt sampling, or divide-and-conquer partitioning (LPNL) partly alleviate but do not remove these bottlenecks (Zhang et al., 2024, Bi et al., 2024).
- Structural Reasoning Gaps: Even the strongest models, such as GPT-4, fall short in multi-answer enumeration, complex path or cycle detection, and generalization to large pure-structure graphs without further fine-tuning or explicit alignment (Liu et al., 2023, Ouyang et al., 2024).
- Graph Hallucination and Fidelity: Success in single-task or short-answer settings often vanishes in more elaborate, multi-step tasks; hallucinations and inconsistent answers remain a major obstacle for critical pipeline deployment (Liu et al., 2023).
- Scalability vs. Expressiveness Tradeoff: Deep integration pipelines (joint LLM-GNN fusion) impose significant computational cost, while prompt-based and adapter methods are more efficient but may underleverage structure (Ren et al., 2024).
- Heterophily and Out-of-Distribution Limits: Homophily is a core assumption in many prompt-guided or RAG-based LLM frameworks; performance collapses on heterophilic graphs unless edge discrimination and message reweighting are adapted with explicit LLM support (Wu et al., 2024).
- Domain Adaptation and Continual Learning: Efficient transfer to new graph domains (e.g., cross-lingual, multi-modal synthesis) and resilience to dynamic or evolving topologies are ongoing research targets (Kyaw et al., 22 Mar 2026, Li et al., 24 May 2025).
7. Future Directions and Benchmarks
Key research priorities identified across recent surveys and empirical studies include:
- Development of graph-specific benchmarks emphasizing multi-modal, dynamic, and large-scale graph tasks with LLM-integration (Ren et al., 2024, Li et al., 24 May 2025).
- Efficiency-focused architectures: Parameter-efficient LLM adaptation (LoRA, PEFT, adapters), compressed prompt injection, and scalable retrieval modules for million-node graphs (Zhu et al., 2024, Bi et al., 2024).
- Interpretability: Visualization of graph-attention, CoT step tracing, and counterfactual editing to explain and trust LLM-driven inferences (Li et al., 24 May 2025).
- Theoretical understanding: Analysis of inductive bias and generalization properties in hybrid transformer–message-passing frameworks, and alignment between latent spaces of text, structure, and supervision (Ren et al., 2024, Jin et al., 2023).
- Exploration of interactive graph agents, multi-turn QA, and cross-modal pipelines (notably for bioinformatics, finance, and transportation) (Li et al., 24 May 2025).
- Advancements in continual learning and adaptation: lifelong, domain-adaptive LLM-GNN systems for temporally drifting graphs (Li et al., 24 May 2025).
References
- (Zhu et al., 2024) Parameter-Efficient Tuning LLMs for Graph Representation Learning
- (Li et al., 24 May 2025) Using LLMs to Tackle Fundamental Challenges in Graph Learning: A Comprehensive Survey
- (Liu et al., 2023) Evaluating LLMs on Graphs: Performance Insights and Comparative Analysis
- (Ouyang et al., 2024) GUNDAM: Aligning LLMs with Graph Understanding
- (Wei et al., 1 Nov 2025) GraphChain: LLMs for Large-scale Graph Analysis via Tool Chaining
- (Bi et al., 2024) LPNL: Scalable Link Prediction with LLMs
- (Ren et al., 2024) A Survey of LLMs for Graphs
- (Jin et al., 2023) LLMs on Graphs: A Comprehensive Survey
- (Zhang et al., 2024) Hierarchical Compression of Text-Rich Graphs via LLMs
- (Wu et al., 2024) Exploring the Potential of LLMs for Heterophilic Graphs
- (Li et al., 19 Feb 2025) Are LLMs In-Context Graph Learners?
- (Xypolopoulos et al., 2024) Graph Linearization Methods for Reasoning on Graphs with LLMs
- (Guo et al., 2024) Learning on Graphs with LLMs: A Deep Dive into Model Robustness
- (Chen et al., 2023) Exploring the Potential of LLMs in Learning on Graphs
- (Kyaw et al., 22 Mar 2026) Graph Fusion Across Languages using LLMs
- (Sun et al., 2023) LLMs as Topological Structure Enhancers for Text-Attributed Graphs
- (Jaiswal et al., 2024) All Against Some: Efficient Integration of LLMs for Message Passing in Graph Neural Networks