Talk like a Graph: Encoding Graphs for Large Language Models (2310.04560v1)

Published 6 Oct 2023 in cs.LG

Abstract: Graphs are a powerful tool for representing and analyzing complex relationships in real-world applications such as social networks, recommender systems, and computational finance. Reasoning on graphs is essential for drawing inferences about the relationships between entities in a complex system, and to identify hidden patterns and trends. Despite the remarkable progress in automated reasoning with natural text, reasoning on graphs with LLMs remains an understudied problem. In this work, we perform the first comprehensive study of encoding graph-structured data as text for consumption by LLMs. We show that LLM performance on graph reasoning tasks varies on three fundamental levels: (1) the graph encoding method, (2) the nature of the graph task itself, and (3) interestingly, the very structure of the graph considered. These novel results provide valuable insight on strategies for encoding graphs as text. Using these insights we illustrate how the correct choice of encoders can boost performance on graph reasoning tasks inside LLMs by 4.8% to 61.8%, depending on the task.

Summary

  • The paper introduces a novel framework that converts graph structures into text, enabling LLMs to perform graph-based tasks.
  • It demonstrates that incident encoding and tailored prompt engineering significantly enhance performance on graph reasoning tasks.
  • Results reveal that LLMs struggle with global graph properties, underscoring limitations and pointing to the need for hybrid model approaches.

Encoding Graphs for LLMs: A Comprehensive Study

Introduction

The paper "Talk like a Graph: Encoding Graphs for LLMs" (2310.04560) presents a systematic investigation into the problem of representing graph-structured data as text for consumption by LLMs. The paper addresses a critical gap: while LLMs have demonstrated strong performance on a variety of text-based reasoning tasks, their ability to reason over graph-structured data—ubiquitous in domains such as social networks, recommender systems, and knowledge graphs—remains underexplored. The authors introduce a new benchmark, GraphQA, and conduct extensive experiments to analyze how graph encoding, prompt engineering, and graph structure affect LLM performance on fundamental graph reasoning tasks. Figure 1

Figure 1: Overview of the framework for reasoning with graphs using LLMs, highlighting the modularity of graph encoding and prompt engineering.

Graph Encoding as Text: Methodological Framework

The core technical challenge addressed is the transformation of arbitrary graphs $G = (V, E)$ into textual sequences $W$ suitable for LLM input. The authors formalize this as the design of a graph encoding function $g: G \mapsto W$ and a question rephrasing function $q: W \mapsto W$, such that the LLM $f$ can be queried as $A = f(g(G), q(Q))$ for a question $Q$ about the graph.

Figure 2: Overview of the framework for encoding graphs via text, illustrating the mapping from graph structure to natural language representations.

The paper systematically explores a taxonomy of graph encoding strategies, varying both node and edge representations. Node encodings include integer indices, English names, character names from popular media, and alphabetic labels. Edge encodings range from explicit adjacency lists to natural language statements of relationships (e.g., "A and B are friends"). The authors also experiment with different prompt engineering heuristics, including zero-shot, few-shot, chain-of-thought (CoT), and build-a-graph (BAG) prompting.
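
To make the setup concrete, here is a minimal Python sketch of the formalization above, assuming a generic black-box `llm` completion function; the encoder names and prompt wording are illustrative, not the paper's exact text:

```python
import networkx as nx

def adjacency_encoding(G: nx.Graph) -> str:
    """g(G): edges listed as integer pairs (the most math-like style)."""
    nodes = ", ".join(str(n) for n in sorted(G.nodes()))
    edges = ", ".join(f"({u}, {v})" for u, v in G.edges())
    return f"G describes a graph among nodes {nodes}. The edges in G are: {edges}."

def friendship_encoding(G: nx.Graph, names: dict) -> str:
    """g(G): edges stated as natural-language facts about named people."""
    facts = " ".join(f"{names[u]} and {names[v]} are friends." for u, v in G.edges())
    return f"G describes a friendship graph. {facts}"

def answer(llm, G, encode, question: str) -> str:
    """A = f(g(G), q(Q)): concatenate the encoded graph and the question."""
    return llm(encode(G) + "\nQ: " + question + "\nA:")

G = nx.Graph([(0, 1), (1, 2)])
print(adjacency_encoding(G))   # "G describes a graph among nodes 0, 1, 2. ..."
print(friendship_encoding(G, {0: "James", 1: "Mary", 2: "Robert"}))
```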

Empirical Evaluation: GraphQA Benchmark

The GraphQA benchmark comprises a suite of basic graph tasks: edge existence, node degree, node count, edge count, connected nodes, cycle check, and disconnected nodes. These tasks are designed to probe both local and global graph reasoning capabilities of LLMs.
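
Each of these tasks has an easily computed ground truth against which LLM answers can be scored; the following is a small sketch using networkx, with the function name and dictionary layout chosen for illustration rather than taken from the benchmark's actual tooling:

```python
import networkx as nx

def graphqa_ground_truth(G: nx.Graph, u: int, v: int) -> dict:
    """Reference answers for the basic GraphQA-style tasks on G."""
    neighbors = set(G.neighbors(u))
    return {
        "edge_existence": G.has_edge(u, v),                  # local
        "node_degree": G.degree(u),                          # local
        "node_count": G.number_of_nodes(),                   # global
        "edge_count": G.number_of_edges(),                   # global
        "connected_nodes": sorted(neighbors),                # local
        "cycle_check": len(nx.cycle_basis(G)) > 0,           # global
        "disconnected_nodes": sorted(set(G.nodes()) - neighbors - {u}),
    }

G = nx.erdos_renyi_graph(n=8, p=0.3, seed=0)
print(graphqa_ground_truth(G, u=0, v=1))
```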

Key Findings

1. LLMs Underperform on Basic Graph Tasks

Across all evaluated models and tasks, LLMs exhibit poor performance on basic graph reasoning, often failing to surpass simple majority baselines, especially for tasks such as edge existence and cycle detection. This highlights a fundamental limitation in the ability of LLMs to perform even elementary graph computations when provided with naïve textual encodings.
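
For context, a majority baseline simply always returns the most frequent ground-truth answer; a short sketch (the 82/18 split below is invented purely for illustration):

```python
from collections import Counter

def majority_baseline_accuracy(answers: list) -> float:
    """Accuracy of always predicting the most common answer."""
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# If 82% of sampled graphs contain a cycle, always answering "yes"
# scores 0.82 on cycle check without reading the graph at all.
print(majority_baseline_accuracy(["yes"] * 82 + ["no"] * 18))  # 0.82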

2. Graph Encoding Function Critically Impacts Performance

The choice of graph encoding function $g(\cdot)$ has a substantial effect on LLM accuracy. For example, incident encoding (where each node lists its neighbors) outperforms adjacency encoding for tasks like node degree and connected nodes, as it places relevant information in closer textual proximity. Integer node encodings improve arithmetic tasks, while named node encodings are advantageous for tasks with non-integer outputs.
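
A sketch of the incident style, assuming a simple neighbor-list encoder (illustrative wording, not the paper's exact prompt text):

```python
import networkx as nx

def incident_encoding(G: nx.Graph) -> str:
    """One sentence per node listing its neighbors, so every fact
    needed for that node's degree sits together in the text."""
    lines = [
        f"Node {n} is connected to nodes {sorted(G.neighbors(n))}."
        for n in sorted(G.nodes())
    ]
    return "G describes a graph.\n" + "\n".join(lines)

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])
print(incident_encoding(G))
# "Node 2 is connected to nodes [0, 1, 3]." -- the degree of node 2
# (3) is readable from one line, unlike in an edge-pair listing.
```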

3. Prompt Engineering and Question Framing Matter

Prompting strategies significantly influence outcomes. Zero-shot prompting suffices for simple tasks, but few-shot and CoT prompting yield improvements for more complex queries. Notably, rephrasing questions in application-specific language (e.g., "How many friends does Alice have?") consistently outperforms abstract graph-theoretic formulations.
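
The prompt variants can be pictured as simple string templates; the wording below is illustrative, not the paper's exact prompts:

```python
encoding = ("G describes a friendship graph. "
            "Alice and Bob are friends. Alice and Carol are friends.")

# Abstract graph-theoretic phrasing vs. application-specific phrasing.
abstract_q = encoding + "\nQ: What is the degree of node Alice?\nA:"
applied_q  = encoding + "\nQ: How many friends does Alice have?\nA:"

# Zero-shot CoT appends a reasoning trigger to the question.
zero_shot_cot = applied_q.replace("\nA:", "\nA: Let's think step by step.")

# Few-shot prepends worked examples before the real question.
few_shot = ("G describes a friendship graph. Dan and Eve are friends.\n"
            "Q: How many friends does Dan have?\nA: 1.\n\n") + applied_q
```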

4. Model Capacity Correlates with Graph Reasoning Ability

Larger LLMs (e.g., PaLM 62B) demonstrate improved performance on graph tasks compared to smaller variants, but even the largest models do not consistently outperform majority baselines on all tasks. The effect of scale is more pronounced for tasks requiring aggregation or multi-hop reasoning.

5. Graph Structure and Generator Influence LLM Performance

The structure of the input graph, determined by the graph generator (Erdős–Rényi, Barabási–Albert, SBM, star, path, complete), has a marked impact on LLM accuracy. For instance, cycle detection is trivial for complete graphs but challenging for path graphs, reflecting LLMs' strong priors and susceptibility to distractors in the encoding.

Figure 3: Samples of graphs generated with different graph generators, illustrating the diversity of structures in the GraphQA benchmark.
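
All of these generator families are available in networkx; a sketch with illustrative sizes and densities (the paper samples its own configurations):

```python
import networkx as nx

# Sketch of the generator families behind GraphQA; parameters are
# illustrative, not the paper's sampling setup.
generators = {
    "erdos_renyi":     lambda: nx.erdos_renyi_graph(n=10, p=0.3, seed=0),
    "barabasi_albert": lambda: nx.barabasi_albert_graph(n=10, m=2, seed=0),
    "sbm":             lambda: nx.stochastic_block_model(
                                   [5, 5], [[0.6, 0.1], [0.1, 0.6]], seed=0),
    "star":            lambda: nx.star_graph(9),       # one hub, nine leaves
    "path":            lambda: nx.path_graph(10),      # a chain: never has a cycle
    "complete":        lambda: nx.complete_graph(10),  # every pair connected
}

for name, gen in generators.items():
    G = gen()
    print(f"{name}: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```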

6. LLMs Lack a Global Model of the Graph

Tasks requiring reasoning about the absence of edges (e.g., disconnected nodes) expose a critical weakness: LLMs are unable to infer global properties not explicitly encoded in the text, achieving near-zero accuracy.

Analysis of Graph Encoding Strategies

The authors provide a detailed ranking of encoding functions across tasks and prompting methods. Incident encoding is generally optimal for most prompting strategies, except in zero-shot settings where encodings with familiar names (e.g., politicians, fictional characters) perform better. The paper also finds that distractive statements in the encoding degrade performance, especially in dense graphs.

Figure 4: Example graph used to illustrate the output of different graph encoding functions.

Implications and Future Directions

Practical Implications

  • Black-box LLMs: The paper focuses on scenarios where LLM weights are inaccessible, emphasizing the importance of prompt and encoding design for practical deployment.
  • Task-Specific Encoding: Careful selection of encoding and question phrasing can yield performance gains of 4.8% to 61.8% on graph reasoning tasks, underscoring the need for task-aware prompt engineering.
  • Benchmarking: The GraphQA benchmark provides a valuable resource for evaluating and comparing LLMs on structured reasoning tasks.

Theoretical Implications

  • Limitations of Textual Encodings: The inability of LLMs to construct a global model of the graph from text suggests fundamental representational bottlenecks.
  • Inductive Biases: LLMs exhibit strong priors based on training data distributions, which can be maladaptive for synthetic or out-of-distribution graph structures.

Future Research Directions

  • Hybrid Architectures: Integrating explicit graph neural modules or external memory with LLMs may address the observed limitations in global reasoning.
  • Automated Encoding Search: Meta-learning or reinforcement learning approaches to discover optimal graph-to-text encodings could further improve performance.
  • Instruction Tuning: Fine-tuning LLMs on graph-structured data or augmenting pretraining corpora with synthetic graph-text pairs may enhance inductive biases for structured reasoning.

Conclusion

This paper provides a rigorous, empirical foundation for understanding how LLMs process graph-structured data when presented as text. The results demonstrate that LLM performance on graph reasoning is highly sensitive to encoding choices, prompt engineering, and graph structure. While current LLMs are not yet reliable for general graph reasoning in a black-box setting, the insights and benchmarks introduced here lay the groundwork for future advances in structured reasoning with LLMs.

Explain it Like I'm 14

Explaining “Talk like a Graph: Encoding Graphs for Large Language Models”

1) What is this paper about?

This paper looks at how to help LLMs—computer programs that read and write text—reason about graphs. A graph is just a way to show connections, like a map of who is friends with whom, roads between cities, or links between web pages. Because LLMs mainly read text, the authors ask: how should we write (encode) a graph as text so the LLM can understand it and answer questions about it?

2) What questions are the researchers trying to answer?

The paper focuses on three main questions, explained in simple terms:

  • How should we write down a graph as text so an LLM can use it well?
  • How should we ask the question (the prompt) so the LLM knows exactly what to do?
  • Does the shape or structure of the graph (for example, a chain vs. a web) change how well the LLM can reason about it?

They also ask: do bigger LLMs do better, and do small changes in wording or examples make a big difference?

3) How did they study this?

The researchers treated the LLM like a “black box” (they didn’t change the inside of the model). They only changed the text they gave it. They tried two main things:

  • Different ways to write the graph as text (graph encodings). Examples include:
    • Listing edges as pairs of numbers (like “(0, 1), (1, 2)”)—think of this as a very math-like list.
    • Writing neighbors of each node (like “Node 2 is connected to 0, 1, 3…”).
    • Using natural language with names and relationships (like “James and Mary are friends”).
  • Different ways to ask the question (prompting). They tested:
    • Zero-shot: just ask the question directly.
    • Few-shot: show a few example questions and answers first.
    • Chain-of-thought (CoT): ask the model to explain its steps.
    • Variations like zero-shot CoT (“Let’s think step by step”) and “BAG” (build-a-graph) prompting, a graph-specific hint that asks the model to construct the graph from the description first.

They built a set of simple graph tasks, called GraphQA, to test the LLM. These are basic but important building blocks for harder problems. For example:

  • Edge existence: Is there a connection between A and B?
  • Node degree: How many connections does a node have?
  • Counting nodes or edges
  • Listing neighbors (connected nodes)
  • Cycle check: Does the graph contain a loop?
  • Disconnected nodes: Who is NOT connected to this person?

They also tried different kinds of graphs: random graphs, “star” graphs (one central hub), “path” graphs (a straight chain), complete graphs (everyone connected to everyone), and other real-world-like generators. Finally, they tested different model sizes to see if bigger models help.

4) What did they find, and why is it important?

The short version: details matter—a lot.

  • LLMs struggle with even basic graph tasks if the text is written in the wrong way. Just because an LLM is great with regular text doesn’t mean it automatically understands graphs written as sentences.
  • The way you write the graph changes performance a lot. For example:
    • Writing “who is connected to whom” (incident encoding) helped on tasks like finding a node’s degree.
    • Using numbers for nodes (like node 0, node 1) helped with counting and math-like tasks.
    • Using human names sometimes helped for simple yes/no tasks.
  • The way you ask the question matters. Asking in a natural, everyday way (“How many friends does Alex have?”) sometimes worked better than using abstract graph terms (“What is the degree of node i?”).
  • Simple tasks don’t always need step-by-step reasoning. For many basic tasks, a straightforward prompt (zero-shot) worked better than “Let’s think step by step.” But for more complex tasks, examples and chain-of-thought helped.
  • Bigger models generally did better, but size alone didn’t fix everything. If the graph was encoded poorly or the question was unclear, even large models failed.
  • The shape of the graph matters. The model’s accuracy changed a lot across different graph types. For example:
    • It often assumed cycles exist (a bias), so it did well on graphs that usually have cycles (like complete graphs) and poorly on graphs that never have cycles (like path graphs).
    • Graphs with fewer edges (shorter descriptions) were easier because there was less distracting text.
  • LLMs are bad at reasoning about things that are not explicitly written. For example, listing who is NOT connected (disconnected nodes) was very hard, because the graph text mostly mentions what does exist, not what doesn’t.
  • Picking the right encoding and prompt can boost accuracy by a lot—sometimes from a few percent up to over 60% improvement, depending on the task.

Why this matters: If you want to use LLMs to analyze networks—social media, recommendation systems, fraud rings, and more—you can’t just paste in graph data randomly. You must write it in the “language” the LLM handles best for your specific task.

5) What’s the impact of this research?

This work gives practical guidance for anyone who wants LLMs to reason about graphs using only text:

  • Choose your graph encoding based on the task (e.g., use numeric nodes for counting, neighbor lists for degree-related questions).
  • Ask questions in natural and clear ways; sometimes “real-world” phrasing beats tech-speak.
  • Provide short, focused graph descriptions to reduce distraction.
  • Add a few examples and step-by-step reasoning for harder tasks.
  • Be aware of model biases (like assuming cycles are common).
  • Don’t expect the model to infer missing information unless you prompt or encode it explicitly.

The authors also release GraphQA, a benchmark to help the community test and improve how LLMs handle graph reasoning. Overall, the paper shows that with the right “translation” of a graph into text and careful prompting, LLMs can do much better—opening the door to smarter tools for understanding complex networks in everyday domains.
