GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation (2505.20416v1)

Published 26 May 2025 in cs.CL and cs.AI

Abstract: Fine-tuning for LLMs typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.

Summary

  • The paper introduces GraphGen, a framework that uses a knowledge graph-guided approach to generate synthetic QA data for LLM fine-tuning.
  • It employs a novel comprehension loss mechanism to detect weak knowledge areas and prioritize data generation for improving model calibration.
  • Experiments demonstrate that training on high-loss, long-tail synthetic data enhances factual accuracy and multi-hop reasoning in closed-book QA tasks.

GraphGen (2505.20416) is a framework designed to address the challenges of data scarcity and quality for supervised fine-tuning (SFT) of LLMs on knowledge-intensive tasks, particularly in closed-book question-answering (QA) settings. It proposes a knowledge graph (KG)-guided approach to generate high-quality, diverse synthetic data tailored for atomic, aggregated, and multi-hop QA scenarios.

The core idea is to leverage structured knowledge from a KG to guide the synthetic data generation process, overcoming limitations of existing LLM-based methods like factual inaccuracy, poor coverage of long-tail knowledge, superficial structure, and homogeneity.

The GraphGen pipeline consists of four main steps (illustrated in Figure 1):

  1. Knowledge Construction: The process begins with the raw text corpora ($D_{\text{source}}$). A Synthesizer Model ($M_{\text{synth}}$), typically a large, capable LLM, extracts entities and relationships from text fragments, constructing a fine-grained knowledge graph $G$. Descriptions of entities and relationships that appear in multiple fragments are aggregated. This step integrates LLMs with KGs to handle long-text processing and scattered knowledge while aiming to reduce hallucination (see the extraction sketch after this list).
  2. Comprehension Assessment: To identify knowledge gaps in the Trainee Model ($M_{\text{train}}$), the target LLM for SFT, GraphGen evaluates its understanding of each knowledge point (edge) in the KG. $M_{\text{synth}}$ generates paraphrased positive and negative statements for each edge description, and $M_{\text{train}}$ is prompted to state its confidence (the softmax probability of the 'yes'/'no' tokens) in each statement (Figure 2). An Expected Calibration Error (ECE)-based comprehension loss ($\text{Loss}_C$) is then computed for each knowledge point (Equation 2). A higher loss indicates that $M_{\text{train}}$'s confidence does not match the statement's truth value in the KG, flagging a knowledge blind spot; these high-loss points are prioritized for data generation. A worked example follows this list.

     $$C_{R_i} = \frac{1}{2n}\left(\sum_{j=1}^{n} P(t \mid R_{ij}) + \sum_{j=1}^{n} P(f \mid \neg R_{ij})\right)$$

     $$\text{Loss}_{C_{R_i}} = -\frac{1}{2n}\sum_{j=1}^{n} \log P(t \mid R_{ij}) - \frac{1}{2n}\sum_{j=1}^{n} \log P(f \mid \neg R_{ij})$$

  3. Graph Organization: Subgraphs are extracted from the KG to serve as the basis for QA generation. Algorithm 1 details a $k$-hop subgraph extraction process, with strategies that control subgraph composition to balance complexity and relevance: a depth strategy (maximum $k$-hop depth), a length strategy (capping the total token count of entity/relation descriptions in the subgraph at $pre\_length$), and a selection strategy (e.g., prioritizing edges with high comprehension loss, low loss, or random selection). Prioritizing high-loss edges targets the Trainee Model's weak areas; a simplified sketch appears after this list.
  4. QA Generation: The extracted subgraphs are converted into diverse QA pairs by $M_{\text{synth}}$.
    • Atomic QA: For simple subgraphs (a single node or edge), $M_{\text{synth}}$ generates a basic QA pair.
    • Aggregated QA: For more complex subgraphs, $M_{\text{synth}}$ first synthesizes the information into a coherent answer text and then generates a corresponding question.
    • Multi-hop QA: $M_{\text{synth}}$ is prompted to generate QA pairs that require multi-step reasoning across multiple knowledge points in the subgraph.
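
As a rough illustration of step 1, the sketch below prompts a synthesizer model for pipe-separated knowledge tuples and merges the descriptions of edges that recur across fragments. The `call_synthesizer` stub and the prompt wording are placeholders, not GraphGen's actual prompts:

```python
from collections import defaultdict

def call_synthesizer(prompt: str) -> str:
    """Placeholder for an API call to the synthesizer model
    (e.g., Qwen2.5-72B-Instruct); wire in a real client here."""
    raise NotImplementedError

EXTRACTION_PROMPT = (
    "Extract knowledge triples from the text below. Return one per line\n"
    "in the form: head | relation | tail | description\n\n{fragment}"
)

def build_graph(fragments):
    """Collect (head, relation, tail) edges across fragments; descriptions
    of the same edge seen in several fragments are aggregated."""
    edges = defaultdict(list)
    for fragment in fragments:
        raw = call_synthesizer(EXTRACTION_PROMPT.format(fragment=fragment))
        for line in raw.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 4:
                head, relation, tail, desc = parts
                edges[(head, relation, tail)].append(desc)
    return {edge: " ".join(descs) for edge, descs in edges.items()}
```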
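
Step 2's score and loss reduce to simple averages once the trainee's token probabilities are available. A minimal sketch of the two equations above, with `p_true` and `p_false` standing in for $M_{\text{train}}$'s softmax confidence on the $n$ positive and $n$ negative rephrasings of one edge:

```python
import math

def comprehension_score(p_true, p_false):
    """C_{R_i}: average trainee confidence that positive rephrasings
    are true and that negative rephrasings are false."""
    n = len(p_true)
    assert len(p_false) == n
    return (sum(p_true) + sum(p_false)) / (2 * n)

def comprehension_loss(p_true, p_false):
    """Loss_{C_{R_i}}: mean negative log-likelihood of the correct
    judgment; a high value marks a knowledge blind spot."""
    n = len(p_true)
    return (-sum(math.log(p) for p in p_true)
            - sum(math.log(p) for p in p_false)) / (2 * n)

# The trainee is confident and correct on the first edge but
# uncertain on the second, so the second is prioritized.
well_known = comprehension_loss([0.95, 0.90], [0.92, 0.88])
blind_spot = comprehension_loss([0.55, 0.40], [0.50, 0.45])
assert blind_spot > well_known
```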
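
Step 3 can be pictured as a budgeted breadth-first expansion around a seed edge. The sketch below is a simplified reading of Algorithm 1, not the paper's exact procedure; the `(head, tail)` edge identifiers and the word-count proxy for tokens are assumptions:

```python
import random
from collections import deque

def extract_subgraph(adj, edge_desc, seed, k, pre_length):
    """Grow a k-hop neighborhood around the seed edge (u, v), adding
    edges until the pre_length token budget is spent.
    adj: node -> list of (neighbor, edge_id) pairs, where each
         undirected edge is stored under one canonical edge_id
    edge_desc: edge_id -> description (words approximate tokens)"""
    u, v = seed
    chosen = {seed}
    budget = pre_length - len(edge_desc[seed].split())
    frontier, seen = deque([(u, 0), (v, 0)]), {u, v}
    while frontier and budget > 0:
        node, depth = frontier.popleft()
        if depth >= k:                        # depth strategy
            continue
        for nbr, edge in adj.get(node, ()):
            cost = len(edge_desc[edge].split())
            if edge in chosen or cost > budget:
                continue                      # length strategy
            chosen.add(edge)
            budget -= cost
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return chosen

def order_seeds(losses, strategy="max_loss"):
    """Selection strategy: rank candidate seed edges by loss."""
    edges = list(losses)
    if strategy == "random":
        random.shuffle(edges)
        return edges
    return sorted(edges, key=losses.get, reverse=(strategy == "max_loss"))
```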

Implementation and Practical Aspects:

  • Models: The paper uses Qwen2.5-7B-Instruct as the Trainee Model and Qwen2.5-72B-Instruct as the Synthesizer Model, demonstrating the approach with representative open-source LLMs.
  • Datasets and Scenarios: Experiments are conducted on SeedEval (Agriculture, Atomic QA), PQArefEval (Medical, Aggregated QA), and HotpotEval (General, Multi-hop QA). Source texts ($D_{\text{source}}$) are used for KG construction and data synthesis, while separate test sets ($D_{\text{eval}}$) evaluate the post-SFT model ($M_f$).
  • Evaluation: Data quality is assessed using metrics like MTLD (lexical diversity), UniEval (naturalness, coherence, understandability), and Reward scores. The effectiveness of the synthetic data for SFT is measured by the performance of $M_f$ on knowledge-intensive QA tasks using ROUGE-F.
  • Resource Requirements: Generating approximately 50,000 data entries takes about 2 hours with Qwen2.5-72B-Instruct, and SFT on Qwen2.5-7B-Instruct takes about 1 hour, using 8 NVIDIA A100 40GB GPUs. The computational cost of KG construction and processing for large corpora is noted as a limitation.
  • Targeting Long-Tail Knowledge: The comprehension loss mechanism lets GraphGen identify and prioritize generation for knowledge points where the Trainee Model is less confident or accurate. Experiments show that training on the 30% of data with the highest comprehension loss yields better performance gains than training on the 30% with the lowest loss, underscoring the practical value of targeting long-tail or less-mastered knowledge (see the selection sketch after this list).
  • Improved Understanding: The reduction in comprehension loss for $M_f$ after SFT demonstrates that training with GraphGen data improves the model's internal knowledge calibration and understanding, not just its ability to answer specific questions.
  • Scalability: The scaling law analysis indicates that while increasing data size generally helps, prioritizing data based on comprehension loss (hard examples) is more efficient than simply using more data, especially for common knowledge the model already understands.
  • Trade-offs and Ablations:
    • Using relationships or entities+relationships is more effective than just entities for atomic QA, as relationships better capture knowledge properties.
    • The maximum premise length constraint ($pre\_length$) impacts performance. Surprisingly, a smaller constraint (e.g., 256 tokens) led to better evaluation performance than larger ones (e.g., 1024), possibly because the resulting data length distribution affects convergence time.
    • The specific edge selection strategy (max_loss, min_loss, random) had minimal impact on final performance, suggesting that the benefit comes from exploring varied knowledge points within subgraphs rather than the precise order of selection.
  • Deployment: The generated synthetic data is used to fine-tune a potentially smaller or domain-specific model ($M_{\text{train}}$), which can then be deployed for the target QA tasks.
  • Extensibility: The framework includes optional modules like entity enrichment via Wikipedia and coreference resolution to improve KG quality and text coherence. A user interface is described for configuring parameters.
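
The loss-prioritized selection described above (including the top-30% experiment) amounts to ranking generated QA pairs by the comprehension loss of their source edges and keeping the hardest fraction. A hypothetical sketch, where each pair carries its source edge's loss:

```python
def select_hard_examples(qa_pairs, fraction=0.3):
    """qa_pairs: (question, answer, source_edge_loss) tuples.
    Keep the top `fraction` ranked by comprehension loss."""
    ranked = sorted(qa_pairs, key=lambda qa: qa[2], reverse=True)
    return ranked[: max(1, round(len(ranked) * fraction))]
```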

Limitations:

  • High computational cost for KG construction and processing.
  • Currently evaluated primarily on closed-book QA and three domains; adaptability to other tasks (math, coding) and domains needs exploration.
  • Requires careful tuning of the balance between synthetic and real data during SFT.
  • Does not directly address open-book QA (RAG), although integration is a potential future direction.

In summary, GraphGen offers a practical framework for generating effective synthetic data for LLM SFT by leveraging structured knowledge from KGs and dynamically identifying the Trainee Model's knowledge gaps using comprehension loss. This approach leads to improved factual accuracy, coverage of long-tail knowledge, structured reasoning capabilities (especially multi-hop), and data diversity compared to existing synthetic data methods, providing a viable solution to the data bottleneck in fine-tuning LLMs for knowledge-intensive applications.