This paper introduces TG-LLM, a framework designed to enhance the temporal reasoning (TR) capabilities of LLMs. It addresses the observation that LLMs often struggle with TR tasks, which require understanding complex temporal expressions and logic. The core idea is to adopt a two-step process:
- Text-to-Temporal Graph (TG) Translation: Instead of reasoning directly on the raw text, the framework first translates the input context into a structured Temporal Graph (TG). This TG acts as a latent representation, explicitly capturing entities, their relationships, and associated timestamps or temporal intervals (start/end times).
- Temporal Graph Reasoning: The LLM then performs reasoning based on this generated TG, guided by Chain of Thought (CoT) principles.
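To make the intermediate representation concrete, here is a minimal sketch of a TG as a plain list of (subject, relation, object, start, end) records, plus a helper that renders it as the chronological start/end event list used later in the pipeline. The names, dates, and exact schema are illustrative, not the paper's.

```python
# One record per temporal fact: (subject, relation, object, start_year, end_year).
# Entity names and dates are invented for illustration; the paper's exact schema may differ.
temporal_graph = [
    ("John Thompson", "was born in", "Weston", 1921, 1921),
    ("John Thompson", "was married to", "Mary Cassidy", 1947, 1953),
    ("John Thompson", "worked at", "Crestfield University", 1950, 1962),
]

def events_in_order(tg):
    """Render the TG as a chronological list of events with start/end times separated."""
    timeline = []
    for subj, rel, obj, start, end in tg:
        timeline.append((start, f"({subj}, {rel}, {obj}) starts"))
        timeline.append((end, f"({subj}, {rel}, {obj}) ends"))
    return [f"{year}: {event}" for year, event in sorted(timeline)]

print("\n".join(events_in_order(temporal_graph)))
```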
TGQA Dataset
To train the text-to-TG translation component and facilitate TR learning, the authors constructed a synthetic dataset called TGQA.
- Construction Pipeline:
- Subgraphs are extracted from the YAGO11k temporal knowledge graph.
- Entities are anonymized (replaced with random names of the same type) to prevent models from relying on memorized knowledge and encourage true reasoning.
- GPT-3.5 is used to generate narrative stories based on these anonymized subgraphs.
- Rule-based Python scripts generate diverse question-answer pairs directly from the ground-truth subgraphs. These cover various reasoning types such as sequencing, duration calculation, temporal relation identification, fact extraction, simultaneity checking, and comparative analysis (see Table 2); a minimal QA-generation sketch is given after this list.
- A semi-automatic verification step ensures alignment between the generated story and the underlying TG, minimizing noise.
- Characteristics: TGQA is fully controllable, requires minimal supervision (only for story-TG alignment verification), provides ground-truth TGs, and features diverse question types.
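As a concrete (hypothetical) instance of the rule-based QA generation referenced above, the sketch below derives a duration question and a sequencing question directly from ground-truth subgraph edges. The authors' actual scripts and question templates are more varied than this.

```python
# Ground-truth subgraph edges: (subject, relation, object, start_year, end_year).
edges = [
    ("John Thompson", "was born in", "Weston", 1921, 1921),
    ("John Thompson", "was married to", "Mary Cassidy", 1947, 1953),
    ("John Thompson", "worked at", "Crestfield University", 1950, 1962),
]

def duration_qa(edge):
    """Duration calculation: how many years did the fact hold?"""
    subj, rel, obj, start, end = edge
    question = f"How long did the fact ({subj}, {rel}, {obj}) hold?"
    answer = f"{end - start} years"
    return question, answer

def first_event_qa(edges, subject):
    """Sequencing: which fact involving `subject` started first?"""
    subj, rel, obj, start, _ = min((e for e in edges if e[0] == subject), key=lambda e: e[3])
    return f"Which event involving {subject} happened first?", f"({subj}, {rel}, {obj})"

print(duration_qa(edges[1]))                    # ('How long did the fact (...married to...) hold?', '6 years')
print(first_event_qa(edges, "John Thompson"))   # (..., '(John Thompson, was born in, Weston)')
```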
TG-LLM Implementation
1. Text-to-TG Translation:
- Challenge: Real-world text lacks ground-truth TGs.
- Pipeline for Real-World Data:
1. Entity & Relation Extraction: Identify key entities and relations, often guided by the question being asked, using rules or LLMs. 2. Temporal Info Identification: Extract and normalize time expressions from the text using an LLM (e.g., GPT-3.5) followed by rule-based filtering/normalization. 3. TG Construction: Use an LLM (e.g., GPT-3.5) with few-shot In-Context Learning (ICL). Provide the story, extracted entities/relations, identified time expressions, and examples of (input, output TG) pairs. The preferred output format is a chronological list of events, separating start and end times. 4. Verification: Use a semi-automatic process (querying an LLM about events in the TG based on the story, followed by manual checks for failures) to verify the generated TG quality.
- Fine-tuning: The LLM (Llama-2 with LoRA in the paper) is fine-tuned on (story, TG) pairs from TGQA or generated with the pipeline above.
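A rough sketch of step 3 of the pipeline (TG construction via few-shot ICL), assuming the OpenAI Python SDK as the LLM interface. The prompt wording, the in-context example, and the helper names are illustrative rather than the paper's exact prompts.

```python
from openai import OpenAI   # assumes the OpenAI Python SDK; any chat-completion API would do

client = OpenAI()

# One in-context (story, TG) example; the paper uses a handful of verified pairs.
ICL_EXAMPLE = """Story: John Thompson married Mary Cassidy in 1947; they divorced in 1953.
Temporal graph (chronological, start/end separated):
1947: (John Thompson, was married to, Mary Cassidy) starts
1953: (John Thompson, was married to, Mary Cassidy) ends
"""

def build_tg_prompt(story, entities_relations, time_expressions):
    """Assemble a few-shot prompt: instruction + example + extracted keywords + the target story."""
    return (
        "Translate the story into a temporal graph, listing events chronologically "
        "with separate start and end times.\n\n"
        f"Example:\n{ICL_EXAMPLE}\n"
        f"Relevant entities/relations: {entities_relations}\n"
        f"Time expressions: {time_expressions}\n"
        f"Story: {story}\nTemporal graph:"
    )

prompt = build_tg_prompt(
    story="Liam Reyes worked at Crestfield University from 1950 until 1962.",
    entities_relations="(Liam Reyes, worked at, Crestfield University)",
    time_expressions="1950, 1962",
)
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```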
2. Temporal Graph Reasoning:
- Input: The generated TG, the question, and, optionally, external knowledge (EK). EK involves pre-calculating basic temporal facts from the TG's timestamps (e.g., 1947 < 1953, 1953 - 1947 = 6) and providing them in the context.
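A minimal sketch of the EK pre-computation, assuming EK is simply the pairwise orderings and differences of the years appearing in the TG:

```python
from itertools import combinations

def external_knowledge(timestamps):
    """Pre-compute basic temporal facts (orderings and differences) from the TG's timestamps."""
    facts = []
    for a, b in combinations(sorted(set(timestamps)), 2):
        facts.append(f"{a} < {b}")
        facts.append(f"{b} - {a} = {b - a}")
    return facts

print(external_knowledge([1947, 1953]))   # ['1947 < 1953', '1953 - 1947 = 6']
```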
- Fine-tuning with Enhanced CoT: The LLM is fine-tuned to generate a CoT rationale followed by the final answer. Two key techniques enhance this process:
- CoT Bootstrapping:
- Generate multiple CoT candidates (K) for a given training instance using an LLM (e.g., GPT-3.5) prompted with manually verified ICL examples.
- Filter out CoTs leading to incorrect final answers.
- Sample from the remaining correct CoTs using a weighted probability distribution (Equation 1). The score (Equation 2) balances the perplexity of the correct answer given the CoT (usefulness) against the "plausibility growth", i.e., how much the CoT increases the probability of the correct answer relative to wrong answers (diversity). This aims for more reliable and diverse CoTs than simple best-of-N sampling or standard distillation.

$$P_{\text{sample}}(c_k) = \text{softmax}(\text{score}(c_k)) \tag{1}$$

$$\text{score}(c_k) = \log P(a^* \mid g, e, q, c_k) + \gamma\, G(c_k) \tag{2}$$

- where $a^*$ is the correct answer, $g$ is the TG, $e$ is the external knowledge, $q$ is the question, $c_k$ is the CoT candidate, $G(c_k)$ is its plausibility growth, and $\gamma$ is a hyperparameter.
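The weighted sampling step can be sketched as below. The inputs (the per-candidate answer log-probability and the plausibility growth $G(c_k)$) would come from scoring each CoT with the LLM, and the $\gamma$ value here is an arbitrary placeholder.

```python
import numpy as np

def sample_cot(correct_cots, log_p_answer, plausibility_growth, gamma=0.5):
    """Weighted sampling over answer-correct CoT candidates (Equations 1-2).

    log_p_answer[k]        ~ log P(a* | g, e, q, c_k), from the LLM's token probabilities
    plausibility_growth[k] ~ G(c_k), how much c_k raises P(correct answer) vs. wrong answers
    """
    scores = np.asarray(log_p_answer) + gamma * np.asarray(plausibility_growth)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                              # softmax over candidate scores
    return correct_cots[np.random.choice(len(correct_cots), p=probs)]

# Toy example with three surviving (answer-correct) CoT candidates.
cots = ["CoT A ...", "CoT B ...", "CoT C ..."]
print(sample_cot(cots, log_p_answer=[-0.2, -1.1, -0.6], plausibility_growth=[0.3, 0.9, 0.1]))
```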
- Graph Data Augmentation:
- Motivation: Improve robustness against errors in the predicted TG from Step 1 and prevent the model from merely memorizing graph patterns.
- Strategies: Applied during training to the ground-truth/verified TGs:
- 1. Remove Irrelevant Edges: Randomly remove events (edges) from the TG that are not mentioned in the question or the bootstrapped CoT.
- 2. Use Relation Synonyms: Replace relation names with synonyms using a predefined mapping (e.g., "married to" -> "became life partner").
- 3. Change Entity Names: Apply a global, consistent mapping to replace entity names with new random names of the same type.
- 4. Change Times: Apply a global offset to all timestamps.
- For strategies 3 and 4, corresponding changes must be made to the question, CoT, and answer in the training data.
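A minimal sketch of strategies 3 and 4 (consistent entity renaming and a global time offset), applied jointly to the TG, question, CoT, and answer of one training example. The regex, offset range, and field names are assumptions for illustration.

```python
import random
import re

def shift_years(text, offset):
    """Strategy 4: add one global offset to every four-digit year in the text."""
    return re.sub(r"\b(1[0-9]{3}|20[0-9]{2})\b",
                  lambda m: str(int(m.group()) + offset), text)

def rename_entities(text, mapping):
    """Strategy 3: apply a single, globally consistent entity-name mapping."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    return text

def augment(example, mapping, offset):
    """Apply the same mapping and offset to every field so TG, question, CoT, and answer stay consistent."""
    return {field: shift_years(rename_entities(text, mapping), offset)
            for field, text in example.items()}

example = {
    "tg": "1947: (John Thompson, was married to, Mary Cassidy) starts",
    "question": "When did John Thompson marry Mary Cassidy?",
    "cot": "The graph says the marriage started in 1947.",
    "answer": "1947",
}
print(augment(example, {"John Thompson": "Liam Reyes", "Mary Cassidy": "Ana Ortiz"},
              offset=random.randint(-10, 10)))
```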
- Architecture: The paper uses Llama-2-13B with two distinct LoRA adapters: one for text-to-TG translation and one for TG reasoning. These are trained in parallel but applied sequentially during inference.
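A sketch of how the two-adapter setup might be wired with Hugging Face `peft`, switching adapters between the two inference stages. Model and adapter paths are placeholders, and the exact training/serving code in the authors' repository may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# Attach both LoRA adapters to the same frozen base model (adapter paths are placeholders).
model = PeftModel.from_pretrained(base, "adapters/text_to_tg", adapter_name="text_to_tg")
model.load_adapter("adapters/tg_reasoning", adapter_name="tg_reasoning")

def run(prompt, adapter, max_new_tokens=256):
    """Generate with whichever adapter is active for the current stage."""
    model.set_adapter(adapter)   # switch LoRA weights without reloading the base model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Stage 1 produces the TG, stage 2 reasons over it (prompt builders omitted).
# tg = run(story_prompt, "text_to_tg")
# answer = run(reasoning_prompt_with(tg, question), "tg_reasoning")
```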
Evaluation and Results
- Datasets: TGQA, TimeQA (easy/hard modes), TempReason (L2/L3 difficulties).
- Metrics: Exact Match (EM), Token-level F1, Perplexity-based Accuracy (Acc).
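For reference, EM and token-level F1 are typically computed as in the sketch below (SQuAD-style normalization); the paper's exact normalization and its perplexity-based accuracy are not reproduced here.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("1947", "1947"), token_f1("in 1947", "1947"))   # 1.0 and ~0.67
```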
- Findings:
- The proposed CoT bootstrapping and graph data augmentation significantly improve reasoning performance and reduce CoT errors compared to baseline SFT and ICL methods (Figure 4, Table 3).
- The full TG-LLM framework (SFT-TGR) substantially outperforms strong baselines, including GPT-3.5 and GPT-4 (with ICL), across all datasets (Table 4). Notably, the Llama-2-13B based TG-LLM achieves results comparable to, or better than, GPT-4.
- Fine-tuning on the synthetic TGQA dataset demonstrably improves performance on other TR benchmarks (TimeQA, TempReason), indicating that the learned skills (TG translation, TG reasoning) generalize well (Figure 5).
- Ablation studies confirmed the positive impact of each component: using TGs, CoT bootstrapping, graph augmentation, and incorporating external knowledge (Table 5).
Practical Implications
- The two-step TG-LLM approach offers a practical way to improve LLM temporal reasoning by breaking down the problem. Translating text to a structured TG simplifies the subsequent reasoning task.
- The TGQA dataset construction pipeline provides a template for generating synthetic, controllable data for fine-tuning models on structured reasoning tasks, requiring minimal supervision.
- The CoT bootstrapping technique, using weighted sampling based on contrastive scores, is a practical method for generating high-quality, diverse reasoning chains for SFT, potentially applicable beyond temporal reasoning.
- Graph data augmentation strategies are crucial for robustness when dealing with potentially noisy intermediate representations (like predicted TGs) and for improving generalization.
- The framework demonstrates that smaller models (Llama-2-13B) fine-tuned with these structured reasoning techniques can achieve performance comparable to much larger models on specific reasoning tasks.
The code is available at: https://github.com/xiongsiheng/TG-LLM