Large Language Models Can Learn Temporal Reasoning (2401.06853v6)

Published 12 Jan 2024 in cs.CL

Abstract: While LLMs have demonstrated remarkable reasoning capabilities, they are not without their flaws and inaccuracies. Recent studies have introduced various methods to mitigate these limitations. Temporal reasoning (TR), in particular, presents a significant challenge for LLMs due to its reliance on diverse temporal concepts and intricate temporal logic. In this paper, we propose TG-LLM, a novel framework towards language-based TR. Instead of reasoning over the original context, we adopt a latent representation, temporal graph (TG) that enhances the learning of TR. A synthetic dataset (TGQA), which is fully controllable and requires minimal supervision, is constructed for fine-tuning LLMs on this text-to-TG translation task. We confirmed in experiments that the capability of TG translation learned on our dataset can be transferred to other TR tasks and benchmarks. On top of that, we teach LLM to perform deliberate reasoning over the TGs via Chain-of-Thought (CoT) bootstrapping and graph data augmentation. We observed that those strategies, which maintain a balance between usefulness and diversity, bring more reliable CoTs and final results than the vanilla CoT distillation.

This paper introduces TG-LLM, a framework designed to enhance the temporal reasoning (TR) capabilities of LLMs. It addresses the observation that LLMs often struggle with TR tasks, which require understanding complex temporal expressions and logic. The core idea is to adopt a two-step process:

  1. Text-to-Temporal Graph (TG) Translation: Instead of reasoning directly on the raw text, the framework first translates the input context into a structured Temporal Graph (TG). This TG acts as a latent representation, explicitly capturing entities, their relationships, and associated timestamps or temporal intervals (start/end times).
  2. Temporal Graph Reasoning: The LLM then performs reasoning based on this generated TG, guided by Chain of Thought (CoT) principles.
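At inference time these two steps run sequentially: the translation module maps the story to a TG, and the reasoning module then answers the question over that TG. The following minimal sketch illustrates the flow; the `generate` wrapper, prompt wording, and adapter names are illustrative assumptions rather than the paper's exact interface.

```python
# Minimal sketch of the two-step TG-LLM inference flow (illustrative).
# `generate(prompt, adapter)` is a hypothetical wrapper around the fine-tuned LLM;
# in the paper, two LoRA adapters on Llama-2-13B play these two roles.

def text_to_tg(story: str, generate) -> str:
    """Step 1: translate the raw story into a temporal graph (TG)."""
    prompt = (
        "Translate the story into a temporal graph, written as a chronological "
        "list of events with separate start/end times.\n\n"
        f"Story: {story}\nTemporal graph:"
    )
    return generate(prompt, adapter="tg_translation")

def tg_reasoning(tg: str, question: str, external_knowledge: str, generate) -> str:
    """Step 2: produce a Chain-of-Thought over the TG, then the final answer."""
    prompt = (
        f"Temporal graph:\n{tg}\n\n"
        f"Useful temporal facts: {external_knowledge}\n\n"
        f"Question: {question}\nLet's think step by step.\n"
    )
    return generate(prompt, adapter="tg_reasoning")

def answer(story: str, question: str, external_knowledge: str, generate) -> str:
    """Full pipeline: text -> TG -> CoT reasoning -> answer."""
    tg = text_to_tg(story, generate)
    return tg_reasoning(tg, question, external_knowledge, generate)
```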

TGQA Dataset

To train the text-to-TG translation component and facilitate TR learning, the authors constructed a synthetic dataset called TGQA.

  • Construction Pipeline:
    • Subgraphs are extracted from the YAGO11k temporal knowledge graph.
    • Entities are anonymized (replaced with random names of the same type) to prevent models from relying on memorized knowledge and encourage true reasoning.
    • GPT-3.5 is used to generate narrative stories based on these anonymized subgraphs.
    • Rule-based Python scripts generate diverse question-answer pairs directly from the ground-truth subgraphs. These cover various reasoning types such as sequencing, duration calculation, temporal relation identification, fact extraction, simultaneity checking, and comparative analysis (see Table 2; a small sketch of this generation step follows this list).
    • A semi-automatic verification step ensures alignment between the generated story and the underlying TG, minimizing noise.
  • Characteristics: TGQA is fully controllable, requires minimal supervision (only for story-TG alignment verification), provides ground-truth TGs, and features diverse question types.
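To make the rule-based QA generation step above concrete, the sketch below derives a few of the listed question types (duration, fact extraction, ordering) from a toy ground-truth subgraph. The event tuple format and question templates are illustrative assumptions, not the actual TGQA scripts.

```python
# Illustrative rule-based QA generation from a ground-truth temporal subgraph.
# Events are (subject, relation, object, start_year, end_year); this tuple format
# and the question templates are assumptions, not the exact TGQA pipeline.

events = [
    ("John Thompson", "worked at", "Imperial College", 1947, 1953),
    ("John Thompson", "married to", "Mary Adams", 1950, 1970),
]

def generate_qa(events):
    qa_pairs = []
    for subj, rel, obj, start, end in events:
        # Duration question derived from the start/end years.
        qa_pairs.append((f"How many years was ({subj}, {rel}, {obj}) true?", str(end - start)))
        # Fact-extraction question about the start time.
        qa_pairs.append((f"When did ({subj}, {rel}, {obj}) start?", str(start)))
    # Ordering question comparing the two events' start times.
    (s1, r1, o1, t1, _), (s2, r2, o2, t2, _) = events
    first = f"({s1}, {r1}, {o1})" if t1 < t2 else f"({s2}, {r2}, {o2})"
    qa_pairs.append(("Which event started first?", first))
    return qa_pairs

for question, gold_answer in generate_qa(events):
    print(question, "->", gold_answer)
```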

TG-LLM Implementation

1. Text-to-TG Translation:

  • Challenge: Real-world text lacks ground-truth TGs.
  • Pipeline for Real-World Data:

    1. Entity & Relation Extraction: Identify key entities and relations, often guided by the question being asked, using rules or LLMs.
    2. Temporal Info Identification: Extract and normalize time expressions from the text using an LLM (e.g., GPT-3.5) followed by rule-based filtering/normalization.
    3. TG Construction: Use an LLM (e.g., GPT-3.5) with few-shot In-Context Learning (ICL). Provide the story, extracted entities/relations, identified time expressions, and examples of (input, output TG) pairs. The preferred output format is a chronological list of events, separating start and end times.
    4. Verification: Use a semi-automatic process (querying an LLM about events in the TG based on the story, followed by manual checks for failures) to verify the generated TG quality.

  • Fine-tuning: The LLM (Llama-2 with LoRA in the paper) is fine-tuned on (story, TG) pairs from TGQA or generated using the pipeline above.
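As described above, the preferred TG format is a chronological list of events with start and end times kept separate. A hedged sketch of how one (story, TG) supervision pair might be serialized for fine-tuning is shown below; the prompt wording and dictionary keys are assumptions, not the paper's exact templates.

```python
# Illustrative serialization of one (story, TG) fine-tuning example.
# The TG follows the format described in the paper (chronological events,
# start/end times listed separately); prompt wording and keys are assumptions.

story = (
    "John Thompson joined Imperial College in 1947 and left in 1953. "
    "In 1950 he married Mary Adams."
)

# Temporal graph serialized as a chronological list of event statements.
tg_lines = [
    "(John Thompson, worked at, Imperial College) starts at 1947",
    "(John Thompson, married to, Mary Adams) starts at 1950",
    "(John Thompson, worked at, Imperial College) ends at 1953",
]

example = {
    "prompt": (
        "Translate the story into a temporal graph, written as a chronological "
        "list of events with separate start/end times.\n\n"
        f"Story: {story}\nTemporal graph:\n"
    ),
    "completion": "\n".join(tg_lines),
}

print(example["prompt"] + example["completion"])
```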

2. Temporal Graph Reasoning:

  • Input: The generated TG, the question, and optionally, external knowledge (EK). EK involves pre-calculating basic temporal facts from the TG's timestamps (e.g., 1947 < 1953, 1953 - 1947 = 6) and providing them in the context.
  • Fine-tuning with Enhanced CoT: The LLM is fine-tuned to generate a CoT rationale followed by the final answer. Two key techniques enhance this process:
    • CoT Bootstrapping:
    • Generate multiple CoT candidates (K) for a given training instance using an LLM (e.g., GPT-3.5) prompted with manually verified ICL examples.
    • Filter out CoTs leading to incorrect final answers.
    • Sample from the remaining correct CoTs using a weighted probability distribution (Equation 1). The score (Equation 2) balances the log-likelihood of the correct answer given the CoT (usefulness) against the "plausibility growth", i.e., how much the CoT raises the probability of the correct answer relative to the wrong answers (which promotes diversity). This aims for more reliable and diverse CoTs than simple best-of-N sampling or vanilla CoT distillation.
    • Equations 1 and 2:

      P_{\text{sample}}(c_k) = \text{softmax}(\text{score}(c_k))
      \text{score}(c_k) = \log P(a^* | g, e, q, c_k) + \gamma G(c_k)

    • where a^* is the correct answer, g is the TG, e is the external knowledge, q is the question, c_k is a candidate CoT, G(c_k) is its plausibility growth, and γ is a hyperparameter balancing the two terms. A minimal sketch of this weighted sampling follows this list.
    • Graph Data Augmentation:
    • Motivation: Improve robustness against errors in the TG predicted in Step 1 and prevent the model from merely memorizing graph patterns.
    • Strategies, applied during training to the ground-truth/verified TGs (a minimal sketch of strategies 3 and 4 follows this list):
      1. Remove Irrelevant Edges: Randomly remove events (edges) from the TG that are not mentioned in the question or the bootstrapped CoT.
      2. Use Relation Synonyms: Replace relation names with synonyms using a predefined mapping (e.g., "married to" -> "became life partner").
      3. Change Entity Names: Apply a global, consistent mapping that replaces entity names with new random names of the same type.
      4. Change Times: Apply a global offset to all timestamps.
    • For strategies 3 and 4, corresponding changes must be made to the question, CoT, and answer in the training data.
  • Architecture: The paper uses Llama-2-13B with two distinct LoRA adapters: one for text-to-TG translation and one for TG reasoning. These are trained in parallel but applied sequentially during inference.
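Referring back to Equations 1 and 2, the filtering and softmax-weighted sampling over correct CoT candidates can be sketched as follows. The sketch assumes the log-probability of the gold answer given each CoT and the plausibility-growth term G(c_k) have already been computed with the LLM; only the selection step is shown, and the field names are assumptions.

```python
import math
import random

def sample_cot(candidates, gamma=0.5):
    """Softmax-weighted sampling over correct CoT candidates (Equations 1 and 2).

    Each candidate is a dict with precomputed fields (obtaining them requires
    the LLM; they are assumed given here):
      - "cot": the chain-of-thought text
      - "is_correct": whether its final answer matches the gold answer
      - "log_p_answer": log P(a* | g, e, q, c_k)
      - "growth": the plausibility growth G(c_k)
    """
    # Filter out CoTs that lead to a wrong final answer.
    correct = [c for c in candidates if c["is_correct"]]
    if not correct:
        return None
    # score(c_k) = log P(a* | g, e, q, c_k) + gamma * G(c_k)   (Equation 2)
    scores = [c["log_p_answer"] + gamma * c["growth"] for c in correct]
    # Numerically stable softmax weights over the scores       (Equation 1)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return random.choices(correct, weights=weights, k=1)[0]["cot"]
```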
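The sketch below illustrates augmentation strategies 3 and 4 from the Graph Data Augmentation list (consistent entity renaming and a global time offset), together with the required propagation to the question/CoT/answer text. The event tuple format and the string-level propagation are illustrative assumptions, not the paper's code.

```python
import re

# Illustrative graph data augmentation (strategies 3 and 4); the event tuple
# format and string-level propagation are assumptions, not the paper's code.

def rename_entities(events, mapping):
    """Strategy 3: apply one consistent name mapping across the whole TG."""
    return [
        (mapping.get(s, s), r, mapping.get(o, o), start, end)
        for (s, r, o, start, end) in events
    ]

def shift_times(events, offset):
    """Strategy 4: shift every timestamp in the TG by the same global offset."""
    return [(s, r, o, start + offset, end + offset) for (s, r, o, start, end) in events]

def propagate_to_text(text, mapping, offset):
    """Apply the same renaming and time shift to question/CoT/answer strings."""
    for old, new in mapping.items():
        text = text.replace(old, new)
    # Shift all 4-digit years in a single pass to avoid chained replacements.
    return re.sub(r"\b[12]\d{3}\b", lambda m: str(int(m.group()) + offset), text)

# Example usage with a toy TG and question.
events = [("John Thompson", "worked at", "Imperial College", 1947, 1953)]
mapping = {"John Thompson": "Adam Clark"}
augmented = shift_times(rename_entities(events, mapping), offset=5)
question = propagate_to_text("How long did John Thompson work there after 1947?", mapping, 5)
print(augmented, question)
```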

Evaluation and Results

  • Datasets: TGQA, TimeQA (easy/hard modes), TempReason (L2/L3 difficulties).
  • Metrics: Exact Match (EM), Token-level F1, Perplexity-based Accuracy (Acc).
  • Findings:
    • The proposed CoT bootstrapping and graph data augmentation significantly improve reasoning performance and reduce CoT errors compared to baseline SFT and ICL methods (Figure 4, Table 3).
    • The full TG-LLM framework (SFT-TGR) substantially outperforms strong baselines, including GPT-3.5 and GPT-4 (with ICL), across all datasets (Table 4). Notably, the Llama-2-13B based TG-LLM achieves results comparable to or better than GPT-4.
    • Fine-tuning on the synthetic TGQA dataset demonstrably improves performance on other TR benchmarks (TimeQA, TempReason), indicating that the learned skills (TG translation, TG reasoning) generalize well (Figure 5).
    • Ablation studies confirmed the positive impact of each component: using TGs, CoT bootstrapping, graph augmentation, and incorporating external knowledge (Table 5).

Practical Implications

  • The two-step TG-LLM approach offers a practical way to improve LLM temporal reasoning by breaking down the problem. Translating text to a structured TG simplifies the subsequent reasoning task.
  • The TGQA dataset construction pipeline provides a template for generating synthetic, controllable data for fine-tuning models on structured reasoning tasks, requiring minimal supervision.
  • The CoT bootstrapping technique, using weighted sampling based on contrastive scores, is a practical method for generating high-quality, diverse reasoning chains for SFT, potentially applicable beyond temporal reasoning.
  • Graph data augmentation strategies are crucial for robustness when dealing with potentially noisy intermediate representations (like predicted TGs) and for improving generalization.
  • The framework demonstrates that smaller models (Llama-2-13B) fine-tuned with these structured reasoning techniques can achieve performance comparable to much larger models on specific reasoning tasks.

The code is available at: https://github.com/xiongsiheng/TG-LLM

Authors (4)
  1. Siheng Xiong
  2. Ali Payani
  3. Ramana Kompella
  4. Faramarz Fekri
Citations (49)