ReaL-TG-4B: Explainable Temporal Graph Forecasting
- The paper introduces ReaL-TG-4B, a model fine-tuned with reinforcement learning to deliver accurate link predictions and explicit reasoning traces on temporal graphs.
- ReaL-TG-4B is a fine-tuned language model that encodes temporal graphs via natural language, employing the T-CGS algorithm to prioritize recent interactions.
- The model achieves state-of-the-art performance with transparent, human-readable explanations, making it well suited to applications such as fraud detection and recommendation systems.
ReaL-TG-4B is an LLM fine-tuned for explainable link forecasting on temporal graphs, leveraging a reinforcement learning framework that optimizes both predictive accuracy and the quality of model-generated reasoning traces. Developed by applying the Reasoning-Enhanced Learning for Temporal Graphs (ReaL-TG) algorithm to the Qwen3-4B base model, ReaL-TG-4B produces competitive predictions with explicit, human-readable rationales, surpassing much larger state-of-the-art LLMs on standard temporal graph benchmarks while offering strong interpretability (Ding et al., 31 Aug 2025).
1. Architecture and Graph Encoding
ReaL-TG-4B is constructed by fine-tuning the Qwen3-4B pretrained model to operate on temporal graph link forecasting tasks. Its input prompt is generated from a temporal context graph extracted via the Temporal Context Graph Selection (T-CGS) algorithm. T-CGS performs a temporal random walk over the original temporal graph, employing a decay factor to prioritize recent and relevant interactions for the link prediction query.
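As a rough illustration of this recency-biased sampling, the sketch below performs backward temporal random walks whose transition weights decay exponentially with edge age. The function names, decay form, and hyperparameters are illustrative assumptions, not the paper's exact T-CGS procedure:

```python
import math
import random
from collections import defaultdict

def select_temporal_context(edges, query_node, query_time, decay=0.1,
                            num_walks=20, walk_len=4):
    """Sample a small context subgraph around `query_node` via repeated
    backward temporal random walks, down-weighting older interactions
    (illustrative stand-in for T-CGS, not the paper's exact algorithm)."""
    # Adjacency over past interactions only: node -> [(neighbor, timestamp), ...]
    adj = defaultdict(list)
    for u, v, t in edges:
        if t < query_time:
            adj[u].append((v, t))
            adj[v].append((u, t))

    context = set()
    for _ in range(num_walks):
        node, time = query_node, query_time
        for _ in range(walk_len):
            candidates = [(nbr, t) for nbr, t in adj[node] if t < time]
            if not candidates:
                break
            # Recency bias: weight each earlier edge by exp(-decay * age)
            weights = [math.exp(-decay * (time - t)) for _, t in candidates]
            nbr, t = random.choices(candidates, weights=weights, k=1)[0]
            context.add((node, nbr, t))
            node, time = nbr, t
    return context
```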
The model encodes the induced subgraph as text, verbalizing nodes, edges, and time-stamped interactions together with a natural language question. Prompts systematically instruct the model to output its deductive process within `<think>...</think>` tags and its formal answer within `<answer>...</answer>` tags, explicitly decoupling reasoning from prediction and exposing the model's internal rationale.
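A minimal sketch of such prompt construction is shown below; the wording and helper names are assumptions rather than the paper's actual template:

```python
def build_prompt(context_edges, query_node, query_time):
    """Render the sampled temporal context graph as text and instruct the
    model to reason inside <think> tags and answer inside <answer> tags.
    The phrasing here is illustrative, not the paper's prompt template."""
    lines = [f"Node {u} interacted with node {v} at time {t}."
             for u, v, t in sorted(context_edges, key=lambda e: e[2])]
    history = "\n".join(lines)
    return (
        "You are given a temporal graph of timestamped interactions:\n"
        f"{history}\n\n"
        f"Question: which nodes will node {query_node} interact with at time {query_time}?\n"
        "First reason step by step inside <think>...</think>, "
        "then list the predicted destination nodes inside <answer>...</answer>."
    )
```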
2. Reinforcement Learning Optimization
ReaL-TG-4B is trained under a reinforcement learning paradigm in which the model acts as a policy generating full textual outputs, i.e., both the prediction and the explicit reasoning trace. The reward for each rollout is outcome-based: the F1 score between the set of nodes predicted inside the `<answer>...</answer>` segment and the set of ground-truth destination nodes,

$$ r = \mathrm{F1}\big(\hat{\mathcal{V}}, \mathcal{V}^{*}\big) = \frac{2\,\big|\hat{\mathcal{V}} \cap \mathcal{V}^{*}\big|}{\big|\hat{\mathcal{V}}\big| + \big|\mathcal{V}^{*}\big|}, $$

where $\hat{\mathcal{V}}$ denotes the predicted node set and $\mathcal{V}^{*}$ the ground-truth destination set.
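A minimal sketch of this outcome reward, assuming the predicted nodes have already been parsed out of the `<answer>` segment:

```python
def outcome_reward(predicted, ground_truth):
    """Set-based F1 between the nodes parsed from <answer>...</answer>
    and the ground-truth destination nodes (outcome-only reward)."""
    pred, gold = set(predicted), set(ground_truth)
    if not pred or not gold:
        return 0.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```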
The optimization employs Group Relative Policy Optimization (GRPO), performing token-level update steps within groups of sampled responses. The per-token advantage for sample $i$ at token $t$ is the group-normalized reward

$$ \hat{A}_{i,t} = \frac{r_i - \mathrm{mean}\big(\{r_1,\dots,r_G\}\big)}{\mathrm{std}\big(\{r_1,\dots,r_G\}\big)}, $$

where $G$ is the group size. A Kullback–Leibler divergence penalty regularizes model updates and prevents deviation from the original LLM prior. The full GRPO objective is

$$ \mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \Big( \min\big(\rho_{i,t}\hat{A}_{i,t},\ \mathrm{clip}(\rho_{i,t},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_{i,t}\big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \Big) \right], $$

with importance ratio $\rho_{i,t} = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$, clipping parameter $\epsilon$, and KL coefficient $\beta$.
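The following PyTorch-style sketch illustrates the group-relative advantage computation and a clipped, KL-regularized per-token loss; the hyperparameters, KL estimator, and tensor layout are illustrative assumptions rather than the paper's exact training code:

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each rollout's scalar reward
    by the mean/std of its group (one group = rollouts for one query)."""
    r = torch.tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

def grpo_token_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
    """Clipped, KL-regularized per-token surrogate loss (minimal sketch).
    Tensors have shape [group, seq_len]; logp_old and logp_ref are assumed
    detached; eps and beta are illustrative hyperparameters."""
    ratio = (logp_new - logp_old).exp()
    adv = advantages.unsqueeze(-1)            # same advantage for every token of a rollout
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    # Per-token estimate of KL(pi_theta || pi_ref) (the k3 estimator commonly used with GRPO)
    kl = (logp_ref - logp_new).exp() - (logp_ref - logp_new) - 1
    per_token = torch.minimum(unclipped, clipped) - beta * kl
    return -per_token.mean()                  # minimize the negative surrogate objective
```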
Because the reward is computed exclusively from the final outcome (rather than stepwise or via teacher-forcing), ReaL-TG-4B is incentivized to self-explore diverse reasoning strategies, converging toward those that maximize both prediction accuracy and explanation quality.
3. Explainability and Evaluation of Reasoning Traces
A primary innovation is the explicit prompting and evaluation of reasoning traces. ReaL-TG-4B is compelled to articulate logical rationales supporting its predictions as part of its generated output. This design “opens the black box” of LLM-generated graph reasoning by coupling answer generation with mandatory explanation.
The model’s explanations are systematically evaluated by dedicated metrics:
- Faithfulness: the proportion of atomic claims in the reasoning that are supported by the input context.
- Logical Consistency: a normalized score (0–1) measuring coherence and logical progression within the reasoning.
- Answer–Explanation Alignment: the ratio of predictions explicitly justified by the preceding reasoning.
An LLM-as-a-Judge evaluation system (GPT-4.1 mini in reported experiments) is configured with a custom prompt template, automatically scoring each output along these criteria on a per-example basis and aggregating results over the test set.
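A sketch of how such a judge might be invoked is given below; the prompt template, JSON schema, and `call_judge` callable are illustrative assumptions, since the paper's exact judging prompt is not reproduced here:

```python
import json

JUDGE_TEMPLATE = """You are grading a model's reasoning about a temporal graph.
Context:
{context}

Reasoning trace:
{reasoning}

Predicted answer: {answer}

Return a JSON object with three fields, each a number in [0, 1]:
  "faithfulness": fraction of atomic claims supported by the context,
  "logical_consistency": coherence of the reasoning steps,
  "alignment": fraction of predictions justified by the reasoning."""

def judge_reasoning(context, reasoning, answer, call_judge):
    """Score one output with an LLM judge. `call_judge` is any callable that
    sends a prompt to the judge model (e.g. GPT-4.1 mini) and returns its
    text reply; the template above is illustrative."""
    prompt = JUDGE_TEMPLATE.format(context=context, reasoning=reasoning, answer=answer)
    scores = json.loads(call_judge(prompt))
    return (scores["faithfulness"], scores["logical_consistency"], scores["alignment"])
```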
4. Evaluation Methodology and Benchmark Results
Performance is benchmarked along two major axes: prediction quality and explanation quality.
- Link forecasting performance is measured with Mean Reciprocal Rank (MRR) and a penalized variant, pMRR, which lowers a query's score when the model over-generates candidate nodes that fall outside the ground-truth set (see the MRR sketch following this list).
- Reasoning trace evaluation uses the LLM-as-a-Judge protocol to assign faithfulness, logical consistency, and answer–explanation alignment scores. These metrics are averaged over multiple datasets for robust comparison.
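For concreteness, a minimal MRR implementation is sketched below; pMRR additionally penalizes over-generated candidates outside the ground-truth set, but its exact form is not reproduced here:

```python
def mean_reciprocal_rank(ranked_predictions, ground_truth):
    """Mean Reciprocal Rank over queries: for each query, take 1/rank of the
    first predicted node that appears in the ground-truth set (0 if none).
    `ranked_predictions` is a list of ranked node lists, one per query."""
    total = 0.0
    for preds, gold in zip(ranked_predictions, ground_truth):
        gold = set(gold)
        rr = 0.0
        for rank, node in enumerate(preds, start=1):
            if node in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_predictions)
```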
ReaL-TG-4B achieves strong results, outperforming models with larger parameter counts (including GPT-5 mini and Llama 3.3-70B) on tasks over both seen (wiki, subreddit, coin, flight) and unseen (uci, enron) temporal graph datasets. For example, on the “wiki” dataset, it attains an MRR of 0.824, exceeding the scores of all baseline models of greater or equal scale.
| Model | Params | Wiki MRR |
|---|---|---|
| ReaL-TG-4B | 4B | 0.824 |
| Llama3.3-70B | 70B | <0.8 |
| GPT-5 mini | ? | <0.8 |
This empirical advantage holds on both direct forecasting and structured reasoning evaluation metrics.
5. Real-World Applications
ReaL-TG-4B is specifically designed for deployment in environments requiring explainable inference about dynamic, temporally-evolving relational structures. Representative applications include:
- Recommendation systems: user–item link prediction paired with model-justified recommendations enhances transparency and user trust.
- Fraud detection/financial analysis: Application to transaction networks, identifying anomalous or suspicious links with traceable explanations.
- Social network analysis: Facilitates discovery and exploration of dynamic communities, while rationalizing forecasts of social interactions.
The model’s generalization to unseen graphs, combined with its built-in reasoning trace, supports its use in high-stakes or safety-critical domains where interpretable AI is mandated.
6. Broader Impact and Significance
ReaL-TG-4B demonstrates that outcome-driven reinforcement learning can yield both high-quality predictions and explainable, human-auditable reasoning in LLM-based temporal graph forecasting. Its framework is applicable to any scenario in which dynamic interactions require not only accurate prediction but also interpretable rationale. By providing methods and evaluation protocols for LLM-generated reasoning in complex structured domains, it sets a precedent for scalable, explainable, and generalizable reasoning in dynamic graph analysis (Ding et al., 31 Aug 2025).
A plausible implication is that further advances in outcome-based RL frameworks and context selection methods could enable even smaller models to rival much larger LLMs—potentially with enhanced faithfulness and logical consistency—across a broader spectrum of relational reasoning problems.