
Temp-R1: Autonomous Temporal KG QA Framework

Updated 1 February 2026
  • Temp-R1 is a unified, autonomous agent framework for Temporal Knowledge Graph Question Answering that facilitates complex multi-hop reasoning.
  • It uses an 8B-parameter decoder-only model with a structured internal-external action space to perform fine-grained planning, filtering, and retrieval.
  • Reverse curriculum learning paired with GRPO optimization results in significant performance improvements on multiple TKGQA benchmarks.

Temp-R1 is a unified, autonomous agent framework for Temporal Knowledge Graph Question Answering (TKGQA), designed to enable complex multi-hop temporal reasoning via reinforcement learning. It introduces a specialized internal-external action space and reverse curriculum learning, resulting in a state-of-the-art, end-to-end, self-supervised pipeline capable of solving sophisticated queries on temporal knowledge graphs without reliance on closed-source tools (Gong et al., 26 Jan 2026).

1. Agent Architecture and Action Space

Temp-R1 employs an 8B-parameter decoder-only LLM (e.g., Llama-3.1-8B-Instruct) as its base, operating within a Markov Decision Process (MDP) formalism. At each timestep, the agent observes its state $s_t$, defined by the question and the full dialogue/action history, and selects actions from a structured, expanded action space:

  • Internal reasoning actions: <plan>, <filter>, <rank>
  • External tool actions: <search> (calls a retriever)
  • Termination: <answer> (outputs the final answer; episode ends)

Mathematically, the action space is $\mathcal{A} = \mathcal{A}_{\rm internal} \cup \mathcal{A}_{\rm external} \cup \mathcal{A}_{\rm term}$, where $\mathcal{A}_{\rm internal} = \{\texttt{<plan>},\texttt{<filter>},\texttt{<rank>}\}$, $\mathcal{A}_{\rm external} = \{\texttt{<search>}\}$, and $\mathcal{A}_{\rm term} = \{\texttt{<answer>}\}$. This decomposition supports fine-grained planning, retrieval, constraint filtering, and ranking over temporal facts. The episode proceeds until the agent issues the <answer> action, after which a binary reward (correct/incorrect answer) is assigned.

The environment transitions are defined such that internal actions update the dialogue state with structured markers, external actions invoke the retriever and append new factual observations, and episodes terminate at the answer. There is no intermediate reward signal during the reasoning trace; only terminal outcome is rewarded.
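The transition rules above can be sketched as a minimal episode loop. The tag handling mirrors the described action space; the policy and retriever below are toy stubs for illustration, not the authors' implementation:

```python
# Sketch of the Temp-R1 episode loop (stubs and names are illustrative).

INTERNAL = {"<plan>", "<filter>", "<rank>"}
EXTERNAL = {"<search>"}
TERMINAL = {"<answer>"}

def run_episode(question, policy, retriever, max_steps=20):
    """Roll out one episode; return (trajectory, final_answer)."""
    state = [question]                        # dialogue/action history
    for _ in range(max_steps):
        tag, content = policy(state)          # agent selects next action
        if tag in INTERNAL:
            state.append(f"{tag}{content}")   # structured marker only
        elif tag in EXTERNAL:
            facts = retriever(content)        # external tool call
            state.append(f"{tag}{content}")
            state.append(f"<obs>{facts}")     # append factual observation
        elif tag in TERMINAL:
            return state, content             # episode ends at <answer>
    return state, None                        # no answer within budget

# Toy stubs to exercise the loop:
def toy_policy(state):
    if not any(s.startswith("<search>") for s in state):
        return "<search>", "president of France 2017"
    return "<answer>", "Emmanuel Macron"

def toy_retriever(query):
    return "(France, president, Macron, 2017-)"

traj, ans = run_episode("Who became president of France in 2017?",
                        toy_policy, toy_retriever)
```

The terminal reward would then be assigned by comparing `ans` against the gold answer, with no intermediate reward along `traj`.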

2. Reverse Curriculum Learning

Temp-R1 introduces a “hard-first” reverse curriculum learning schedule. Conventional RL or behavioral cloning on mixed-difficulty TKGQA datasets allows shortcut or degenerate search-and-answer policies to saturate performance on single-hop queries, without acquiring robust multi-hop reasoning. Temp-R1 circumvents this by:

  • Restricting initial RL training exclusively to the hard multi-hop subset ($\mathcal{D}_{\rm hard}$) during a warm-up phase ($t \leq T_0$), thereby enforcing exploration and mastery of recursive reasoning steps.
  • Reintroducing the easy single-hop questions ($\mathcal{D}_{\rm easy}$) only after this phase ($t > T_0$), allowing generalization and “transfer down” from complex to simple settings.

Formally, the data distribution over question difficulty evolves as $P_{\rm curr}(d=1)=1,\ P_{\rm curr}(d=0)=0$ for $t \leq T_0$, and $P_{\rm curr}(d=1)=P_{\rm curr}(d=0)=\frac{1}{2}$ for $t > T_0$, where $d(q)$ is a binary difficulty label. This approach was empirically necessary for the agent to acquire non-trivial temporal reasoning capabilities.
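This schedule reduces to a simple difficulty-first sampler; a minimal sketch (function and variable names are assumptions, not from the paper):

```python
import random

def curriculum_sample(step, T0, D_hard, D_easy, batch_size=4, rng=random):
    """Sample a batch under the reverse curriculum:
    P(d=1)=1 for step <= T0; P(d=1)=P(d=0)=1/2 afterwards."""
    batch = []
    for _ in range(batch_size):
        if step <= T0 or rng.random() < 0.5:
            batch.append(rng.choice(D_hard))   # hard multi-hop question
        else:
            batch.append(rng.choice(D_easy))   # easy single-hop question
    return batch

# During warm-up, only hard questions are ever drawn:
warmup_batch = curriculum_sample(step=100, T0=10_000,
                                 D_hard=["h1", "h2"], D_easy=["e1"],
                                 batch_size=8)
```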

3. Training and Optimization

Training is two-phased:

  • Supervised Fine-Tuning (SFT): The LM is first cold-started by mimicking ~1,000 high-quality reasoning traces (“trajectories”) generated by GPT-4o, ensuring the agent learns action formatting and sub-question planning. A masked cross-entropy loss is used, focusing solely on the action-tagged tokens.
  • Reinforcement Learning (RL): The model is then optimized with Group Relative Policy Optimization (GRPO), a variant of PPO leveraging group-normalized advantages. For each sampled query, $G$ rollouts are performed; relative advantages are computed against the group mean and variance: $\hat A_i = \frac{r_i - \bar r}{\sqrt{\frac{1}{G}\sum_{k=1}^{G}(r_k - \bar r)^2} + \eta}$, with $r_i \in \{0,1\}$ the terminal reward. The GRPO surrogate objective includes clipped importance weighting and a KL penalty with respect to a reference policy.
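The group-normalized advantage is a direct transcription of the formula above, with $\eta$ acting as a numerical stabilizer:

```python
import math

def group_advantages(rewards, eta=1e-6):
    """A_i = (r_i - mean) / (std + eta), normalized within a group
    of G rollouts for the same query (binary terminal rewards)."""
    G = len(rewards)
    mean = sum(rewards) / G
    var = sum((r - mean) ** 2 for r in rewards) / G
    return [(r - mean) / (math.sqrt(var) + eta) for r in rewards]
```

Note that if every rollout in a group receives the same reward (all correct or all incorrect), the advantages collapse to zero and that group contributes no gradient.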

No replay buffer or explicit value network is required; policy gradients are computed on-the-fly within each group batch.
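A sequence-level sketch of the surrogate objective, combining the clipped ratio with a KL penalty to the reference policy, might look as follows; the exact per-token objective and KL estimator used by the authors are assumptions here:

```python
import math

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
    """Illustrative GRPO surrogate over one group of G rollouts:
    clipped importance weighting plus a KL penalty to pi_ref."""
    total = 0.0
    for lp_n, lp_o, lp_r, A in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_n - lp_o)                 # pi_theta / pi_old
        clipped = max(min(ratio, 1 + eps), 1 - eps)   # PPO-style clip
        surrogate = min(ratio * A, clipped * A)
        # Unbiased nonnegative KL estimator, KL(pi_theta || pi_ref):
        kl = math.exp(lp_r - lp_n) - (lp_r - lp_n) - 1.0
        total += surrogate - beta * kl
    return -total / len(advantages)   # negate: minimizing loss maximizes J
```

When the current, old, and reference log-probabilities coincide, the ratio is 1 and the KL term vanishes, so the loss reduces to the negated mean advantage.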

4. Empirical Results and Ablation

Temp-R1 achieved state-of-the-art results on principal TKGQA benchmarks:

  • MultiTQ (Hits@1): overall 0.780; challenging “multiple” category (multi-hop): 0.550 vs 0.409 for previous SOTA (PoK), a 19.8% improvement.
  • TimelineKGQA: CronQuestion-KG (in-domain): 0.705 vs 0.651 (best baseline); ICEWS-Actor (out-of-domain): 0.642 vs 0.602.

Key ablations on MultiTQ demonstrated necessity of each component:

  • w/o internal actions: overall 0.620 (–21%); multiple 0.388
  • w/o reverse curriculum: overall 0.556 (–29%); multiple 0.143
  • w/o SFT: overall 0.582; multiple 0.325

This suggests both the structured action space and hard-first training schedule are critical for robust temporal reasoning.

5. High-Level Training Workflow and Hyperparameters

The high-level pseudocode is as follows:

Input: Pretrained LM θ₀, SFT dataset D_sft, full TKGQA dataset D, curriculum threshold T₀,
       GRPO hyperparams {G, ε, β, η, lr_actor}

1. Supervised fine-tuning:
   θ ← θ₀
   for epoch in 1..2:
     for (q, τ_gold) in D_sft:
       compute masked loss L_sft
       θ ← θ – lr_sft·∇θ L_sft

2. RL with Reverse Curriculum:
   for step in 1..N_steps:
     if step ≤ T₀:
       Q ← batch from D_hard
     else:
       Q ← batch from D_hard ∪ D_easy
     for q in Q:
       for i in 1..G:
         τ_i ← rollout π_θ on q
         r_i ← 1 if answer correct else 0
       compute group advantages Â_i, accumulate loss
     θ ← θ + lr_actor·∇θ J_GRPO

Key hyperparameters:

  • SFT: lr = 2×10⁻⁵, batch = 16, epochs = 2, warmup = 0.1
  • RL: group size G = 5, clip ε = 0.2, KL penalty β = 0.01, lr_actor = 5×10⁻⁷, grad_clip = 5.0, vLLM temperature = 1.0, T₀ sets hard-only phase at ~10,000 RL steps
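For reference, the reported hyperparameters can be collected into one configuration (the key names are illustrative, not the authors'):

```python
# Temp-R1 hyperparameters as reported, gathered into a single config dict.
CONFIG = {
    "sft": {
        "lr": 2e-5,
        "batch_size": 16,
        "epochs": 2,
        "warmup_ratio": 0.1,
    },
    "rl": {
        "group_size": 5,      # G rollouts per query
        "clip_eps": 0.2,      # PPO-style clipping range
        "kl_beta": 0.01,      # KL penalty weight
        "lr_actor": 5e-7,
        "grad_clip": 5.0,
        "temperature": 1.0,   # vLLM sampling temperature
        "T0": 10_000,         # hard-only warm-up phase, in RL steps
    },
}
```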

At inference, the trained Temp-R1 model operates fully autonomously, with no external APIs required.

6. Impact and Future Directions

Temp-R1 establishes that fine-grained control over the agent’s reasoning operations (via structured internal action tags) and difficult-first reverse curriculum enable the emergence of non-trivial, generalizable skills for TKGQA. Ablations highlight a strong synergy between action-space expansion and curriculum design.

A plausible implication is that reverse curriculum RL schedules may benefit other domains where shortcut solutions dominate early learning. The Temp-R1 paradigm supports end-to-end, open-source TKGQA, offering a pathway to scalable, fully autonomous temporal reasoning agents beyond the limitations of specialist API chains or static planners (Gong et al., 26 Jan 2026).
