TKG-Thinker: Temporal KGQA Agent Architecture

Updated 8 February 2026
  • TKG-Thinker is an agentic architecture for temporal KGQA that integrates multi-turn planning, tool-augmented retrieval, and reinforcement learning.
  • It employs a dual-training strategy combining supervised fine-tuning and RL to handle complex, multi-constraint temporal queries.
  • Experimental results show significant Hits@1 gains and enhanced multi-hop reasoning compared to prior static prompting methods.

TKG-Thinker designates a class of agentic architectures for Temporal Knowledge Graph Question Answering (TKGQA) that leverage autonomous planning, multi-turn dynamic retrieval, and reinforcement learning to achieve state-of-the-art temporal reasoning over knowledge graphs (KGs). TKG-Thinker agents operate in environments where queries involve time-sensitive facts and require grounding in evidence stored as temporal quadruples (subject, relation, object, timestamp). By integrating explicit step-by-step planning, adaptive tool use, and policy optimization, TKG-Thinker systems address known limitations of static prompting and decomposition-based approaches, especially in handling multi-constraint temporal queries (Jiang et al., 5 Feb 2026).

1. TKGQA Challenges and Motivation

The Temporal KGQA problem is defined over a temporal KG $\mathcal{G}$, with facts given as quadruples $(s, r, o, t)$. The task is: given a natural language question $Q$ with explicit or implicit temporal constraints, identify the correct response by grounding $Q$ in $\mathcal{G}$. Queries of interest include multi-entity, multi-hop, and compound temporal conditions such as “Which country negotiated last with entity $B$ before $A$?” or “Who was head of state before $X$ and after $Y$?”
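
As a concrete illustration (not from the paper), a temporal KG can be held as a set of quadruples and queried with time-constrained lookups; the `Quadruple` class and `search_before` helper below are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quadruple:
    subject: str
    relation: str
    obj: str
    timestamp: str  # ISO date, e.g. "2014-11-02"

def search_before(facts, relation, obj, before):
    """Return facts (s, relation, obj, t) with t strictly earlier than `before`,
    most recent first -- the kind of lookup a before-style temporal constraint needs."""
    hits = [f for f in facts
            if f.relation == relation and f.obj == obj and f.timestamp < before]
    return sorted(hits, key=lambda f: f.timestamp, reverse=True)

facts = [
    Quadruple("Country_A", "negotiate_with", "Country_B", "2014-06-10"),
    Quadruple("Country_C", "negotiate_with", "Country_B", "2014-11-02"),
]
# "Which country negotiated last with Country_B before 2015?"
print(search_before(facts, "negotiate_with", "Country_B", "2015-01-01")[0].subject)  # Country_C
```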

Prior KGQA methods—embedding-based, semantic-parsing, or standard LLM prompting—suffer key deficits:

  • Hallucinations and Temporal Drift: LLM-only approaches may hallucinate or misalign answers with real KG evidence, especially when navigating complex temporal operators (“before,” “between,” “last”).
  • Compositionality Breakdowns: Existing decomposition or retrieval-augmentation frameworks propagate errors across substeps, failing on compound or multi-hop constraints.
  • Statically Engineered Pipelines: Fixed prompting chains or static retrieval provide only limited model autonomy, lacking self-correction and adaptive planning capacities.

TKG-Thinker is designed to address these deficits by modeling TKGQA as an agent-environment interaction, enabling multi-turn evidence gathering, reasoning, and self-verification.

2. Agentic Architecture: Dual-Training and Tool-Augmentation

TKG-Thinker utilizes a two-stage dual-training strategy, combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), coupled with a planner-executor action loop and an interface to temporal search tools (Jiang et al., 5 Feb 2026).

  • SFT with Tool-Augmented Chain-of-Thought: The model is first trained on expert-generated trajectories in which each example alternates between planning (“Plan: …”), reasoning (“Thought: …”), tool calls (e.g., Search_time, Search_before), tool observations, and the final answer (an illustrative trajectory is sketched after this list).
  • RL Optimization under Multi-Dimensional Rewards: After SFT, the policy is refined by interacting with the real TKG environment, receiving composite feedback on answer correctness, retrieval validity, and stepwise format.
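
For concreteness, a hypothetical SFT training example in this tool-augmented chain-of-thought format might look as follows; the tags and the tool signature are illustrative, not the paper's exact templates:

```python
# Hypothetical SFT trajectory; tag names and the tool signature are illustrative.
SFT_EXAMPLE = """\
Question: Which country negotiated last with Country_B before 2015?
Plan: Retrieve all 'negotiate_with' facts about Country_B before 2015, then select the latest.
Thought: This needs a time-constrained lookup, so I should call the before-search tool.
Action: Search_before(relation="negotiate_with", object="Country_B", time="2015-01-01")
Observation: [("Country_C", "negotiate_with", "Country_B", "2014-11-02"),
              ("Country_A", "negotiate_with", "Country_B", "2014-06-10")]
Thought: The most recent qualifying fact is dated 2014-11-02 and involves Country_C.
Answer: Country_C
"""
```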

A typical interaction loop alternates “internal reasoning” actions (Plan, Think) with “tool calls” (e.g., Search_time, Search_before, Search_between), using retrieved evidence to refine the answer proposal. The agent observes the full interaction history $H_n$ and generates the next plan or retrieval action until an “Answer” action is emitted or a budget is exhausted.

| Component | Function | Notes |
|---|---|---|
| SFT (Stage 1) | Mimics expert trajectories with tool use | Bootstraps planning and tool protocols |
| RL (Stage 2) | Policy refinement using PPO/GRPO | Uses multi-dimensional rewards |
| Action Space | Plan, Think, Search_time, ..., Answer | Explicit alternation of planning and tool calls |
| Temporal Retriever | Search-specific interfaces for time-sensitive queries | E.g., Search_before, Search_between |
| Environment | Temporal KG $\mathcal{G}$ exposed as structured tool APIs | Agent observes $\mathcal{G}$ only through tool queries |
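
A minimal sketch of how these components compose at inference time, assuming a `policy` object (the SFT/RL-trained LLM) that maps the interaction history to its next action and a `tools` dict exposing the temporal retriever; both names are hypothetical:

```python
# Minimal planner-executor loop; `policy` and `tools` are hypothetical stand-ins.
def run_episode(policy, tools, question, max_steps=8):
    history = [("Question", question)]                   # H_t: full interaction history
    for _ in range(max_steps):
        action, payload = policy.next_action(history)    # e.g. ("Plan", "..."), ("Search_before", {...})
        history.append((action, payload))
        if action == "Answer":                           # terminal action: emit the prediction
            return payload, history
        if action in tools:                              # tool call: query the temporal KG environment
            observation = tools[action](**payload)
            history.append(("Observation", observation))
        # "Plan" / "Think" steps only extend the internal reasoning trace
    return None, history                                 # step budget exhausted without an answer
```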

3. Mathematical Formulation of Multi-Turn RL Optimization

In the RL stage, the TKGQA task is cast as a sequential decision process:

  • State: $s_t = H_t$, the accumulated plan, action, and observation history.
  • Action: $a_t$ chosen from $\mathcal{A} = \{$Plan, Think, Search_time, ..., Answer$\}$.
  • Rewards: At episode end (when “Answer” is called), three binary sub-rewards:
    • $r_{\text{out}}$: Exact match of $a_{\text{pred}}$ to the gold answer.
    • $r_{\text{ret}}$: Retrieved evidence covers the answer.
    • $r_{\text{fmt}}$: Format validity per protocol.

A combined terminal reward $R_{\text{all}}$ is defined as:

$$R_{\text{all}} = r_{\text{out}} \cdot \big[1 - (1 - r_{\text{fmt}})\,\lambda\big] + (1 - r_{\text{out}})\big[\alpha\, r_{\text{fmt}} + \gamma\, r_{\text{ret}}\big] + (1 - r_{\text{out}})\,\delta\,(1 - r_{\text{fmt}})$$

The parameters $\alpha$, $\gamma$, $\lambda$, and $\delta$ control the tradeoffs between answer quality, protocol compliance, and evidence coverage.
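
Transcribing the terminal reward directly into code (the coefficient values passed in are placeholders, not the paper's settings):

```python
def terminal_reward(r_out, r_ret, r_fmt, alpha, gamma, lam, delta):
    """Composite terminal reward R_all from binary sub-rewards (each 0 or 1).
    alpha, gamma, lam, delta are the tradeoff parameters; values are left to the user."""
    return (r_out * (1 - (1 - r_fmt) * lam)
            + (1 - r_out) * (alpha * r_fmt + gamma * r_ret)
            + (1 - r_out) * delta * (1 - r_fmt))

# A correct answer with a malformed trace is discounted by lam, e.g.:
# terminal_reward(1, 1, 0, alpha=0.1, gamma=0.3, lam=0.5, delta=-0.2) == 0.5
```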

Policy training is performed via PPO or Group Relative Policy Optimization (GRPO):

$$J(\theta) = \mathbb{E}_{Q,\,\{y_i\} \sim \pi_{\text{old}}}\left[\frac{1}{G}\sum_{i=1}^{G} f_\epsilon\big(\rho_i(\theta), \hat{A}_i\big)\right] - \beta\,\mathbb{E}_Q\left[D_{\text{KL}}\big(\pi_\theta(\cdot \mid Q)\,\|\,\pi_{\text{ref}}(\cdot \mid Q)\big)\right]$$

where $f_\epsilon$ is the PPO clipping objective, $\rho_i(\theta)$ the importance ratio between the current and old policies, $\hat{A}_i$ the advantage estimated from $R_{\text{all}}$, and $\beta$ a KL penalty weight.
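
A schematic, framework-free rendering of this objective for one question $Q$ with a group of $G$ sampled trajectories; the mean/std normalization inside the advantage and the sample-based KL estimate are common GRPO choices assumed here rather than taken from the text above:

```python
import math

def grpo_objective(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Group-relative clipped objective for one question with G sampled trajectories.
    logp_* are per-trajectory log-probabilities under the current, old, and reference
    policies; rewards are the terminal R_all values."""
    G = len(rewards)
    mean = sum(rewards) / G
    std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]         # group-relative A_hat_i

    clipped_term, kl_term = 0.0, 0.0
    for lp_new, lp_old, lp_ref, a in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(lp_new - lp_old)                    # importance ratio rho_i(theta)
        clipped_term += min(ratio * a, max(min(ratio, 1 + eps), 1 - eps) * a)  # f_eps term
        kl_term += lp_new - lp_ref                           # crude sample estimate of the KL term
    return clipped_term / G - beta * kl_term / G
```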

4. Experimental Results and Empirical Analysis

TKG-Thinker has been evaluated on MULTITQ and CronQuestions (large-scale TKGQA datasets covering both atomic and multi-hop/hard temporal queries), with the following principal observations (Jiang et al., 5 Feb 2026):

  • Performance: On MULTITQ, TKG-Thinker (SFT + GRPO, Qwen2.5-7B backbone) achieves Hits@1 of 0.855, surpassing the strongest prior baseline (PoK, 0.779) by +7.6 points absolute.
  • Complexity Handling: TKG-Thinker delivers up to +29.7% absolute Hits@1 improvements over prior systems on multiple-step or “Complex” queries.
  • Generalization: When evaluated out-of-domain on Timeline-KGQA variants, TKG-Thinker outperforms RTQA and retrieval-augmented generation (RAG) baselines on “Medium” and “Complex” queries.
  • Ablation Effects: Removing the SFT stage results in a 26.4% drop; removing Plan actions reduces multi-step performance by 5.9%; removing temporal retrievers collapses accuracy by nearly 40%, indicating the necessity of temporal grounding.

| Model | MULTITQ Hits@1 | CronQuestions Hits@1 | Multi-step Gain |
|---|---|---|---|
| PoK (best prior) | 0.779 | 0.863 | baseline |
| TKG-Thinker | 0.855 | 0.893 | up to +29.7% |

5. Comparison with Recursive Decomposition Approaches

The TKG-Thinker pattern can be contrasted with “recursive thinking” systems such as RTQA, which decompose the query into a tree structure, solve atomic subquestions via LLM+Retriever, and combine candidate answers via multi-path aggregation (Gong et al., 4 Sep 2025):

  • Planning Granularity: RTQA relies on a decomposition module using prompt-specialized LLM calls to build a tree of subquestions. TKG-Thinker’s planning is learned via SFT and refined through RL over discrete Plan actions and real-time tool feedback.
  • Answer Aggregation: RTQA employs multi-path aggregation for robustness against substep failures; TKG-Thinker instead curbs error propagation through explicit verification of plans and retrieved evidence within the interaction loop.
  • Training: RTQA is training-free (plug-and-play over LLMs and retrievers), while TKG-Thinker requires SFT and RL optimization but yields stronger empirical gains, especially on multi-hop/complex queries.

| Feature | RTQA | TKG-Thinker |
|---|---|---|
| Decomposition | Prompt-based, tree-structured | SFT-learned, stepwise plan |
| Aggregation | Multi-path answer fusion | Interaction + evidence check |
| Training | Zero-shot (prompt only) | SFT + RL (PPO/GRPO) |
| Temporal Retriever | Yes | Specialized tool calls |
| Adaptivity | Static tree | Dynamic multi-turn agent |

6. Significance, Limitations, and Future Directions

TKG-Thinker demonstrates the utility of casting TKGQA as an agentic, RL-optimized interaction, in which planning, evidence retrieval, and verification are explicitly coordinated. Results show substantial improvements in both in-domain and generalization settings, notably in the hardest temporal reasoning categories.

Identified limitations include constraints on horizon length (typically $<8$ steps), reward sparsity (primarily terminal exact-match and protocol indicators), and adaptation to truly open-world or unseen event types. Open problems include:

  • Richer intermediate rewards via LLM-based judges (beyond binary EM).
  • Scaling to deep or continuous-time reasoning tasks.
  • Multi-agent extensions, e.g., separating planning and retrieval agents.
  • Application to other compositional knowledge domains beyond temporal graphs.

A plausible implication is that finer-grained RL-driven stepwise reasoning with adaptive tool use may generalize to KGQA tasks in domains involving spatial, causal, or multi-attribute reasoning, provided suitable toolkits and reward definitions are supplied (Jiang et al., 5 Feb 2026, Gong et al., 4 Sep 2025).
