TKG-Thinker: Temporal KGQA Agent Architecture
- TKG-Thinker is an agentic architecture for temporal KGQA that integrates multi-turn planning, tool-augmented retrieval, and reinforcement learning.
- It employs a dual-training strategy combining supervised fine-tuning and RL to handle complex, multi-constraint temporal queries.
- Experimental results show significant Hits@1 gains and enhanced multi-hop reasoning compared to prior static prompting methods.
TKG-Thinker designates a class of agentic architectures for Temporal Knowledge Graph Question Answering (TKGQA) that leverage autonomous planning, multi-turn dynamic retrieval, and reinforcement learning to achieve state-of-the-art temporal reasoning over knowledge graphs (KGs). TKG-Thinker agents operate in environments where queries involve time-sensitive facts and require grounding in evidence stored as temporal quadruples (subject, relation, object, timestamp). By integrating explicit step-by-step planning, adaptive tool use, and policy optimization, TKG-Thinker systems address known limitations of static prompting and decomposition-based approaches, especially in handling multi-constraint temporal queries (Jiang et al., 5 Feb 2026).
1. TKGQA Challenges and Motivation
The Temporal KGQA problem is defined over a temporal KG $\mathcal{G}$, with facts represented as quadruples $(s, r, o, t)$ of subject, relation, object, and timestamp. The task is: given a natural language question $q$ with explicit or implicit temporal constraints, identify the correct response by grounding in $\mathcal{G}$. Queries of interest include multi-entity, multi-hop, and compound temporal conditions such as “Which country negotiated last with entity $e$ before time $t$?” or “Who was head of state before and after $t$?”
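To make the grounding concrete, the following minimal sketch (entities, relations, and dates are invented for illustration, not drawn from the benchmarks) shows how a “last ... before” constraint resolves against quadruples:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Quad:
    """A temporal fact (subject, relation, object, timestamp)."""
    s: str
    r: str
    o: str
    t: str  # ISO date string, so lexicographic order matches temporal order

# Toy temporal KG; contents are illustrative only.
TKG = [
    Quad("France", "negotiate_with", "Germany", "2014-03-01"),
    Quad("Italy",  "negotiate_with", "Germany", "2014-05-20"),
    Quad("Spain",  "negotiate_with", "Germany", "2014-07-11"),
]

def last_before(facts, relation, obj, cutoff):
    """Ground a 'which entity did X last with Y before t?' question."""
    hits = [f for f in facts if f.r == relation and f.o == obj and f.t < cutoff]
    return max(hits, key=lambda f: f.t) if hits else None

print(last_before(TKG, "negotiate_with", "Germany", "2014-06-01").s)  # -> Italy
```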
Prior KGQA methods—embedding-based, semantic-parsing, or standard LLM prompting—suffer key deficits:
- Hallucinations and Temporal Drift: LLM-only approaches may hallucinate or misalign answers with real KG evidence, especially when navigating complex temporal operators (“before,” “between,” “last”).
- Compositionality Breakdowns: Existing decomposition or retrieval-augmentation frameworks propagate errors across substeps, failing on compound or multi-hop constraints.
- Statically Engineered Pipelines: Prompting with fixed chains or retrieval-only pipelines affords the model limited autonomy, lacking self-correction or adaptive planning capacity.
TKG-Thinker is designed to address these deficits by modeling TKGQA as an agent-environment interaction, enabling multi-turn evidence gathering, reasoning, and self-verification.
2. Agentic Architecture: Dual-Training and Tool-Augmentation
TKG-Thinker utilizes a two-stage dual-training strategy, combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), coupled with a planner-executor action loop and an interface to temporal search tools (Jiang et al., 5 Feb 2026).
- SFT with Tool-Augmented Chain-of-Thought: The model is first trained on expert-generated trajectories in which each example alternates between planning (“Plan: …”), reasoning (“Thought: …”), tool calls (e.g., Search_time, Search_before), tool observations, and the final answer (an illustrative trajectory appears after this list).
- RL Optimization under Multi-Dimensional Rewards: After SFT, the policy is refined by interacting with the real TKG environment, receiving composite feedback on answer correctness, retrieval validity, and stepwise format.
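A hypothetical SFT example might be serialized as follows; the exact prompt format and tool-call syntax below are assumptions, since the paper fixes its own protocol:

```python
# Hypothetical serialized training trajectory; field names and tool-call
# syntax are illustrative, not the paper's exact prompt format.
sft_example = """\
Question: Which country negotiated last with Germany before June 2014?
Plan: Retrieve negotiate_with facts targeting Germany before 2014-06, then pick the latest.
Action: Search_before(relation=negotiate_with, object=Germany, time=2014-06)
Observation: [(France, 2014-03-01), (Italy, 2014-05-20)]
Thought: Italy's negotiation on 2014-05-20 is the latest one before the cutoff.
Answer: Italy
"""
```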
A typical interaction loop alternates “internal reasoning” actions (Plan, Think) with “tool calls” (e.g., Search_time, Search_before, Search_between), using retrieved evidence to refine the answer proposal. The agent observes the full interaction history and generates the next plan or retrieval action until an “Answer” action is emitted or a budget is exhausted.
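A minimal sketch of this planner-executor loop, assuming a `policy` callable that maps the history to the next (action, argument) pair and a `tools` dict wrapping the KG interfaces (both names are placeholders, not the authors' implementation):

```python
def run_episode(question, policy, tools, max_steps=8):
    """Roll out one multi-turn TKGQA episode; `policy` and `tools` are
    placeholder interfaces, and max_steps is the turn budget."""
    history = [("Question", question)]
    for _ in range(max_steps):
        action, arg = policy(history)            # next Plan/Think/tool/Answer
        if action == "Answer":
            return arg, history                  # final answer proposal
        history.append((action, arg))
        if action not in ("Plan", "Think"):      # tool call, e.g. Search_before
            history.append(("Observation", tools[action](**arg)))
    return None, history                         # budget exhausted, no answer
```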
| Component | Function | Notes |
|---|---|---|
| SFT (Stage 1) | Mimic expert trajectories with tool use | Bootstraps planning and tool protocols |
| RL (Stage 2) | Policy refinement using PPO/GRPO | Uses multi-dimensional rewards |
| Action Space | Plan, Think, Search_time, ..., Answer | Explicit alternation of planning and tools |
| Temporal Retriever | Search-specific interfaces for time-sensitive queries | E.g., Search_before, Search_between |
| Environment | Temporal KG $\mathcal{G}$ exposed as structured tool APIs | Agent observes the KG only through tool queries |
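A sketch of such temporal retrieval tools over raw quadruples is given below; the signatures are assumptions, as the paper defines its own API surface:

```python
def _match(fact, pattern):
    """fact is an (s, r, o, t) tuple; pattern may bind any of s, r, o."""
    idx = {"s": 0, "r": 1, "o": 2}
    return all(fact[idx[k]] == v for k, v in pattern.items())

def search_time(facts, **pattern):
    """Timestamps at which the given (s, r, o) pattern holds."""
    return sorted(f[3] for f in facts if _match(f, pattern))

def search_before(facts, t, **pattern):
    """Matching facts strictly before time t."""
    return [f for f in facts if f[3] < t and _match(f, pattern)]

def search_between(facts, t1, t2, **pattern):
    """Matching facts with timestamps in [t1, t2]."""
    return [f for f in facts if t1 <= f[3] <= t2 and _match(f, pattern)]

# e.g. search_before(facts, "2014-06-01", r="negotiate_with", o="Germany")
```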
3. Mathematical Formulation of Multi-Turn RL Optimization
In the RL stage, the TKGQA task is cast as a sequential decision process:
- State $s_k$: the question together with the accumulated plan, action, and observation history through turn $k$.
- Action $a_k$: chosen from {Plan, Think, Search_time, ..., Answer}.
- Rewards: At episode end (when “Answer” is emitted), three binary sub-rewards are computed:
  - $r_{\text{ans}}$: exact match of the predicted answer to the gold answer.
  - $r_{\text{ret}}$: the retrieved evidence covers the gold answer.
  - $r_{\text{fmt}}$: format validity per the interaction protocol.
A combined terminal reward is defined as
$$R = \alpha\, r_{\text{ans}} + \beta\, r_{\text{ret}} + \gamma\, r_{\text{fmt}},$$
where the parameters $\alpha$, $\beta$, $\gamma$ control the tradeoff between answer quality, evidence coverage, and protocol compliance.
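A direct transcription of this reward, with illustrative weight values (the paper's actual $\alpha$, $\beta$, $\gamma$ are not reproduced here):

```python
def terminal_reward(pred, gold, retrieved, format_ok,
                    alpha=1.0, beta=0.2, gamma=0.1):
    """R = alpha*r_ans + beta*r_ret + gamma*r_fmt; weights are illustrative."""
    r_ans = float(pred == gold)          # exact match with the gold answer
    r_ret = float(gold in retrieved)     # retrieved evidence covers the answer
    r_fmt = float(format_ok)             # trajectory followed the protocol
    return alpha * r_ans + beta * r_ret + gamma * r_fmt
```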
Policy training is performed via Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO):
$$\mathcal{J}(\theta) = \mathcal{L}^{\text{clip}}(\theta) - \beta_{\text{KL}}\, D_{\text{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big),$$
where $\mathcal{L}^{\text{clip}}(\theta)$ is the PPO clipping objective, $\hat{A}$ the advantage estimated from $R$, and $\beta_{\text{KL}}$ a KL penalty weight.
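In code, the clipped-plus-KL objective reduces to a loss of the following shape (a schematic PyTorch sketch, not the authors' training code; per-token log-probabilities are assumed precomputed, and eps and beta_kl values are illustrative):

```python
import torch

def ppo_kl_loss(logp_new, logp_old, logp_ref, advantages,
                eps=0.2, beta_kl=0.01):
    """Negated PPO clipped objective with a KL penalty toward a reference
    policy."""
    ratio = torch.exp(logp_new - logp_old)               # rho(theta)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_obj = torch.min(ratio * advantages, clipped * advantages)
    kl = logp_new - logp_ref                             # per-token KL estimate
    return -(policy_obj - beta_kl * kl).mean()
```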
4. Experimental Results and Empirical Analysis
TKG-Thinker has been evaluated on MULTITQ and CronQuestions (large-scale TKGQA datasets covering both atomic and multi-hop/hard temporal queries), with the following principal observations (Jiang et al., 5 Feb 2026):
- Performance: On MULTITQ, TKG-Thinker (Qwen2.5-7B backbone) achieves Hits@1 of 0.855, surpassing the strongest prior baseline (PoK, 0.779) by +7.6 points absolute.
- Complexity Handling: TKG-Thinker delivers up to +29.7% absolute Hits@1 improvements over prior systems on multiple-step or “Complex” queries.
- Generalization: When evaluated out-of-domain on Timeline-KGQA variants, TKG-Thinker outperforms RTQA and retrieval-augmented generation (RAG) baselines on “Medium” and “Complex” queries.
- Ablation Effects: Removing the SFT stage results in a 26.4% drop; removing Plan actions reduces multi-step performance by 5.9%; removing temporal retrievers collapses accuracy by nearly 40%, indicating the necessity of temporal grounding.
| Model | MULTITQ Hits@1 | CronQuestions Hits@1 | Multi-step Gain |
|---|---|---|---|
| PoK (best prior) | 0.779 | 0.863 | baseline |
| TKG-Thinker | 0.855 | 0.893 | up to +29.7% |
5. Comparison with Recursive Decomposition Approaches
The TKG-Thinker pattern can be contrasted with “recursive thinking” systems such as RTQA, which decompose the query into a tree structure, solve atomic subquestions via LLM+Retriever, and combine candidate answers via multi-path aggregation (Gong et al., 4 Sep 2025):
- Planning Granularity: RTQA relies on a decomposition module using prompt-specialized LLM calls to build a tree of subquestions. TKG-Thinker’s planning is learned via SFT and refined through RL over discrete Plan actions and real-time tool feedback.
- Answer Aggregation: RTQA employs multi-path aggregation for robustness against substep failures; TKG-Thinker instead reduces error propagation through explicit verification of plans and retrieved evidence within the interaction loop.
- Training: RTQA is training-free (plug-and-play over LLMs and retrievers), while TKG-Thinker requires SFT and RL optimization but yields stronger empirical gains, especially on multi-hop/complex queries.
| Feature | RTQA | TKG-Thinker |
|---|---|---|
| Decomposition | Prompt-based, tree-structured | SFT-learned, stepwise plan |
| Aggregation | Multi-path answer fusion | Interaction + evidence check |
| Training | Zero-shot (prompt only) | SFT + RL (PPO/GRPO) |
| Temporal Retriever | Yes | Specialized tool calls |
| Adaptivity | Static tree | Dynamic multi-turn agent |
6. Significance, Limitations, and Future Directions
TKG-Thinker demonstrates the utility of casting TKGQA as an agentic, RL-optimized interaction, in which planning, evidence retrieval, and verification are explicitly coordinated. Results show substantial improvements in both in-domain and generalization settings, notably in the hardest temporal reasoning categories.
Identified limitations include constraints on horizon length (episodes are capped at a fixed budget of interaction steps), reward sparsity (primarily terminal exact-match and protocol indicators), and adaptation to truly open-world or unseen event types. Open problems include:
- Richer intermediate rewards via LLM-based judges (beyond binary EM).
- Scaling to deep or continuous-time reasoning tasks.
- Multi-agent extensions, e.g., separating planning and retrieval agents.
- Application to other compositional knowledge domains beyond temporal graphs.
A plausible implication is that finer-grained RL-driven stepwise reasoning with adaptive tool use may generalize to KGQA tasks in domains involving spatial, causal, or multi-attribute reasoning, provided suitable toolkits and reward definitions are supplied (Jiang et al., 5 Feb 2026, Gong et al., 4 Sep 2025).