
Transformer Hawkes Process (THP)

Updated 11 November 2025
  • THP is a neural temporal point process model that leverages Transformer self-attention to parameterize Hawkes process intensities, surpassing traditional RNN-based methods.
  • The model uses absolute sinusoidal time encoding, though its sensitivity to timestamp shifts has led to RoTHP, which incorporates rotary positional embeddings for translation invariance.
  • Empirical evaluations across finance, healthcare, and social networks show that THP achieves superior predictive performance and scalability in modeling asynchronous event sequences.

A Transformer Hawkes Process (THP) is a neural temporal point process model that leverages Transformer self-attention to parameterize the conditional intensity of Hawkes processes for asynchronous event sequence modeling. THP and its variants, including RoTHP, have been developed to capture complex temporal and inter-event dependencies in domains such as finance, healthcare, and social networks, surpassing traditional RNN-based temporal point processes in both predictive performance and scalability.

1. Mathematical Foundation: Marked Hawkes Processes

A temporal point process on a time interval $(0, T]$ describes a stochastic sequence of event times with possible type labels ("marks"). The defining quantity is the conditional intensity function:

$$\lambda^*(t)\,dt = \Pr\{\text{event in } [t, t+dt) \mid \mathcal{H}_t\} = \mathbb{E}[dN(t) \mid \mathcal{H}_t],$$

where $\mathcal{H}_t = \{t_i : t_i < t\}$ is the event history. In a marked (multivariate) Hawkes process, the type-specific intensity for event type $u$ is

$$\lambda^*_u(t) = \mu_u + \sum_{t_i < t} \phi_{u, k_i}(t - t_i),$$

with $\mu_u$ the exogenous base rate and $\phi_{u,v}(\cdot)$ the mutual/self-excitation kernel from previous events of type $v$ to the current type $u$. The log-likelihood over a sequence $\{(t_i, k_i)\}_{i=1}^n$ is

$$\mathcal{L} = \sum_{i=1}^n \log \lambda^*_{k_i}(t_i) - \int_0^T \sum_u \lambda^*_u(t)\,dt.$$
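
As a concrete illustration (not taken from the THP papers), the following minimal NumPy sketch evaluates the type-specific intensity and the log-likelihood above for exponential excitation kernels $\phi_{u,v}(\Delta) = \alpha_{u,v}\,\beta\, e^{-\beta \Delta}$; all names and values are illustrative.

```python
import numpy as np

def intensity(t, times, marks, mu, alpha, beta):
    """Type-specific intensities lambda_u(t) of a marked Hawkes process
    with exponential kernels phi_{u,v}(dt) = alpha[u, v] * beta * exp(-beta * dt)."""
    past = times < t
    decay = beta * np.exp(-beta * (t - times[past]))   # kernel values, shape (n_past,)
    return mu + alpha[:, marks[past]] @ decay          # shape (U,)

def log_likelihood(times, marks, T, mu, alpha, beta):
    """sum_i log lambda_{k_i}(t_i) - int_0^T sum_u lambda_u(t) dt (closed-form integral)."""
    ll = sum(np.log(intensity(t, times, marks, mu, alpha, beta)[k])
             for t, k in zip(times, marks))
    # For exponential kernels the compensator integral has a closed form.
    compensator = mu.sum() * T
    for t, k in zip(times, marks):
        compensator += alpha[:, k].sum() * (1.0 - np.exp(-beta * (T - t)))
    return ll - compensator

# Toy example: 2 event types on (0, 10]
rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 10, size=8))
marks = rng.integers(0, 2, size=8)
mu = np.array([0.2, 0.1])
alpha = np.array([[0.3, 0.1], [0.2, 0.4]])
print(log_likelihood(times, marks, T=10.0, mu=mu, alpha=alpha, beta=1.0))
```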

2. Transformer Hawkes Process: Architecture and Parameterization

THP (Zuo et al., 2020) replaces the recurrent backbone of classical neural Hawkes processes with a multi-layered Transformer encoder. Its input representation for each event $(t_i, k_i)$ is

  • An event type embedding $\mathbf{e}_{k_i} \in \mathbb{R}^D$,
  • An absolute sinusoidal time encoding:

$$[\mathbf{x}(t_i)]_{2j-1} = \cos\!\left(\frac{t_i}{10000^{(2j-2)/D}}\right), \qquad [\mathbf{x}(t_i)]_{2j} = \sin\!\left(\frac{t_i}{10000^{(2j-1)/D}}\right).$$

The sum $\mathbf{y}_i = \mathbf{e}_{k_i} + \mathbf{x}(t_i)$ forms the input sequence. In each Transformer layer, standard masked self-attention is performed:

$$Q = YW^Q, \qquad K = YW^K, \qquad V = YW^V,$$

$$A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}, \qquad \text{Attn}(Q, K, V) = \text{Softmax}(A)\,V,$$

yielding final hidden representations $\mathbf{h}(t_i)$.
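
A minimal PyTorch sketch of this encoder input and a single masked attention step is given below. It is an illustration rather than the reference implementation: it uses one shared frequency schedule rather than the exact odd/even exponents above, and the module and function names are assumptions.

```python
import torch
import torch.nn as nn

class THPEmbedding(nn.Module):
    """Event embedding y_i = e_{k_i} + x(t_i) with a sinusoidal time encoding."""
    def __init__(self, num_types, d_model):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, d_model)
        j = torch.arange(0, d_model, 2, dtype=torch.float32)   # assumes even d_model
        self.register_buffer("inv_freq", 1.0 / 10000 ** (j / d_model))

    def forward(self, times, types):                  # times: (B, L) floats, types: (B, L) ints
        phase = times.unsqueeze(-1) * self.inv_freq   # (B, L, d_model/2)
        x = torch.zeros(*times.shape, self.type_emb.embedding_dim, device=times.device)
        x[..., 0::2] = torch.cos(phase)
        x[..., 1::2] = torch.sin(phase)
        return self.type_emb(types) + x               # y_i = e_{k_i} + x(t_i)

def masked_self_attention(y, w_q, w_k, w_v):
    """Single-head causal self-attention over event embeddings y: (B, L, D)."""
    q, k, v = y @ w_q, y @ w_k, y @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5          # (B, L, L)
    causal = torch.tril(torch.ones(y.shape[1], y.shape[1],
                                   dtype=torch.bool, device=y.device))
    scores = scores.masked_fill(~causal, float("-inf"))            # future events masked out
    return torch.softmax(scores, dim=-1) @ v                       # h(t_i) for each event
```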

The conditional intensity for type $u$ between events $t_j$ and $t_{j+1}$ is

$$\lambda_u(t) = f\!\left( \alpha_u\,(t - t_j) + \mathbf{w}_u^\top \mathbf{h}(t_j) + b_u \right),$$

where $f$ is typically softplus or ReLU to ensure non-negativity.
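
Continuing the sketch above (the function name and tensor layout are assumptions, not the authors' code), the per-type intensity head can be written as:

```python
import torch
import torch.nn.functional as F

def thp_intensity(h_prev, t, t_prev, alpha, w, b):
    """lambda_u(t) = softplus(alpha_u * (t - t_prev) + w_u . h(t_prev) + b_u).

    h_prev : (B, D) hidden state of the most recent event before t
    t, t_prev : (B,) query time and previous event time
    alpha, b : (U,) per-type scalars; w : (U, D) per-type weights
    Returns (B, U) non-negative intensities, one per event type.
    """
    drift = alpha * (t - t_prev).unsqueeze(-1)   # (B, U) linear time term
    return F.softplus(drift + h_prev @ w.T + b)
```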

THP is optimized by maximizing log-likelihood, with the integral term typically approximated via Monte Carlo or trapezoidal numerical integration. Training employs stochastic gradient descent (Adam), leveraging the parallelization and scalability of Transformer attention.
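
A hedged sketch of the Monte Carlo approximation of the integral term (uniform samples inside each inter-event interval, using the intensity head above; tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def mc_compensator(hidden, times, alpha, w, b, n_samples=20):
    """Monte Carlo estimate of the integral term int sum_u lambda_u(t) dt,
    taken over (t_1, t_L] by sampling uniformly within each inter-event interval.

    hidden : (B, L, D) hidden states h(t_i); times : (B, L) event timestamps.
    """
    dt = times[:, 1:] - times[:, :-1]                          # (B, L-1) interval lengths
    u = torch.rand(*dt.shape, n_samples, device=times.device)  # (B, L-1, S) uniforms
    taus = dt.unsqueeze(-1) * u                                 # offsets into each interval
    drift = alpha.view(1, 1, 1, -1) * taus.unsqueeze(-1)        # (B, L-1, S, U)
    base = hidden[:, :-1] @ w.T + b                             # (B, L-1, U) history term
    lam = F.softplus(drift + base.unsqueeze(2))                 # sampled intensities
    # Average over samples, sum over types, weight by interval lengths.
    return (lam.mean(dim=2).sum(dim=-1) * dt).sum(dim=-1)       # (B,)
```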

THP empirically shows superior log-likelihood and prediction accuracy versus RNN-based (RMTPP, Neural Hawkes) and alternative self-attentive models (SAHP) across datasets including financial transactions, healthcare (MIMIC-II), social (Retweets, StackOverflow), and structured settings (911 calls, earthquakes).

3. Sequence Prediction and Translation Sensitivity: THP Limitations

THP's use of absolute sinusoidal embeddings introduces sensitivity to timestamp translation and noise. Specifically:

  • The Hawkes log-likelihood depends only on time differences $t_i - t_j$, but $\mathbf{x}(t_i)$ is not translation invariant: a global shift $t_i \mapsto t_i + \sigma$ produces embedding values unseen during training (a numerical illustration follows this list).
  • In sequence prediction, e.g., predicting the future suffix after training on a prefix, absolute-time encodings for test-set timestamps can fall outside the training domain, causing generalization failures.
  • Empirical studies show that global time shifts or additive timestamp jitter degrade model performance (up to 0.1 nats of likelihood loss under shift, and roughly 0.9 / 0.10 likelihood / RMSE degradation under noise for THP).
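
The following short NumPy check (not from the papers; the encoding dimension is chosen for illustration) makes the first point concrete: a global shift leaves the inter-event intervals untouched but changes the absolute sinusoidal encodings substantially.

```python
import numpy as np

def sinusoidal_encoding(t, d=8):
    """Absolute sinusoidal time encoding x(t) as in Section 2 (illustrative dimension d)."""
    j = np.arange(1, d // 2 + 1)
    x = np.empty(d)
    x[0::2] = np.cos(t / 10000 ** ((2 * j - 2) / d))
    x[1::2] = np.sin(t / 10000 ** ((2 * j - 1) / d))
    return x

times = np.array([1.0, 2.5, 4.0])
shift = 500.0                                     # global timestamp translation

print(np.diff(times), np.diff(times + shift))     # intervals are identical
print(np.linalg.norm(sinusoidal_encoding(times[0] + shift)
                     - sinusoidal_encoding(times[0])))   # encodings change substantially
```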

4. Relative Rotary Position Embedding: RoTHP Resolution and Mechanism

RoTHP (Gao et al., 11 May 2024) addresses these limitations by introducing a rotary temporal positional encoding (RoTPE) adapted from RoPE (Su et al., 2024). The key algorithmic innovations are:

  • For each timestamp $t$, form a block-diagonal rotation matrix:

$$R_t = \operatorname{blockdiag}\!\left( \begin{pmatrix} \cos(\theta_j t) & \sin(\theta_j t) \\ -\sin(\theta_j t) & \cos(\theta_j t) \end{pmatrix} \right)_{j=1}^{d/2},$$

where $\theta_j = 10000^{-2(j-1)/d}$.

  • Self-attention is performed after applying rotary shifts:

$$q_i \leftarrow R_{t_i} q_i, \qquad k_j \leftarrow R_{t_j} k_j,$$

so that

$$(R_{t_i} q_i)^\top (R_{t_j} k_j) = q_i^\top R_{t_j - t_i}\, k_j,$$

making the attention score a pure function of the time difference rather than of the absolute timestamps.

  • All attention is "relative in time", achieving translation invariance.

The rest of the Transformer and intensity parameterizations follow THP, but the overall RoTHP computation is invariant under timestamp shift and robust to local noise.
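
The NumPy sketch below (illustrative dimensions and names, not the authors' implementation) constructs $R_t$ as defined above and numerically checks both the relative-time identity and its invariance under a global timestamp shift.

```python
import numpy as np

def rotation_matrix(t, d=8):
    """Block-diagonal R_t of 2x2 rotations with angles theta_j * t,
    theta_j = 10000 ** (-2 * (j - 1) / d) for j = 1, ..., d/2."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = 10000.0 ** (-2.0 * j / d)
        c, s = np.cos(theta * t), np.sin(theta * t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return R

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
ti, tj = 3.7, 9.2

lhs = (rotation_matrix(ti) @ q) @ (rotation_matrix(tj) @ k)   # rotated-query/key score
rhs = q @ rotation_matrix(tj - ti) @ k                        # relative-time form
print(np.isclose(lhs, rhs))                                   # True

sigma = 1000.0                                                # global timestamp shift
shifted = (rotation_matrix(ti + sigma) @ q) @ (rotation_matrix(tj + sigma) @ k)
print(np.isclose(lhs, shifted))                               # True: score unchanged
```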

5. Theoretical Properties: Translation Invariance and Sequence Flexibility

RoTHP formalizes the following invariance properties:

  • Translation invariance: For any global shift $\sigma$, the total loss satisfies $\mathcal{L}(\mathcal{S}) = \mathcal{L}(\mathcal{S}_\sigma)$.
  • Sequence-prediction flexibility: Because the likelihood and attention depend only on relative intervals, RoTHP can be trained on offset-normalized prefixes and deployed on arbitrary successor segments without retraining or mismatch.

These properties are provably lacking in vanilla THP.
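
A brief sketch of why the first property holds, assuming (as in Section 2) that the intensity depends on the current time only through $t - t_j$ and on history only through rotary attention scores:

$$\mathcal{L}(\mathcal{S}_\sigma) = \sum_{i=1}^n \log \lambda^*_{k_i}(t_i + \sigma) - \int_\sigma^{T+\sigma} \sum_u \lambda^*_u(t)\,dt = \sum_{i=1}^n \log \lambda^*_{k_i}(t_i) - \int_0^T \sum_u \lambda^*_u(s)\,ds = \mathcal{L}(\mathcal{S}),$$

since every attention score, and hence every hidden state $\mathbf{h}(t_i)$, is unchanged when all timestamps shift by $\sigma$, and the substitution $s = t - \sigma$ maps the shifted integration window back to $(0, T]$.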

6. Empirical Evaluation and Comparative Performance

RoTHP's robustness and performance are validated on diverse real and synthetic datasets:

  • Datasets: synthetic Hawkes sequences (length 20–100, 5 event types), financial transactions (Buy/Sell marks; average length ~2000), MIMIC-II (ICU visits; length ≤ 50), StackOverflow (22 types; average length 72), MemeTrack (42K memes), Retweet cascades.
  • Baselines: RMTPP, Neural Hawkes, SAHP, THP.
  • Metrics: Log-likelihood, event-type accuracy, time RMSE.

Results on the Financial / StackOverflow / Synthetic datasets (each cell lists Financial / StackOverflow / Synthetic values; "–" marks a value not given):

| Model | Log-likelihood | Accuracy | RMSE |
|---|---|---|---|
| RMTPP | –3.89 / –2.60 / – | 61.95 / 45.9 / – | 1.56 / 9.78 / – |
| Neural Hawkes | –3.60 / –2.55 / – | 62.20 / 46.3 / – | 1.56 / 9.83 / – |
| SAHP | – / –1.86 / 0.59 | – / – / 38.13 | – / – / 5.57 |
| THP | –1.11 / –0.039 / – | 62.23 / 46.4 / – | 0.93 / 4.99 / – |
| RoTHP | +1.076 / 0.389 / 1.01 | 62.26 / 46.33 / 38.13 | 0.60 / 1.33 / 2.29 |
  • Under timestamp shift, RoTHP log-likelihood remains unchanged, unlike THP.
  • Under jitter, RoTHP sees only minor degradation (likelihood/RMSE worsened by ~0.6/0.05) versus THP's more substantial drop.
  • In sequence-prediction regimes (train on a prefix, test on the suffix), RoTHP substantially outperforms THP, with roughly 2 nats higher log-likelihood, lower RMSE, and higher classification accuracy.

7. Significance, Practical Impact, and Future Directions

RoTHP provides the following advantages:

  • Explicit translation invariance and robustness to timestamp noise;
  • Improved generalization for sequence prediction and extrapolation across event horizons;
  • Consistently state-of-the-art empirical results on both synthetic and real datasets;
  • Directly addresses shortcomings of absolute positional encoding pervasive in Transformer point process models.

A plausible implication is that future temporal point process architectures should favor relative-time or translation-invariant encodings, especially for sequence analysis where extrapolation, noise, or time re-indexing is inherent.

RoTHP offers a blueprint for deploying Hawkes-process–based models in practical machine learning pipelines for financial, medical, social, and web-scale event data, ensuring stable predictive power and reliability across variable timestamp domains. Its empirical superiority and theoretical guarantees under global transformations suggest its adoption as a default module in sequence modeling tasks requiring time-shift invariance.
