
Transformer Hawkes Process (THP)

Updated 11 November 2025
  • THP is a neural temporal point process model that leverages Transformer self-attention to parameterize Hawkes process intensities, surpassing traditional RNN-based methods.
  • The model uses absolute sinusoidal time encoding, though its sensitivity to timestamp shifts has led to RoTHP, which incorporates rotary positional embeddings for translation invariance.
  • Empirical evaluations across finance, healthcare, and social networks show that THP achieves superior predictive performance and scalability in modeling asynchronous event sequences.

A Transformer Hawkes Process (THP) is a neural temporal point process model that leverages Transformer self-attention to parameterize the conditional intensity of Hawkes processes for asynchronous event sequence modeling. THP and its variants, including RoTHP, have been developed to capture complex temporal and inter-event dependencies in domains such as finance, healthcare, and social networks, surpassing traditional RNN-based temporal point processes in both predictive performance and scalability.

1. Mathematical Foundation: Marked Hawkes Processes

A temporal point process on a time interval $(0, T]$ describes a stochastic sequence of event times with possible type labels ("marks"). The defining quantity is the conditional intensity function:

$$\lambda^*(t)\,dt = \Pr\{\text{event in } [t, t+dt) \mid \mathcal{H}_t\} = \mathbb{E}[dN(t) \mid \mathcal{H}_t],$$

where $\mathcal{H}_t = \{t_i : t_i < t\}$ is the event history. In a marked (multivariate) Hawkes process, the type-specific intensity for event type $u$ is

$$\lambda^*_u(t) = \mu_u + \sum_{t_i < t} \phi_{u, k_i}(t - t_i),$$

with $\mu_u$ the exogenous base rate and $\phi_{u,v}(\cdot)$ the mutual/self-excitation kernel from previous events of type $v$ to the current type $u$. The log-likelihood over a sequence $\{(t_i, k_i)\}_{i=1}^n$ is

$$\mathcal{L} = \sum_{i=1}^n \log \lambda^*_{k_i}(t_i) - \int_0^T \sum_u \lambda^*_u(t)\,dt.$$
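
As a concrete illustration (not taken from the THP papers), the following minimal NumPy sketch evaluates the type-specific intensity and the log-likelihood above for exponential excitation kernels $\phi_{u,v}(\Delta) = \alpha_{u,v}\,\beta\, e^{-\beta \Delta}$; all names and values are illustrative.

```python
import numpy as np

def intensity(t, times, marks, mu, alpha, beta):
    """Type-specific intensities lambda_u(t) of a marked Hawkes process
    with exponential kernels phi_{u,v}(dt) = alpha[u, v] * beta * exp(-beta * dt)."""
    past = times < t
    decay = beta * np.exp(-beta * (t - times[past]))   # kernel values, shape (n_past,)
    return mu + alpha[:, marks[past]] @ decay          # shape (U,)

def log_likelihood(times, marks, T, mu, alpha, beta):
    """sum_i log lambda_{k_i}(t_i) - int_0^T sum_u lambda_u(t) dt (closed-form integral)."""
    ll = sum(np.log(intensity(t, times, marks, mu, alpha, beta)[k])
             for t, k in zip(times, marks))
    # For exponential kernels the compensator integral has a closed form.
    compensator = mu.sum() * T
    for t, k in zip(times, marks):
        compensator += alpha[:, k].sum() * (1.0 - np.exp(-beta * (T - t)))
    return ll - compensator

# Toy example: 2 event types on (0, 10]
rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 10, size=8))
marks = rng.integers(0, 2, size=8)
mu = np.array([0.2, 0.1])
alpha = np.array([[0.3, 0.1], [0.2, 0.4]])
print(log_likelihood(times, marks, T=10.0, mu=mu, alpha=alpha, beta=1.0))
```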

2. Transformer Hawkes Process: Architecture and Parameterization

THP (Zuo et al., 2020) replaces the recurrent backbone of classical neural Hawkes processes with a multi-layered Transformer encoder. Its input representation for each event $(t_i, k_i)$ is

  • An event type embedding $\mathbf{e}_{k_i} \in \mathbb{R}^D$,
  • An absolute sinusoidal time encoding:

$$[\mathbf{x}(t_i)]_{2j-1} = \cos\!\left(\frac{t_i}{10000^{(2j-2)/D}}\right), \qquad [\mathbf{x}(t_i)]_{2j} = \sin\!\left(\frac{t_i}{10000^{(2j-1)/D}}\right).$$

The sum $\mathbf{y}_i = \mathbf{e}_{k_i} + \mathbf{x}(t_i)$ forms the input sequence. In each Transformer layer, standard masked self-attention is performed:

$$Q = YW^Q, \qquad K = YW^K, \qquad V = YW^V,$$

$$A_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}, \qquad \text{Attn}(Q, K, V) = \text{Softmax}(A)\,V,$$

yielding final hidden representations $\mathbf{h}(t_i)$.
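
A minimal PyTorch sketch of this encoder input and a single masked attention step is given below. It is an illustration rather than the reference implementation: it uses one shared frequency schedule rather than the exact odd/even exponents above, and the module and function names are assumptions.

```python
import torch
import torch.nn as nn

class THPEmbedding(nn.Module):
    """Event embedding y_i = e_{k_i} + x(t_i) with a sinusoidal time encoding."""
    def __init__(self, num_types, d_model):
        super().__init__()
        self.type_emb = nn.Embedding(num_types, d_model)
        j = torch.arange(0, d_model, 2, dtype=torch.float32)   # assumes even d_model
        self.register_buffer("inv_freq", 1.0 / 10000 ** (j / d_model))

    def forward(self, times, types):                  # times: (B, L) floats, types: (B, L) ints
        phase = times.unsqueeze(-1) * self.inv_freq   # (B, L, d_model/2)
        x = torch.zeros(*times.shape, self.type_emb.embedding_dim, device=times.device)
        x[..., 0::2] = torch.cos(phase)
        x[..., 1::2] = torch.sin(phase)
        return self.type_emb(types) + x               # y_i = e_{k_i} + x(t_i)

def masked_self_attention(y, w_q, w_k, w_v):
    """Single-head causal self-attention over event embeddings y: (B, L, D)."""
    q, k, v = y @ w_q, y @ w_k, y @ w_v
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5          # (B, L, L)
    causal = torch.tril(torch.ones(y.shape[1], y.shape[1],
                                   dtype=torch.bool, device=y.device))
    scores = scores.masked_fill(~causal, float("-inf"))            # future events masked out
    return torch.softmax(scores, dim=-1) @ v                       # h(t_i) for each event
```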

The conditional intensity for type $u$ between events $t_j$ and $t_{j+1}$ is

$$\lambda_u(t) = f\!\left( \alpha_u\,(t - t_j) + \mathbf{w}_u^\top \mathbf{h}(t_j) + b_u \right),$$

where $f$ is typically softplus or ReLU to ensure non-negativity.
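
Continuing the sketch above (the function name and tensor layout are assumptions, not the authors' code), the per-type intensity head can be written as:

```python
import torch
import torch.nn.functional as F

def thp_intensity(h_prev, t, t_prev, alpha, w, b):
    """lambda_u(t) = softplus(alpha_u * (t - t_prev) + w_u . h(t_prev) + b_u).

    h_prev : (B, D) hidden state of the most recent event before t
    t, t_prev : (B,) query time and previous event time
    alpha, b : (U,) per-type scalars; w : (U, D) per-type weights
    Returns (B, U) non-negative intensities, one per event type.
    """
    drift = alpha * (t - t_prev).unsqueeze(-1)   # (B, U) linear time term
    return F.softplus(drift + h_prev @ w.T + b)
```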

THP is optimized by maximizing log-likelihood, with the integral term typically approximated via Monte Carlo or trapezoidal numerical integration. Training employs stochastic gradient descent (Adam), leveraging the parallelization and scalability of Transformer attention.
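
A hedged sketch of the Monte Carlo approximation of the integral term (uniform samples inside each inter-event interval, using the intensity head above; tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

def mc_compensator(hidden, times, alpha, w, b, n_samples=20):
    """Monte Carlo estimate of the integral term int sum_u lambda_u(t) dt,
    taken over (t_1, t_L] by sampling uniformly within each inter-event interval.

    hidden : (B, L, D) hidden states h(t_i); times : (B, L) event timestamps.
    """
    dt = times[:, 1:] - times[:, :-1]                          # (B, L-1) interval lengths
    u = torch.rand(*dt.shape, n_samples, device=times.device)  # (B, L-1, S) uniforms
    taus = dt.unsqueeze(-1) * u                                 # offsets into each interval
    drift = alpha.view(1, 1, 1, -1) * taus.unsqueeze(-1)        # (B, L-1, S, U)
    base = hidden[:, :-1] @ w.T + b                             # (B, L-1, U) history term
    lam = F.softplus(drift + base.unsqueeze(2))                 # sampled intensities
    # Average over samples, sum over types, weight by interval lengths.
    return (lam.mean(dim=2).sum(dim=-1) * dt).sum(dim=-1)       # (B,)
```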

THP empirically shows superior log-likelihood and prediction accuracy versus RNN-based (RMTPP, Neural Hawkes) and alternative self-attentive models (SAHP) across datasets including financial transactions, healthcare (MIMIC-II), social (Retweets, StackOverflow), and structured settings (911 calls, earthquakes).

3. Sequence Prediction and Translation Sensitivity: THP Limitations

THP's use of absolute sinusoidal embeddings introduces sensitivity to timestamp translation and noise. Specifically:

  • The Hawkes log-likelihood depends only on time differences $t_i - t_j$, but $\mathbf{x}(t_i)$ is not translation invariant: a global shift $t_i \mapsto t_i + \sigma$ produces embedding values unseen during training (a numerical illustration follows this list).
  • In sequence prediction, e.g., predicting the future suffix after training on a prefix, absolute-time encodings for test-set timestamps can fall outside the training domain, causing generalization failures.
  • Empirical studies show that global time shifts or additive timestamp jitter degrade model performance (up to 0.1 nats of likelihood loss under shift, and roughly 0.9 / 0.10 likelihood / RMSE degradation under noise for THP).
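
The following short NumPy check (not from the papers; the encoding dimension is chosen for illustration) makes the first point concrete: a global shift leaves the inter-event intervals untouched but changes the absolute sinusoidal encodings substantially.

```python
import numpy as np

def sinusoidal_encoding(t, d=8):
    """Absolute sinusoidal time encoding x(t) as in Section 2 (illustrative dimension d)."""
    j = np.arange(1, d // 2 + 1)
    x = np.empty(d)
    x[0::2] = np.cos(t / 10000 ** ((2 * j - 2) / d))
    x[1::2] = np.sin(t / 10000 ** ((2 * j - 1) / d))
    return x

times = np.array([1.0, 2.5, 4.0])
shift = 500.0                                     # global timestamp translation

print(np.diff(times), np.diff(times + shift))     # intervals are identical
print(np.linalg.norm(sinusoidal_encoding(times[0] + shift)
                     - sinusoidal_encoding(times[0])))   # encodings change substantially
```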

4. Relative Rotary Position Embedding: RoTHP Resolution and Mechanism

RoTHP (Gao et al., 11 May 2024) addresses these limitations by introducing a rotary temporal positional encoding (RoTPE) adapted from RoPE (Su et al., 2024). The key algorithmic innovations are:

  • For each timestamp $t$, form a block-diagonal rotation matrix:

$$R_t = \operatorname{blockdiag}\!\left( \begin{pmatrix} \cos(\theta_j t) & \sin(\theta_j t) \\ -\sin(\theta_j t) & \cos(\theta_j t) \end{pmatrix} \right)_{j=1}^{d/2},$$

where $\theta_j = 10000^{-2(j-1)/d}$.

  • Self-attention is performed after applying rotary shifts:

$$q_i \leftarrow R_{t_i} q_i, \qquad k_j \leftarrow R_{t_j} k_j,$$

so that

$$(R_{t_i} q_i)^\top (R_{t_j} k_j) = q_i^\top R_{t_j - t_i}\, k_j,$$

making the attention score a pure function of the time difference rather than of the absolute timestamps.

  • All attention is "relative in time", achieving translation invariance.

The rest of the Transformer and intensity parameterizations follow THP, but the overall RoTHP computation is invariant under timestamp shift and robust to local noise.
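
The NumPy sketch below (illustrative dimensions and names, not the authors' implementation) constructs $R_t$ as defined above and numerically checks both the relative-time identity and its invariance under a global timestamp shift.

```python
import numpy as np

def rotation_matrix(t, d=8):
    """Block-diagonal R_t of 2x2 rotations with angles theta_j * t,
    theta_j = 10000 ** (-2 * (j - 1) / d) for j = 1, ..., d/2."""
    R = np.zeros((d, d))
    for j in range(d // 2):
        theta = 10000.0 ** (-2.0 * j / d)
        c, s = np.cos(theta * t), np.sin(theta * t)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, s], [-s, c]]
    return R

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
ti, tj = 3.7, 9.2

lhs = (rotation_matrix(ti) @ q) @ (rotation_matrix(tj) @ k)   # rotated-query/key score
rhs = q @ rotation_matrix(tj - ti) @ k                        # relative-time form
print(np.isclose(lhs, rhs))                                   # True

sigma = 1000.0                                                # global timestamp shift
shifted = (rotation_matrix(ti + sigma) @ q) @ (rotation_matrix(tj + sigma) @ k)
print(np.isclose(lhs, shifted))                               # True: score unchanged
```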

5. Theoretical Properties: Translation Invariance and Sequence Flexibility

RoTHP formalizes the following invariance properties:

  • Translation invariance: For any global shift $\sigma$, the total loss satisfies $\mathcal{L}(\mathcal{S}) = \mathcal{L}(\mathcal{S}_\sigma)$.
  • Sequence-prediction flexibility: Because the likelihood and attention depend only on relative intervals, RoTHP can be trained on offset-normalized prefixes and deployed on arbitrary successor segments without retraining or mismatch.

These properties are provably lacking in vanilla THP.
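
A brief sketch of why the first property holds, assuming (as in Section 2) that the intensity depends on the current time only through $t - t_j$ and on history only through rotary attention scores:

$$\mathcal{L}(\mathcal{S}_\sigma) = \sum_{i=1}^n \log \lambda^*_{k_i}(t_i + \sigma) - \int_\sigma^{T+\sigma} \sum_u \lambda^*_u(t)\,dt = \sum_{i=1}^n \log \lambda^*_{k_i}(t_i) - \int_0^T \sum_u \lambda^*_u(s)\,ds = \mathcal{L}(\mathcal{S}),$$

since every attention score, and hence every hidden state $\mathbf{h}(t_i)$, is unchanged when all timestamps shift by $\sigma$, and the substitution $s = t - \sigma$ maps the shifted integration window back to $(0, T]$.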

6. Empirical Evaluation and Comparative Performance

RoTHP's robustness and performance are validated on diverse real and synthetic datasets:

  • Datasets: synthetic Hawkes sequences (length 20–100, 5 event types), financial transactions (Buy/Sell marks; average length ~2000), MIMIC-II (ICU visits; length ≤ 50), StackOverflow (22 types; average length 72), MemeTrack (42K memes), Retweet cascades.
  • Baselines: RMTPP, Neural Hawkes, SAHP, THP.
  • Metrics: Log-likelihood, event-type accuracy, time RMSE.

Results on the Financial / StackOverflow / Synthetic datasets (each cell lists Financial / StackOverflow / Synthetic values; "–" marks a value not given):

| Model | Log-likelihood | Accuracy | RMSE |
|---|---|---|---|
| RMTPP | –3.89 / –2.60 / – | 61.95 / 45.9 / – | 1.56 / 9.78 / – |
| Neural Hawkes | –3.60 / –2.55 / – | 62.20 / 46.3 / – | 1.56 / 9.83 / – |
| SAHP | – / –1.86 / 0.59 | – / – / 38.13 | – / – / 5.57 |
| THP | –1.11 / –0.039 / – | 62.23 / 46.4 / – | 0.93 / 4.99 / – |
| RoTHP | +1.076 / 0.389 / 1.01 | 62.26 / 46.33 / 38.13 | 0.60 / 1.33 / 2.29 |
  • Under timestamp shift, RoTHP log-likelihood remains unchanged, unlike THP.
  • Under jitter, RoTHP sees only minor degradation (likelihood/RMSE worsened by ~0.6/0.05) versus THP's more substantial drop.
  • In sequence-prediction regimes (train on a prefix, test on the suffix), RoTHP substantially outperforms THP, with roughly 2 nats higher log-likelihood, lower RMSE, and higher classification accuracy.

7. Significance, Practical Impact, and Future Directions

RoTHP provides the following advantages:

  • Explicit translation invariance and robustness to timestamp noise;
  • Improved generalization for sequence prediction and extrapolation across event horizons;
  • Consistently state-of-the-art empirical results on both synthetic and real datasets;
  • Directly addresses shortcomings of absolute positional encoding pervasive in Transformer point process models.

A plausible implication is that future temporal point process architectures should favor relative-time or translation-invariant encodings, especially for sequence analysis where extrapolation, noise, or time re-indexing is inherent.

RoTHP offers a blueprint for deploying Hawkes-process–based models in practical machine learning pipelines for financial, medical, social, and web-scale event data, ensuring stable predictive power and reliability across variable timestamp domains. Its empirical superiority and theoretical guarantees under global transformations suggest its adoption as a default module in sequence modeling tasks requiring time-shift invariance.
