Transformer Hawkes Process (THP)
- THP is a neural temporal point process model that leverages Transformer self-attention to parameterize Hawkes process intensities, surpassing traditional RNN-based methods.
- The model uses absolute sinusoidal time encoding, though its sensitivity to timestamp shifts has led to RoTHP, which incorporates rotary positional embeddings for translation invariance.
- Empirical evaluations across finance, healthcare, and social networks show that THP achieves superior predictive performance and scalability in modeling asynchronous event sequences.
A Transformer Hawkes Process (THP) is a neural temporal point process model that leverages Transformer self-attention to parameterize the conditional intensity of Hawkes processes for asynchronous event sequence modeling. THP and its variants, including RoTHP, have been developed to capture complex temporal and inter-event dependencies in domains such as finance, healthcare, and social networks, surpassing traditional RNN-based temporal point processes in both predictive performance and scalability.
1. Mathematical Foundation: Marked Hawkes Processes
A temporal point process on a timeline describes a stochastic sequence of event times with possible type labels ("marks"). The defining quantity is the conditional intensity function

$$\lambda(t \mid \mathcal{H}_t) = \lim_{\Delta t \to 0} \frac{\mathbb{P}\big(\text{event in } [t, t+\Delta t) \mid \mathcal{H}_t\big)}{\Delta t},$$

where $\mathcal{H}_t$ is the event history up to time $t$. In a marked (multivariate) Hawkes process, the type-specific intensity for event type $k$ is

$$\lambda_k(t) = \mu_k + \sum_{t_j < t} \phi_{k, k_j}(t - t_j),$$

with $\mu_k$ the exogenous base rate and $\phi_{k,k'}(\cdot)$ the mutual/self-excitation kernel from previous events of type $k'$ to the current type $k$. The negative log-likelihood over a sequence $\{(t_i, k_i)\}_{i=1}^{N}$ observed on $[0, T]$ is

$$\mathcal{L} = -\sum_{i=1}^{N} \log \lambda_{k_i}(t_i) + \sum_{k} \int_{0}^{T} \lambda_k(t)\, dt.$$
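To make these quantities concrete, below is a minimal NumPy sketch for a single event type with an exponential kernel $\phi(t) = \alpha\beta e^{-\beta t}$ (the parameter values and event times are illustrative only); for this kernel the compensator integral has a closed form.

```python
import numpy as np

MU, ALPHA, BETA = 0.2, 0.5, 1.0  # illustrative Hawkes parameters

def intensity(t, events):
    """lambda(t) = mu + sum_{t_j < t} alpha * beta * exp(-beta * (t - t_j))."""
    past = events[events < t]
    return MU + ALPHA * BETA * np.exp(-BETA * (t - past)).sum()

def neg_log_likelihood(events, T):
    """-sum_i log lambda(t_i) + int_0^T lambda(t) dt.

    For the exponential kernel the integral (compensator) is available in
    closed form: mu * T + alpha * sum_j (1 - exp(-beta * (T - t_j))).
    """
    log_term = sum(np.log(intensity(t, events)) for t in events)
    compensator = MU * T + ALPHA * np.sum(1.0 - np.exp(-BETA * (T - events)))
    return -log_term + compensator

events = np.array([0.5, 1.3, 1.4, 2.7, 4.0])  # toy event times on [0, T]
print(neg_log_likelihood(events, T=5.0))
```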
2. Transformer Hawkes Process: Architecture and Parameterization
THP (Zuo et al., 2020) replaces the recurrent backbone of classical neural Hawkes processes with a multi-layer Transformer encoder. Its input representation for each event $(t_i, k_i)$ combines:
- An event type embedding $\mathbf{e}_{k_i} \in \mathbb{R}^d$ from a learned embedding matrix,
- An absolute sinusoidal time encoding $\mathbf{z}(t_i) \in \mathbb{R}^d$ with
  $$[\mathbf{z}(t_i)]_{2j} = \sin\!\big(t_i / 10000^{2j/d}\big), \qquad [\mathbf{z}(t_i)]_{2j+1} = \cos\!\big(t_i / 10000^{2j/d}\big).$$

The sum $\mathbf{x}_i = \mathbf{e}_{k_i} + \mathbf{z}(t_i)$ forms the input sequence. In each Transformer layer, standard masked self-attention is performed:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\Big(\tfrac{QK^{\top}}{\sqrt{d_k}} + M\Big)V,$$

where $M$ is the causal mask that blocks attention to future events, yielding final hidden representations $\mathbf{h}(t_i)$.

Conditional intensity for type $k$ between events $t_i$ and $t_{i+1}$:

$$\lambda_k(t \mid \mathcal{H}_t) = f_k\!\Big(\alpha_k\, \frac{t - t_i}{t_i} + \mathbf{w}_k^{\top}\mathbf{h}(t_i) + b_k\Big), \qquad t \in (t_i, t_{i+1}],$$

where $f_k$ is typically softplus or ReLU to ensure non-negativity.
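A minimal PyTorch sketch of an intensity head following the parameterization above (class and parameter names, sizes, and initialization are illustrative placeholders, and the Transformer encoder that produces $\mathbf{h}(t_i)$ is omitted):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class THPStyleIntensityHead(nn.Module):
    """Maps hidden states h(t_i) to type-wise intensities lambda_k(t) for t in (t_i, t_{i+1}]."""
    def __init__(self, d_model: int, num_types: int):
        super().__init__()
        self.linear = nn.Linear(d_model, num_types)                  # w_k^T h(t_i) + b_k
        self.alpha = nn.Parameter(torch.full((num_types,), -0.1))    # current-time influence alpha_k

    def forward(self, h, t_i, t):
        # h: (batch, d_model) hidden state of the most recent event
        # t_i: (batch,) timestamp of the most recent event (assumed > 0, as in the formula above)
        # t: (batch,) query time with t > t_i
        current = self.alpha * ((t - t_i) / t_i).unsqueeze(-1)       # alpha_k * (t - t_i) / t_i
        return F.softplus(self.linear(h) + current)                  # softplus keeps lambda_k >= 0

# toy usage with random placeholders
head = THPStyleIntensityHead(d_model=32, num_types=5)
h = torch.randn(4, 32)
t_i = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(head(h, t_i, t_i + 0.5).shape)  # torch.Size([4, 5])
```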
THP is optimized by maximizing log-likelihood, with the integral term typically approximated via Monte Carlo or trapezoidal numerical integration. Training employs stochastic gradient descent (Adam), leveraging the parallelization and scalability of Transformer attention.
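As an illustration of the compensator (non-event) term mentioned above, here is a hedged sketch of the Monte Carlo approximation $\int_{t_i}^{t_{i+1}} \lambda(t)\,dt \approx (t_{i+1} - t_i)\,\tfrac{1}{S}\sum_{s}\lambda(u_s)$ with $u_s$ drawn uniformly from the interval (the function names and the constant-intensity placeholder are assumptions for the example):

```python
import torch

def mc_compensator(intensity_fn, t_left, t_right, num_samples=20):
    """Monte Carlo estimate of sum_i int_{t_i}^{t_{i+1}} sum_k lambda_k(t) dt.

    intensity_fn(t) -> (batch, num_types) intensities at query times t of shape (batch,);
    t_left, t_right -> (batch,) interval endpoints t_i and t_{i+1}.
    """
    width = (t_right - t_left).clamp(min=0)
    u = torch.rand(num_samples, *t_left.shape)                  # uniform samples in (0, 1)
    ts = t_left + u * width                                     # (num_samples, batch) query times
    lam = torch.stack([intensity_fn(t).sum(-1) for t in ts])    # total intensity over event types
    return (width * lam.mean(0)).sum()                          # average over samples, sum intervals

# toy usage with a constant-intensity placeholder (0.3 per type, 5 types)
fake_intensity = lambda t: torch.full((t.shape[0], 5), 0.3)
t_left = torch.tensor([0.0, 1.0, 2.5])
t_right = torch.tensor([1.0, 2.5, 4.0])
print(mc_compensator(fake_intensity, t_left, t_right))          # ~0.3 * 5 * 4.0 = 6.0
```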
THP empirically shows superior log-likelihood and prediction accuracy versus RNN-based models (RMTPP, Neural Hawkes) and alternative self-attentive models (SAHP) across datasets including financial transactions, healthcare (MIMIC-II), social media (Retweets, StackOverflow), and structured settings (911 calls, earthquakes).
3. Sequence Prediction and Translation Sensitivity: THP Limitations
THP's use of absolute sinusoidal embeddings introduces sensitivity to timestamp translation and noise. Specifically:
- The Hawkes negative log-likelihood depends only on the inter-event intervals $t_{i+1} - t_i$, but THP's absolute time encoding is not translation invariant: a global shift $t_i \mapsto t_i + t_0$ produces embedding values at test time that were never seen during training.
- In sequence prediction, e.g., predicting the future suffix after training on a prefix, absolute-time encodings for test-set timestamps can fall outside the training domain, causing generalization failures.
- Empirical studies show that global time shifts and additive timestamp jitter degrade THP's performance (a log-likelihood drop of up to 0.1 nats under shift, and degradations of about 0.9 in likelihood and 0.10 in RMSE under noise).
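The encoding-level issue behind these observations can be seen in a few lines of NumPy (a sketch with an illustrative embedding dimension, not THP's exact encoder): shifting every timestamp by a constant preserves all inter-event intervals but changes every absolute sinusoidal encoding.

```python
import numpy as np

def sinusoidal_encoding(t, d=8):
    """Absolute sinusoidal time encoding z(t) with frequencies 10000^(-2j/d)."""
    j = np.arange(d // 2)
    angles = np.outer(t, 1.0 / 10000 ** (2 * j / d))   # (len(t), d/2) phase angles
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

t = np.array([1.0, 2.5, 4.0])
shift = 1000.0
print(np.allclose(np.diff(t), np.diff(t + shift)))                           # True: intervals preserved
print(np.allclose(sinusoidal_encoding(t), sinusoidal_encoding(t + shift)))   # False: encodings change
```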
4. Relative Rotary Position Embedding: RoTHP Resolution and Mechanism
RoTHP (Gao et al., 2024) addresses these limitations by introducing a rotary temporal positional encoding (RoTPE) adapted from RoPE (Su et al., 2024). The key algorithmic innovations are:
- For each timestamp $t_i$, form a block-diagonal rotation matrix
  $$R_{\Theta, t_i} = \mathrm{diag}\big(R(t_i\theta_1), \ldots, R(t_i\theta_{d/2})\big), \qquad R(t_i\theta_j) = \begin{pmatrix} \cos(t_i\theta_j) & -\sin(t_i\theta_j) \\ \sin(t_i\theta_j) & \cos(t_i\theta_j) \end{pmatrix},$$
  where $\theta_j = 10000^{-2(j-1)/d}$.
- Self-attention is performed after applying the rotary shifts $\mathbf{q}_i' = R_{\Theta, t_i}\mathbf{q}_i$ and $\mathbf{k}_j' = R_{\Theta, t_j}\mathbf{k}_j$, so that
  $$\mathbf{q}_i'^{\top}\mathbf{k}_j' = \mathbf{q}_i^{\top} R_{\Theta,\, t_j - t_i}\, \mathbf{k}_j,$$
  making the attention score a pure function of the time difference, not the absolute timestamps.
- All attention is "relative in time", achieving translation invariance.
The rest of the Transformer and intensity parameterizations follow THP, but the overall RoTHP computation is invariant under timestamp shift and robust to local noise.
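A minimal NumPy sketch of this construction (the dimension and frequencies are illustrative, not the RoTHP reference implementation) builds $R_{\Theta, t}$ and checks numerically that the rotary attention score depends only on the time difference:

```python
import numpy as np

def rotation_matrix(t, d=8, base=10000.0):
    """Block-diagonal R_{Theta, t}: d/2 independent 2x2 rotations with angles t * theta_j."""
    thetas = base ** (-2 * np.arange(d // 2) / d)
    R = np.zeros((d, d))
    for j, th in enumerate(thetas):
        c, s = np.cos(t * th), np.sin(t * th)
        R[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[c, -s], [s, c]]
    return R

def rotary_score(q, k, t_q, t_k):
    """Attention score between a query at time t_q and a key at time t_k after rotary shifts."""
    return (rotation_matrix(t_q) @ q) @ (rotation_matrix(t_k) @ k)

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

s1 = rotary_score(q, k, t_q=3.0, t_k=1.0)
s2 = rotary_score(q, k, t_q=103.0, t_k=101.0)   # same time difference, timestamps shifted by 100
print(np.isclose(s1, s2))                        # True: the score depends only on t_k - t_q
```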
5. Theoretical Properties: Translation Invariance and Sequence Flexibility
RoTHP formalizes the following invariance properties:
- Translation invariance: for any global shift $t_0$, the total loss satisfies $\mathcal{L}\big(\{(t_i + t_0, k_i)\}\big) = \mathcal{L}\big(\{(t_i, k_i)\}\big)$.
- Sequence-prediction flexibility: because the loss and attention scores depend only on relative intervals, RoTHP can be trained on offset-normalized prefixes and deployed on arbitrary successor segments without retraining or mismatch.
These properties are provably lacking in vanilla THP.
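The translation-invariance claim rests on the composition rule $R_{\Theta, t + t_0} = R_{\Theta, t_0} R_{\Theta, t}$ and the orthogonality of rotations; in the notation of the previous section,

$$\big(R_{\Theta, t_i + t_0}\,\mathbf{q}_i\big)^{\top}\big(R_{\Theta, t_j + t_0}\,\mathbf{k}_j\big) = \mathbf{q}_i^{\top} R_{\Theta, t_i}^{\top} R_{\Theta, t_0}^{\top} R_{\Theta, t_0} R_{\Theta, t_j}\, \mathbf{k}_j = \mathbf{q}_i^{\top} R_{\Theta,\, t_j - t_i}\, \mathbf{k}_j,$$

so every attention score, and hence the loss as stated above, is unchanged when all timestamps are shifted by $t_0$.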
6. Empirical Evaluation and Comparative Performance
RoTHP's robustness and performance are validated on diverse real and synthetic datasets:
- Datasets: synthetic Hawkes sequences (length 20–100, 5 event types), financial transactions (Buy/Sell marks; avg. length ~2000), MIMIC-II (ICU visits; length ≤50), StackOverflow (22 types; avg. length 72), MemeTrack (42K memes), and Retweet cascades.
- Baselines: RMTPP, Neural Hawkes, SAHP, THP.
- Metrics: Log-likelihood, event-type accuracy, time RMSE.
A table summarizing results on the Financial / StackOverflow / Synthetic datasets (each cell lists values in that order, where reported):
| Model | Log-likelihood | Accuracy (%) | Time RMSE |
|---|---|---|---|
| RMTPP | –3.89 / –2.60 | 61.95 / 45.9 | 1.56 / 9.78 |
| Neural Hawkes | –3.60 / –2.55 | 62.20 / 46.3 | 1.56 / 9.83 |
| SAHP | – / –1.86 / 0.59 | – / – / 38.13 | – / – / 5.57 |
| THP | –1.11 / –0.039 | 62.23 / 46.4 | 0.93 / 4.99 |
| RoTHP | +1.076 / 0.389 / 1.01 | 62.26 / 46.33 / 38.13 | 0.60 / 1.33 / 2.29 |
- Under timestamp shift, RoTHP log-likelihood remains unchanged, unlike THP.
- Under jitter, RoTHP sees only minor degradation (likelihood/RMSE worsened by ~0.6/0.05) versus THP's more substantial drop.
- In sequence-prediction regimes (prefix train, suffix test), RoTHP substantially outperforms THP by ~2 nats log-likelihood, lower RMSE, and higher classification accuracy.
7. Significance, Practical Impact, and Future Directions
RoTHP provides the following advantages:
- Explicit translation invariance and robustness to timestamp noise;
- Improved generalization for sequence prediction and extrapolation across event horizons;
- Consistently state-of-the-art empirical results on both synthetic and real datasets;
- Directly addresses shortcomings of absolute positional encoding pervasive in Transformer point process models.
A plausible implication is that future temporal point process architectures should favor relative-time or translation-invariant encodings, especially for sequence analysis where extrapolation, noise, or time re-indexing is inherent.
RoTHP offers a blueprint for deploying Hawkes-process–based models in practical machine learning pipelines for financial, medical, social, and web-scale event data, ensuring stable predictive power and reliability across variable timestamp domains. Its empirical superiority and theoretical guarantees under global transformations suggest its adoption as a default module in sequence modeling tasks requiring time-shift invariance.