Deep Linear Hawkes Process (DLHP)

Updated 16 March 2026

DLHP is a novel model that integrates linear Hawkes processes with deep state-space models using stochastic jump differential equations for effective event sequence modeling.
It employs parallel scan recurrences to achieve linear computational scaling and efficient GPU/TPU implementations, reducing computational bottlenecks.
DLHP demonstrates state-of-the-art empirical performance on diverse real-world datasets by accurately capturing rich mark and temporal dependencies.

The Deep Linear Hawkes Process (DLHP) is a class of marked temporal point process (MTPP) models that synthesizes principles from linear Hawkes processes (LHPs) and modern deep state-space models (SSMs), enabling efficient, expressive, and parallelizable modeling of event sequences with rich mark and temporal dependencies. DLHPs modify linear differential equations in deep SSMs into stochastic jump differential equations, producing a process that generalizes classical LHP dynamics and facilitates the use of deep architectures, parallel scan algorithms, and scalable GPU/TPU implementations. This framework addresses representational and computational bottlenecks in both traditional Hawkes processes and contemporary neural MTPP models, establishing state-of-the-art empirical results and computational scaling across diverse real-world event-sequence datasets (Chang et al., 2024).

1. Motivation and Foundations

Classical linear Hawkes processes (LHPs) model event excitation with parametric kernels (typically exponential), combining a static background rate $\nu$ with kernel-determined self- or cross-excitation. These assumptions limit flexibility in capturing complex, nonstationary history dependence, and marked extensions require $K \times K$ matrix-valued kernels, which further constrains scalability and expressivity.

Neural MTPP models, whether RNN-based (e.g., RMTPP, Neural Hawkes) or attention-based (e.g., SAHP, Transformer Hawkes), have expanded modeling power, but at a cost: RNNs are strictly sequential in time, which prevents parallelization over long event sequences, while self-attention incurs $O(N^2)$ cost in the sequence length $N$.

Modern deep SSMs use continuous-time linear recurrences whose discretizations can be evaluated with parallel scan algorithms in $O(N)$ total work and $O(\log N)$ depth. By combining these dynamics with Hawkes-style stochastic jump terms (“jump SDEs”), DLHP achieves long-range memory, dynamic excitation, and highly efficient computation.

2. Mathematical Formulation

DLHP operates as a stack of “Latent Linear Hawkes” (LLH) layers. For one LLH layer, with continuous hidden state $x(t)\in\mathbb{R}^P$, input signal $u(t)\in\mathbb{R}^H$, event-counting process $N(t)$ (for $K$ marks), and mark embeddings $\alpha \in \mathbb{R}^{R \times K}$, the model is specified as:

State evolution (stochastic jump differential equation):

$d x(t) = A x(t^-) dt + B u(t^-) dt + E \alpha\, dN(t)$

$A\in\mathbb{R}^{P\times P}$ encodes the linear state recurrence, $B\in\mathbb{R}^{P\times H}$ maps the input to the state, $E\in\mathbb{R}^{P\times R}$ projects mark embeddings into the state, and $dN(t)\in\{0,1\}^K$ is zero except at event times, where it is the one-hot indicator of the event's mark.

Output embedding:

$y(t) = C x(t) + D u(t)$

$C\in\mathbb{R}^{H\times P}$, $D\in\mathbb{R}^{H\times H}$.

Intensity for each mark:

$\lambda^k(t) = f([y(t^-)]_k),\quad k=1,\dots,K$

$f(\cdot)$ is a scaled softplus after an affine projection, enforcing non-negativity.
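
As a minimal sketch of the intensity readout described above, the snippet below applies an affine map to $y$ and then a scaled softplus per mark. The exact parameterization of the scale is not given here, so the form `scale * log(1 + exp(z / scale))`, the function names, and the arguments `W` and `b` are all assumptions made for illustration.

```python
import math

def scaled_softplus(z: float, scale: float = 1.0) -> float:
    """Numerically stable scale * log(1 + exp(z / scale)); never negative."""
    v = z / scale
    # log(1 + e^v) = max(v, 0) + log1p(e^{-|v|}) avoids overflow for large |v|.
    return scale * (max(v, 0.0) + math.log1p(math.exp(-abs(v))))

def mark_intensities(y, W, b, scale=1.0):
    """Map y in R^H to K intensities: affine projection, then softplus per mark.

    W is a K x H list of rows and b has length K (both hypothetical parameters).
    """
    return [
        scaled_softplus(sum(w * yj for w, yj in zip(row, y)) + bk, scale)
        for row, bk in zip(W, b)
    ]
```

For large positive pre-activations the output approaches the identity, so the intensity grows linearly rather than exponentially in $y$.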

Classical LHP dynamics are recovered by special choices of these matrices.
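
As a worked illustration (a standard identity for exponential kernels, stated here under assumptions rather than taken from the source), consider a univariate Hawkes process with background rate $\nu$ and kernel $a\beta e^{-\beta t}$, where $a$ and $\beta$ are symbols introduced only for this example:

$\lambda(t) = \nu + \sum_{t_i < t} a\beta\, e^{-\beta (t - t_i)}$

Defining the scalar state $x(t) = \sum_{t_i \le t} a\beta\, e^{-\beta (t - t_i)}$ gives

$d x(t) = -\beta\, x(t^-)\, dt + a\beta\, dN(t), \qquad \lambda(t) = \nu + x(t^-)$

This has the form of the state equation with $P = 1$, $A = -\beta$, $B = 0$, and $E\alpha = a\beta$, assuming the softplus $f$ is replaced by a linear readout with offset $\nu$.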

3. Discretization and Parallel Recurrence

DLHP must compute states and intensities at irregular event times $t_1 < t_2 < \dots < t_N$. Under the zero-order hold (ZOH) assumption (input held constant across each interval), the update is:

State update:

$x(t_i^-) = \Phi(\Delta t_i)\, x(t_{i-1}^+) + [\Phi(\Delta t_i) - I]\, \Lambda^{-1} B\, u(t_{i-1}^+)$

where $\Delta t_i = t_i - t_{i-1}$, $A = V\Lambda V^{-1}$ is the eigendecomposition of $A$, $\Phi(\Delta t) = e^{\Lambda \Delta t}$, and the state and $B$ are expressed in the eigenbasis of $A$.

Event jump:

$x(t_i^+) = x(t_i^-) + E \alpha_{:,k_i}$
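
The two steps above can be sketched in dependency-free Python. This is an illustration under stated assumptions, not the reference implementation: $\Lambda$ is taken to be real and diagonal (in general its entries may be complex), `B` is assumed to be already expressed in the eigenbasis, and the function names `zoh_step` and `apply_jump` are hypothetical.

```python
import math

def zoh_step(x_prev, u_prev, lam, B, dt):
    """Evolve the state over an event-free interval of length dt under ZOH.

    x_prev: state just after the previous event, length P
    u_prev: input held constant over the interval, length H
    lam:    diagonal entries of Lambda (nonzero reals in this sketch), length P
    B:      P x H input matrix in the eigenbasis
    """
    x_new = []
    for p, (xp, lp) in enumerate(zip(x_prev, lam)):
        phi = math.exp(lp * dt)  # p-th diagonal entry of Phi(dt)
        bu = sum(B[p][h] * u_prev[h] for h in range(len(u_prev)))
        # Phi x + [Phi - I] Lambda^{-1} B u, evaluated per channel
        x_new.append(phi * xp + (phi - 1.0) / lp * bu)
    return x_new

def apply_jump(x_minus, E, alpha, k):
    """Add E @ alpha[:, k] to the pre-event state for an event with mark k."""
    num_rows = len(alpha)  # R, the jump rank
    return [
        xm + sum(E[p][r] * alpha[r][k] for r in range(num_rows))
        for p, xm in enumerate(x_minus)
    ]
```

With a negative eigenvalue and a constant input, the state relaxes toward $-\Lambda^{-1} B u$ as the interval grows, which is a quick sanity check on the sign conventions.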

Compact parallel scan update:

$h_i = \Phi(\Delta t_i)\, h_{i-1} + K(\Delta t_i)\, u_{i-1} + E \alpha_{:,k_i}$

$\lambda_i = f(C h_i + d)$

where $K(\Delta t) = (\Phi(\Delta t) - I)\, \Lambda^{-1} B$ and $h_i = x(t_i^+)$ is the post-event state.

This linear structure supports parallel scan algorithms with $O(N)$ total work and $O(\log N)$ depth, enabling efficient batched execution.
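
To make the scan structure concrete, here is a small sketch for a single scalar channel. Each step $h_i = a_i h_{i-1} + b_i$ is an affine map, and composing affine maps is associative, which is what allows a tree-structured prefix scan. The code runs sequentially in Python; the recursion only mirrors the $O(N)$-work, $O(\log N)$-depth pattern that an accelerator would execute in parallel. All names and the demo values are hypothetical.

```python
import math

def combine(first, second):
    """Compose affine maps h -> a*h + b: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)

def prefix_scan(elems, op):
    """Inclusive prefix scan with an associative op, via pairwise reduction."""
    n = len(elems)
    if n <= 1:
        return list(elems)
    # Level step: combine adjacent pairs (independent, hence parallelizable).
    pairs = [op(elems[i], elems[i + 1]) for i in range(0, n - 1, 2)]
    sub = prefix_scan(pairs, op)  # sub[j] is the prefix through index 2j + 1
    out = [elems[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], elems[i]))
    return out

def sequential(elems, h0=0.0):
    """Reference implementation: the plain left-to-right recurrence."""
    h, out = h0, []
    for a, b in elems:
        h = a * h + b
        out.append(h)
    return out

# Demo: a_i = exp(lam * dt_i); b_i stands in for K(dt_i) u_{i-1} + E alpha.
lam = -0.8
dts = [0.5, 1.2, 0.1, 2.0, 0.7]
drives = [1.0, 0.5, 2.0, 0.3, 1.5]
elems = [(math.exp(lam * dt), b) for dt, b in zip(dts, drives)]
h_scan = [b for _, b in prefix_scan(elems, combine)]  # states for h_0 = 0
h_seq = sequential(elems)
```

With a diagonal $\Lambda$ the same operator applies independently to each of the $P$ channels, so the vector case is an elementwise version of this sketch.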

4. Computational Complexity and Parallelism

DLHP achieves a favorable computational profile compared to competing methods:

Total work per layer: $O(NP^2)$ ($O(N P \log P)$ if $A$ is diagonalizable).
Parallel depth: $O(\log N)$ with parallel scan.
Comparison:
- RNN-based MTPPs: $O(NP^2)$ work, depth $O(N)$ (sequential).
- Transformer-based: $O(N^2 H)$ work (quadratic with sequence length).
- DLHP: work linear in $N$ with $O(\log N)$ depth, so it parallelizes fully across the sequence; the advantage is most pronounced for long sequences ($N \gg 1000$).

Modern SSM toolkits (e.g., S5/Mamba in PyTorch, JAX) provide efficient eigenvalue decomposition, exponentiation routines, and kernel fusion for GPU/TPU acceleration.

5. Model Training and Inference

DLHP is trained by maximizing the MTPP log-likelihood:

$\mathcal{L} = \sum_{i=1}^N \log \lambda^{k_i}(t_i) - \int_0^T \sum_{k=1}^K \lambda^k(s)\,ds$

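The integral has no closed form in general, so it is estimated by Monte Carlo: $M$ times $s_{i,m}$ are sampled uniformly within each inter-event interval, and the estimate accumulates $\sum_{i,m} \lambda(s_{i,m})\, \Delta t_i / M$, where $\lambda(s) = \sum_{k=1}^K \lambda^k(s)$ is the total intensity.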
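
The estimator can be sketched as follows. This is a hedged illustration: `intensity` is a hypothetical callable returning the $K$ mark intensities at a given time (in the model it would be computed from the hidden state), and the final interval from the last event to $T$ is included so that the integral covers $[0, T]$.

```python
import math
import random

def mc_log_likelihood(event_times, event_marks, intensity, t_end,
                      num_samples=10, rng=None):
    """Monte Carlo estimate of the MTPP log-likelihood on [0, t_end].

    intensity(t) -> list of K non-negative mark intensities at time t.
    """
    rng = rng or random.Random(0)
    # Event term: log-intensity of the observed mark at each event time.
    log_term = sum(
        math.log(intensity(t)[k]) for t, k in zip(event_times, event_marks)
    )
    # Compensator term: uniform samples within each inter-event interval.
    edges = [0.0] + list(event_times) + [t_end]
    integral = 0.0
    for left, right in zip(edges[:-1], edges[1:]):
        dt = right - left
        for _ in range(num_samples):
            s = rng.uniform(left, right)
            integral += sum(intensity(s)) * dt / num_samples
    return log_term - integral
```

For a constant intensity the estimate is exact regardless of the number of samples, which makes a convenient unit test before plugging in a learned model.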

Common regularization and optimization strategies:

Dropout in residual streams
Weight decay on all weight matrices
Gradient norm clipping
Adam/AdamW optimizer with learning rate warm-up and cosine decay
Mini-batch training by packing multiple padded sequences
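
As one concrete reading of the learning-rate item in the list above, the sketch below implements linear warm-up followed by cosine decay. The function name, the argument names, and the linear shape of the warm-up are assumptions; the hyperparameter values actually used are not stated here.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly up to peak_lr over the warm-up phase.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The schedule is optimizer-agnostic, so it can be paired with Adam or AdamW by setting the optimizer's learning rate to `lr_at(step, ...)` before each update.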

For inference, parallel scan recurrences enable efficient evaluation even for very long event sequences.

6. Empirical Performance and Benchmarks

DLHP was benchmarked on eight real-world datasets: Amazon, Retweet, Taxi, Taobao, StackOverflow, Last.fm, MIMIC-II, EHRSHOT, covering sequence counts from 1.4K to 7.5K, mark sizes $K$ from $3$ to $668$, and sequence lengths up to 4K events.

| Metric | DLHP result | Context |
|---|---|---|
| Log-likelihood | Outperforms all baselines on every dataset (38% higher likelihood per dataset on average) | Nats per event; main metric |
| Inter-event times | Most of the log-likelihood gain is attributed to inter-event time modeling | Compared with mark prediction |
| Next-mark accuracy | Competitive with or superior to alternatives | Top-1/Top-10 (EHRSHOT) |
| Next-time RMSE | Competitive with or superior to alternatives | |
| Runtime scaling | Near-constant wall-time for $N \lesssim 5$K, linear scaling for larger $N$ | JAX-JIT implementation; model scales where Transformers become infeasible |

Ablation studies showed that input-dependent dynamics (making $\Lambda$ depend on $u(t_i)$) provide small but consistent improvements, while the choice of backward versus forward ZOH has minimal effect.

7. Practical Considerations and Extensions

Dimension choices: Hidden state $P$ and output $H$ typically range from $32$ to $256$; layer counts $L=1$ to $4$; jump rank $R$ similar to $H$ (can reduce for speed).
Stabilization: Eigenvalues of $A$ are constrained to have negative real parts.
Kernel design: The exponential decay family $e^{\Lambda \Delta t}$ forms a learned “kernel bank”; input-dependent $\Lambda$ allows dynamic forgetting.
Extensions:
- Nonlinear kernels via state- or time-dependent SDE nonlinearity
- Hybrid memory/attention (short attention + LLH for very long contexts)
- Structured $A$ (block-diagonal, low-rank plus diagonal)
- Variational dropout for state uncertainty
- Intensity-free versions using flow-based time-change for explicit sampling
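
The stabilization constraint listed above (eigenvalues of $A$ with negative real parts) is often enforced by reparameterization rather than by projection. The sketch below shows one common choice, assumed here for illustration and not confirmed as the parameterization used by DLHP: the real part is stored as an unconstrained number and mapped through $-\exp(\cdot)$. The function names are hypothetical.

```python
import cmath
import math

def stable_eigenvalue(theta_re: float, theta_im: float = 0.0) -> complex:
    """Map unconstrained parameters to an eigenvalue with negative real part."""
    return complex(-math.exp(theta_re), theta_im)

def decay(lam: complex, dt: float) -> complex:
    """Diagonal entry of Phi(dt) = exp(Lambda * dt) for one eigenvalue."""
    return cmath.exp(lam * dt)
```

Because the real part is negative, `abs(decay(lam, dt))` is below 1 for any `dt > 0`, so each channel's contribution from a past event decays over time instead of growing.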

DLHP inaugurates a new direction in MTPP modeling, combining the analytic properties of Hawkes processes with the computational and representational benefits of deep SSMs, stacked nonlinear layers, and parallelizable recurrences, achieving top accuracy and linear scaling (Chang et al., 2024).


References (1)

Chang et al. (2024). Deep Linear Hawkes Processes.
