Deep Linear Hawkes Process (DLHP)
- DLHP is a novel model that integrates linear Hawkes processes with deep state-space models using stochastic jump differential equations for effective event sequence modeling.
- It employs parallel scan recurrences to achieve linear computational scaling and efficient GPU/TPU implementations, reducing computational bottlenecks.
- DLHP demonstrates state-of-the-art empirical performance on diverse real-world datasets by accurately capturing rich mark and temporal dependencies.
The Deep Linear Hawkes Process (DLHP) is a class of marked temporal point process (MTPP) models that synthesizes principles from linear Hawkes processes (LHPs) and modern deep state-space models (SSMs), enabling efficient, expressive, and parallelizable modeling of event sequences with rich mark and temporal dependencies. DLHPs modify linear differential equations in deep SSMs into stochastic jump differential equations, producing a process that generalizes classical LHP dynamics and facilitates the use of deep architectures, parallel scan algorithms, and scalable GPU/TPU implementations. This framework addresses representational and computational bottlenecks in both traditional Hawkes processes and contemporary neural MTPP models, establishing state-of-the-art empirical results and computational scaling across diverse real-world event-sequence datasets (Chang et al., 2024).
1. Motivation and Foundations
Classical linear Hawkes processes (LHPs) model event excitations using parametric kernels (typically exponentials), comprising a static background rate and kernel-determined self- or cross-excitation patterns. These assumptions limit flexibility in capturing complex historical and nonstationary sequence behaviors, and marked extensions require matrix kernels, further constraining scalability and expressivity.
Neural MTPP models, such as RNN-based (e.g., RMTPP, Neural Hawkes) or attention-based (e.g., SAHP, Transformer Hawkes) architectures, have expanded modeling power, but at a cost: RNNs are strictly sequential in time, which inhibits parallelization over long event sequences, while attention-based models incur $O(L^2)$ cost in the sequence length $L$ due to the quadratic scaling of self-attention.
Modern deep SSMs use continuous-time linear recurrences that, once discretized, can be evaluated by parallel scan algorithms with $O(L)$ total work and $O(\log L)$ depth for a sequence of length $L$. By combining these dynamics with Hawkes-style stochastic jump terms (“jump SDEs”), DLHP achieves long-range memory, dynamic excitation, and highly efficient computation.
2. Mathematical Formulation
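DLHP operates as a stack of “Latent Linear Hawkes” (LLH) layers. For one LLH layer, with continuous hidden state $x_t \in \mathbb{R}^P$, input signal $u_t \in \mathbb{R}^H$, event-counting process $N_t \in \mathbb{Z}_{\ge 0}^K$ (for $K$ marks), and mark embeddings $\alpha \in \mathbb{R}^{R \times K}$, the model is specified as:
- State evolution (stochastic jump differential equation):
$$dx_t = A x_{t-}\,dt + B u_{t-}\,dt + E \alpha\, dN_t$$
$A \in \mathbb{R}^{P \times P}$ encodes linear state recurrence, $B \in \mathbb{R}^{P \times H}$ maps input to state, $E \in \mathbb{R}^{P \times R}$ projects mark embeddings, and $dN_t$ is a one-hot jump at event times.
- Output embedding:
$$y_t = C x_t + D u_t, \quad y_t \in \mathbb{R}^H.$$
- Intensity for each mark:
$$\lambda_t^k = \sigma_k\!\left(w_k^\top y_t + b_k\right), \quad \sigma_k(z) = s_k \log\!\left(1 + e^{z / s_k}\right),$$
where $\sigma_k$ is a scaled softplus after an affine projection, enforcing non-negativity.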
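The intensity readout described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the names `scaled_softplus`, `mark_intensities`, `W`, `b`, and `s`, and the toy dimensions, are assumptions introduced here.

```python
import numpy as np

def scaled_softplus(z, s):
    """Compute s * log(1 + exp(z / s)) without overflow; the result is non-negative."""
    z = np.asarray(z, dtype=float)
    s = np.asarray(s, dtype=float)
    # np.logaddexp(0, a) evaluates log(1 + exp(a)) stably for large |a|.
    return s * np.logaddexp(0.0, z / s)

def mark_intensities(y, W, b, s):
    """Affine projection of a layer output y (shape H) to K marks, then scaled softplus."""
    return scaled_softplus(W @ y + b, s)

# Toy example with hypothetical sizes: H = 4 output features, K = 3 marks.
rng = np.random.default_rng(0)
H, K = 4, 3
y = rng.normal(size=H)
W = rng.normal(size=(K, H))   # hypothetical projection weights
b = np.zeros(K)               # hypothetical bias
s = np.ones(K)                # hypothetical per-mark scales
lam = mark_intensities(y, W, b, s)
```

For large positive pre-activations the function is approximately linear, and for large negative ones it decays toward zero, which is why it is a common choice for intensity outputs.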
Classical LHP dynamics are recovered by special choices of these matrices.
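As a worked illustration of this reduction, consider a univariate Hawkes process with an exponential kernel. The symbols $\mu$ (background rate), $\alpha$ (jump size), and $\beta$ (decay rate) are standard notation assumed here, not taken from the source:

$$\lambda_t = \mu + \sum_{t_i < t} \alpha\, e^{-\beta (t - t_i)}, \qquad x_t := \lambda_t - \mu \;\;\Longrightarrow\;\; dx_t = -\beta\, x_{t-}\,dt + \alpha\, dN_t.$$

Between events the excess intensity $x_t$ decays exponentially at rate $\beta$, and each event adds $\alpha$. This is a one-dimensional linear jump equation with state coefficient $-\beta$, no input term, and a linear readout $\lambda_t = \mu + x_t$ in place of the softplus.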
3. Discretization and Parallel Recurrence
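DLHP must compute states and intensities at irregular event times $t_1 < t_2 < \dots < t_L$. Write $x$ for the hidden state, $u$ for the layer input, $A$ and $B$ for the state and input matrices, and $E \alpha_k$ for the projected embedding of mark $k$. Under the zero-order hold (ZOH) assumption (input held constant across each inter-event interval), the update is:
- State update:
$$x_{t_{i+1}-} = \bar{A}_i\, x_{t_i} + \bar{B}_i\, u_{t_i}$$
where $\Delta_i = t_{i+1} - t_i$, $\bar{A}_i = e^{A \Delta_i}$, $\bar{B}_i = A^{-1}(\bar{A}_i - I) B$.
- Event jump:
$$x_{t_{i+1}} = x_{t_{i+1}-} + E \alpha_{k_{i+1}}$$
where $k_{i+1}$ is the mark of the event at $t_{i+1}$.
- Compact parallel scan update:
$$x_{t_{i+1}} = \bar{A}_i\, x_{t_i} + b_i$$
where $b_i = \bar{B}_i\, u_{t_i} + E \alpha_{k_{i+1}}$.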
This linear structure supports parallel scan algorithms with $O(L)$ total work and $O(\log L)$ depth in the sequence length $L$, enabling efficient batched execution.
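The recurrence above has the affine form "new state = multiplier × old state + offset", and compositions of such maps are associative, which is what makes a parallel scan applicable. The sketch below assumes a diagonal state matrix (so all products are elementwise) and uses a simple recursive-doubling scan; the function names, the precomputed `E_alpha` matrix, and the toy data are hypothetical. In JAX, `jax.lax.associative_scan` could be given the same `combine` operator.

```python
import numpy as np

def zoh_discretize(a, B, dt):
    """ZOH for diagonal A = diag(a): returns A_bar (P,) and B_bar (P, H)."""
    a_bar = np.exp(a * dt)
    B_bar = ((a_bar - 1.0) / a)[:, None] * B   # A^{-1} (A_bar - I) B, elementwise
    return a_bar, B_bar

def combine(left, right):
    """Compose x -> a*x + b maps: apply `left` first, then `right` (associative)."""
    a_l, b_l = left
    a_r, b_r = right
    return a_r * a_l, a_r * b_l + b_r

def sequential_states(a_bars, bs, x0):
    """Reference implementation: one step at a time."""
    xs, x = [], x0
    for a_bar, b in zip(a_bars, bs):
        x = a_bar * x + b
        xs.append(x)
    return np.stack(xs)

def scan_states(a_bars, bs, x0):
    """Inclusive prefix scan by recursive doubling: O(log L) rounds.

    This variant does O(L log L) work; a work-efficient (Blelloch-style)
    scan reaches O(L) work with the same operator.
    """
    a, b = a_bars.copy(), bs.copy()
    shift = 1
    while shift < len(a):
        # Merge the prefix ending at i - shift with the block ending at i.
        a_new, b_new = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = a_new, b_new
        shift *= 2
    return a * x0 + b

# Toy data (all values hypothetical).
rng = np.random.default_rng(0)
L, P, H, K = 8, 4, 3, 5
a = -np.exp(rng.normal(size=P))               # negative real eigenvalues
B = rng.normal(size=(P, H))
E_alpha = rng.normal(size=(P, K))             # column k holds E @ alpha_k
t = np.cumsum(rng.exponential(size=L + 1))    # event times t_0 < ... < t_L
marks = rng.integers(0, K, size=L + 1)
u = rng.normal(size=(L, H))                   # input held constant per interval

a_bars, bs = np.empty((L, P)), np.empty((L, P))
for i in range(L):
    a_bar, B_bar = zoh_discretize(a, B, t[i + 1] - t[i])
    a_bars[i] = a_bar
    bs[i] = B_bar @ u[i] + E_alpha[:, marks[i + 1]]   # decay input, then jump

x0 = E_alpha[:, marks[0]]                     # state right after the first event
states_seq = sequential_states(a_bars, bs, x0)
states_scan = scan_states(a_bars, bs, x0)
```

Both routines return the post-jump state at every event; the scan version replaces the length-$L$ dependency chain with a logarithmic number of vectorized rounds.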
4. Computational Complexity and Parallelism
DLHP achieves a favorable computational profile compared to competing methods:
- Total work per layer: linear in the sequence length $L$ (with a lower per-step cost if the state matrix $A$ is diagonalizable).
- Parallel depth: $O(\log L)$ with parallel scan.
- Comparison:
  - RNN-based MTPPs: $O(L)$ work, $O(L)$ depth (sequential).
  - Transformer-based: $O(L^2)$ work (quadratic in sequence length).
  - DLHP: fully parallel over $L$, with linear scaling for longer sequences.
Modern SSM toolkits (e.g., S5/Mamba in PyTorch, JAX) provide efficient eigenvalue decomposition, exponentiation routines, and kernel fusion for GPU/TPU acceleration.
5. Model Training and Inference
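DLHP is trained by maximizing the MTPP log-likelihood of the observed events $(t_i, k_i)$, $i = 1, \dots, L$, on $[0, T]$:
$$\mathcal{L} = \sum_{i=1}^{L} \log \lambda_{t_i}^{k_i} - \int_0^T \sum_{k=1}^{K} \lambda_t^k \, dt$$
The intractable integral is estimated via Monte Carlo (sampling $M$ uniform times $s_{i,m}$ per inter-event interval of length $\Delta_i$), accumulating contributions $\frac{\Delta_i}{M} \sum_{m=1}^{M} \sum_{k=1}^{K} \lambda_{s_{i,m}}^k$.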
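The Monte Carlo estimate of the log-likelihood can be sketched as follows. This is an illustrative example rather than the reference implementation: `mc_log_likelihood` and its arguments are hypothetical names, and `intensity_fn` stands in for whatever model maps a time to per-mark intensities (in DLHP this would depend on the hidden state just before that time).

```python
import numpy as np

def mc_log_likelihood(event_times, event_marks, intensity_fn, T, M, rng):
    """Log-likelihood on [0, T] with the integral estimated by Monte Carlo.

    intensity_fn(t) must return a length-K array of per-mark intensities.
    """
    # Event term: log-intensity of the observed mark at each event time.
    event_term = sum(
        np.log(intensity_fn(t)[k]) for t, k in zip(event_times, event_marks)
    )
    # Integral term: M uniform samples on each inter-event interval.
    edges = np.concatenate(([0.0], event_times, [T]))
    integral = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        samples = rng.uniform(lo, hi, size=M)
        mean_total = np.mean([intensity_fn(s).sum() for s in samples])
        integral += (hi - lo) * mean_total
    return event_term - integral

# Sanity check with a constant intensity, where the estimate is exact.
rng = np.random.default_rng(0)
constant = lambda t: np.array([0.5, 1.5])     # two marks, total rate 2.0
ll = mc_log_likelihood([1.0, 2.5, 3.0], [0, 1, 1], constant, T=4.0, M=5, rng=rng)
expected = np.log(0.5) + 2.0 * np.log(1.5) - 2.0 * 4.0
```

With a constant intensity the sampled values are identical, so `ll` matches the closed-form value `expected`; for a history-dependent model the estimator is unbiased but noisy, and its variance decreases as `M` grows.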
Common regularization and optimization strategies:
- Dropout in residual streams
- Weight decay on all weight matrices
- Gradient norm clipping
- Adam/AdamW optimizer with learning rate warm-up and cosine decay
- Mini-batch training by packing multiple padded sequences
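The learning-rate schedule mentioned in the list above (linear warm-up followed by cosine decay) can be written as a small function. The function name, argument names, and all numeric values here are illustrative assumptions, not settings reported in the source.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, final_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        # Ramp linearly; step 0 already uses a small non-zero rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings: 10 warm-up steps out of 100 total.
schedule = [warmup_cosine_lr(s, 1e-3, 10, 100) for s in range(101)]
```

The schedule rises to the peak at the end of warm-up and then decreases monotonically to `final_lr` at `total_steps`.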
For inference, parallel scan recurrences enable efficient evaluation even for very long event sequences.
6. Empirical Performance and Benchmarks
DLHP was benchmarked on eight real-world datasets (Amazon, Retweet, Taxi, Taobao, StackOverflow, Last.fm, MIMIC-II, and EHRSHOT), covering sequence counts from 1.4K to 7.5K, mark sizes from $3$ to $668$, and sequence lengths up to 4K events.
| Metric | DLHP Result | Context |
|---|---|---|
| Log-likelihood | Outperforms all baselines on every dataset (38% higher likelihood per dataset on average) | Measured in nats per event; main metric |
| Inter-event times | Most of the performance gain attributed here | Compared to mark prediction |
| Next-mark accuracy | Competitive or superior to alternatives | Top-1/Top-10 (EHRSHOT) |
| Next-time RMSE | Competitive or superior to alternatives | |
| Runtime scaling | Near-constant wall-time for shorter sequences, linear scaling for longer ones | JAX-JIT implementation; model scales where Transformers become infeasible |
Ablation studies showed that input-dependent dynamics (dynamics that vary with the layer input) provide small, consistent improvements; the choice of backward vs. forward ZOH has minimal effect.
7. Practical Considerations and Extensions
- Dimension choices: Hidden state dimension $P$ and output dimension $H$ typically range from $32$ to $256$; layer counts go up to $4$; the jump rank $R$ is similar to the hidden state dimension (and can be reduced for speed).
- Stabilization: Eigenvalues of the state matrix $A$ are constrained to have negative real parts.
- Kernel design: The exponential decay family forms a learned “kernel bank”; input-dependent dynamics allow dynamic forgetting.
- Extensions:
  - Nonlinear kernels via state- or time-dependent SDE nonlinearity
  - Hybrid memory/attention (short attention + LLH for very long contexts)
  - Structured state matrix $A$ (block-diagonal, low-rank plus diagonal)
  - Variational dropout for state uncertainty
  - Intensity-free versions using flow-based time-change for explicit sampling
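The stabilization constraint listed above can be enforced by construction. One common approach in the SSM literature, shown here as an assumption rather than as DLHP's confirmed parameterization, is to learn unconstrained parameters and map them to eigenvalues whose real part is the negative of an exponential. The names `stable_eigenvalues`, `log_decay`, and `freq` are hypothetical.

```python
import numpy as np

def stable_eigenvalues(log_decay, freq):
    """Map unconstrained reals to complex eigenvalues with negative real part."""
    return -np.exp(log_decay) + 1j * freq

# Hypothetical unconstrained parameters for a 6-dimensional diagonal state matrix.
rng = np.random.default_rng(0)
lam = stable_eigenvalues(rng.normal(size=6), rng.normal(size=6))

# Discretized multipliers exp(lambda * dt) then have magnitude below 1 for dt > 0,
# so the state decays between events instead of growing.
dt = 0.3
a_bar = np.exp(lam * dt)
```

Because the magnitude of each multiplier equals the exponential of the real part times `dt`, any positive interval length yields a contraction, whatever values the optimizer assigns to the unconstrained parameters.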
DLHP inaugurates a new direction in MTPP modeling, combining the analytic properties of Hawkes processes with the computational and representational benefits of deep SSMs, stacked nonlinear layers, and parallelizable recurrences, achieving top accuracy and linear scaling (Chang et al., 2024).