Deep Linear Hawkes Process (DLHP)

Updated 16 March 2026
  • DLHP is a novel model that integrates linear Hawkes processes with deep state-space models using stochastic jump differential equations for effective event sequence modeling.
  • It employs parallel scan recurrences to achieve linear computational scaling and efficient GPU/TPU implementations, reducing computational bottlenecks.
  • DLHP demonstrates state-of-the-art empirical performance on diverse real-world datasets by accurately capturing rich mark and temporal dependencies.

The Deep Linear Hawkes Process (DLHP) is a class of marked temporal point process (MTPP) models that synthesizes principles from linear Hawkes processes (LHPs) and modern deep state-space models (SSMs), enabling efficient, expressive, and parallelizable modeling of event sequences with rich mark and temporal dependencies. DLHPs modify linear differential equations in deep SSMs into stochastic jump differential equations, producing a process that generalizes classical LHP dynamics and facilitates the use of deep architectures, parallel scan algorithms, and scalable GPU/TPU implementations. This framework addresses representational and computational bottlenecks in both traditional Hawkes processes and contemporary neural MTPP models, establishing state-of-the-art empirical results and computational scaling across diverse real-world event-sequence datasets (Chang et al., 2024).

1. Motivation and Foundations

Classical linear Hawkes processes (LHPs) model event excitations using parametric kernels (typically exponentials), comprising a static background rate $\nu$ and kernel-determined self- or cross-excitation patterns. These assumptions limit flexibility in capturing complex historical and nonstationary sequence behaviors, and marked extensions require $K \times K$ matrix kernels, further constraining scalability and expressivity.

Neural MTPP models, such as RNN-based (e.g., RMTPP, Neural Hawkes) or attention-based (e.g., SAHP, Transformer Hawkes) architectures, have expanded modeling power but at a cost: RNNs are strictly sequential in time, which inhibits parallelization over long event sequences, while attention-based models incur $O(N^2)$ cost in sequence length $N$ because self-attention scales quadratically.

Modern deep SSMs use continuous-time linear recurrences that, once discretized, admit parallel scan algorithms with $O(N)$ total work and $O(\log N)$ depth. By combining these dynamics with Hawkes-style stochastic jump terms (“jump SDEs”), DLHP achieves long-range memory, dynamic excitation, and highly efficient computation.

2. Mathematical Formulation

DLHP operates as a stack of “Latent Linear Hawkes” (LLH) layers. For one LLH layer, with continuous hidden state $x(t)\in\mathbb{R}^P$, input signal $u(t)\in\mathbb{R}^H$, event-counting process $N(t)$ (for $K$ marks), and mark embeddings $\alpha \in \mathbb{R}^{R \times K}$, the model is specified as:

  • State evolution (stochastic jump differential equation):

$$dx(t) = A\,x(t^-)\,dt + B\,u(t^-)\,dt + E\,\alpha\,dN(t)$$

$A\in\mathbb{R}^{P\times P}$ encodes the linear state recurrence, $B\in\mathbb{R}^{P\times H}$ maps input to state, $E\in\mathbb{R}^{P\times R}$ projects mark embeddings, and $dN(t)$ is a one-hot jump at event times.

  • Output embedding:

$$y(t) = C\,x(t) + D\,u(t)$$

with $C\in\mathbb{R}^{H\times P}$ and $D\in\mathbb{R}^{H\times H}$.

  • Intensity for each mark:

$$\lambda^k(t) = f\big([y(t^-)]_k\big),\quad k=1,\dots,K$$

$f(\cdot)$ is a scaled softplus after an affine projection, enforcing non-negativity.
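The intensity link above can be sketched in a few lines. This is a minimal illustration, assuming the common form $f(z) = s \log(1 + e^{z/s})$ for the scaled softplus; the projection weights `W`, bias `b`, and the `scale` parameter are hypothetical names, not taken from the source.

```python
import math

def scaled_softplus(z: float, scale: float = 1.0) -> float:
    """Return scale * log(1 + exp(z / scale)), computed stably."""
    v = z / scale
    # max(v, 0) + log1p(exp(-|v|)) avoids overflow for large |v|.
    return scale * (max(v, 0.0) + math.log1p(math.exp(-abs(v))))

def mark_intensities(y, W, b, scale=1.0):
    """Affine projection of y (length H) to K values, then softplus.

    W is a K x H list of rows and b a length-K list (assumed names).
    """
    return [
        scaled_softplus(sum(w * yi for w, yi in zip(row, y)) + bk, scale)
        for row, bk in zip(W, b)
    ]
```

Whatever the sign of the projected value, the output is non-negative, which is the property the intensity needs.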

Classical LHP dynamics are recovered by special choices of these matrices.
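As a hedged illustration of that reduction, consider the scalar, single-mark case with no input ($u \equiv 0$), $A = -\beta$, $E\alpha = a$, $C = 1$, and an identity link plus background rate $\nu$ in place of the softplus (all simplifying assumptions made here, not the paper's exact construction). The state then equals the familiar sum of exponential kernels, which the sketch below checks numerically.

```python
import math

def hawkes_intensity_direct(t, events, nu, a, beta):
    """Classical exponential-kernel Hawkes intensity at time t."""
    return nu + sum(a * math.exp(-beta * (t - ti)) for ti in events if ti < t)

def hawkes_intensity_state(t, events, nu, a, beta):
    """Same intensity from the jump ODE dx = -beta * x dt + a dN(t).

    `events` must be sorted in increasing order.
    """
    x, last = 0.0, 0.0
    for ti in events:
        if ti >= t:
            break
        x = math.exp(-beta * (ti - last)) * x + a  # decay, then jump
        last = ti
    return nu + math.exp(-beta * (t - last)) * x   # decay up to t^-

events = [0.5, 1.2, 3.0]
gap = abs(hawkes_intensity_direct(2.0, events, 0.1, 0.8, 1.5)
          - hawkes_intensity_state(2.0, events, 0.1, 0.8, 1.5))
```

The state-based version touches each event once, whereas the direct sum revisits the whole history for every query time.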

3. Discretization and Parallel Recurrence

DLHP must compute states and intensities at irregular event times $t_1 < t_2 < \dots < t_N$. Under the zero-order hold (ZOH) assumption (input held constant across intervals), the update is:

  • State update:

$$x(t_i^-) = \Phi(\Delta t_i)\, x(t_{i-1}^+) + [\Phi(\Delta t_i) - I]\, \Lambda^{-1} B\, u(t_{i-1}^+)$$

where $\Delta t_i = t_i - t_{i-1}$, $A = V\Lambda V^{-1}$, and $\Phi(\Delta t) = e^{\Lambda \Delta t}$; the state, $B$, and $E$ are expressed in the eigenbasis of $A$.

  • Event jump:

$$x(t_i^+) = x(t_i^-) + E\, \alpha_{:,k_i}$$

  • Compact parallel scan update:

$$h_i = \Phi(\Delta t_i)\, h_{i-1} + K(\Delta t_i)\, u_{i-1} + E\, \alpha_{:,k_i}$$

$$\lambda_i = f(C h_i + d)$$

where $K(\Delta t) = (\Phi(\Delta t) - I)\, \Lambda^{-1} B$.
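The state update and event jump can be sanity-checked numerically. The sketch below assumes a diagonal $\Lambda$ with nonzero entries, so each channel evolves independently, and takes the already-projected input `bu` ($B u$) and jump `e_alpha_k` ($E \alpha_{:,k_i}$) as plain lists; the function names are illustrative. A fine-grained Euler integration of $\dot{x} = \Lambda x + B u$ serves as the reference.

```python
import cmath

def zoh_step(x, lam, bu, dt):
    """Closed-form ZOH update per diagonal channel (lam entries nonzero)."""
    out = []
    for xj, lj, bj in zip(x, lam, bu):
        phi = cmath.exp(lj * dt)
        out.append(phi * xj + (phi - 1.0) / lj * bj)
    return out

def jump(x, e_alpha_k):
    """Add the projected mark embedding at an event time."""
    return [xj + ej for xj, ej in zip(x, e_alpha_k)]

def euler_reference(x, lam, bu, dt, steps=100_000):
    """Brute-force integration of dx/dt = lam * x + bu, for comparison."""
    h = dt / steps
    x = list(x)
    for _ in range(steps):
        x = [xj + h * (lj * xj + bj) for xj, lj, bj in zip(x, lam, bu)]
    return x

x0 = [1.0 + 0j, 0.5 + 0j]
lam = [-0.5 + 2j, -1.0 + 0j]   # negative real parts, so the state decays
bu = [0.3, -0.2]
x_before_event = zoh_step(x0, lam, bu, dt=0.7)
x_after_event = jump(x_before_event, [0.5, 0.5])
```

The closed form needs one exponential per channel and interval, regardless of how long the gap between events is.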

This linear structure supports parallel scan algorithms with $O(N)$ total work and $O(\log N)$ depth, enabling efficient batched execution.
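To make the scan concrete, the following sketch treats one diagonal channel of the recurrence $h_i = a_i h_{i-1} + b_i$ with $h_0 = 0$ as a composition of affine maps, which is associative. The recursive pairing scheme shown here is one standard way to obtain $O(N)$ combines and $O(\log N)$ depth; it runs serially in plain Python and is not the source's implementation. All function names are illustrative.

```python
import math

def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan(elems):
    """Inclusive scan by recursive pairing: O(N) combines, O(log N) depth."""
    n = len(elems)
    if n <= 1:
        return list(elems)
    # Adjacent pairs are independent, so this step is parallel on real hardware.
    pairs = [combine(elems[i], elems[i + 1]) for i in range(0, n - 1, 2)]
    scanned = scan(pairs)  # prefixes ending at positions 1, 3, 5, ...
    out = [elems[0]] + [None] * (n - 1)
    for i in range(1, n):  # also independent across i
        if i % 2 == 1:
            out[i] = scanned[i // 2]
        else:
            out[i] = combine(scanned[i // 2 - 1], elems[i])
    return out

def scan_states(lam, dts, bu, jumps):
    """Post-jump states h_i for one channel, computed via the scan."""
    elems = []
    for dt, b, e in zip(dts, bu, jumps):
        phi = math.exp(lam * dt)
        elems.append((phi, (phi - 1.0) / lam * b + e))
    return [b for _, b in scan(elems)]

def sequential_states(lam, dts, bu, jumps):
    """Reference: the same recurrence evaluated one event at a time."""
    h, out = 0.0, []
    for dt, b, e in zip(dts, bu, jumps):
        phi = math.exp(lam * dt)
        h = phi * h + (phi - 1.0) / lam * b + e
        out.append(h)
    return out
```

In a JAX setting the same `combine` operator could be handed to a library scan primitive; the hand-written recursion is only meant to show why associativity is enough.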

4. Computational Complexity and Parallelism

DLHP achieves a favorable computational profile compared to competing methods:

  • Total work per layer: $O(NP^2)$ for a dense $A$; with the diagonalized form $A = V\Lambda V^{-1}$ the recurrence itself is elementwise and costs $O(NP)$, plus $O(NPH)$ for the input and output projections.
  • Parallel depth: $O(\log N)$ with parallel scan.
  • Comparison:
    • RNN-based MTPPs: $O(NP^2)$ work, depth $O(N)$ (sequential).
    • Transformer-based: $O(N^2 H)$ work (quadratic in sequence length).
    • DLHP: work linear in $N$ that parallelizes across the sequence, so scaling stays linear for sequences well beyond $N \gg 1000$ events.

Modern SSM toolkits (e.g., S5/Mamba in PyTorch, JAX) provide efficient eigenvalue decomposition, exponentiation routines, and kernel fusion for GPU/TPU acceleration.

5. Model Training and Inference

DLHP is trained by maximizing the MTPP log-likelihood:

$$\mathcal{L} = \sum_{i=1}^N \log \lambda^{k_i}(t_i) - \int_0^T \sum_{k=1}^K \lambda^k(s)\,ds$$

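The integral has no closed form and is estimated via Monte Carlo: $M$ times $s_{i,m}$ are sampled uniformly in each inter-event interval, and the contributions $\sum_{i,m} \lambda(s_{i,m})\, \Delta t_i / M$ are accumulated, where $\lambda(s) = \sum_{k=1}^K \lambda^k(s)$ is the total intensity.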
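A minimal sketch of this estimator follows. It assumes a caller-supplied `intensity(t)` function (a hypothetical interface) that returns the list of $K$ per-mark intensities as a left limit at time `t`, uses 0-indexed marks, and adds the tail interval from the last event to $T$.

```python
import math
import random

def mc_log_likelihood(event_times, event_marks, intensity, T, M=10, rng=None):
    """Monte Carlo estimate of the MTPP log-likelihood on [0, T].

    intensity(t) -> list of K per-mark intensities (assumed interface).
    """
    rng = rng or random.Random(0)
    ll, prev = 0.0, 0.0
    # Intervals between consecutive events, plus the tail (t_N, T].
    times = list(event_times) + [T]
    marks = list(event_marks) + [None]
    for t, k in zip(times, marks):
        dt = t - prev
        if dt > 0:
            # Average total intensity at M uniform samples, times interval length.
            total = sum(sum(intensity(rng.uniform(prev, t))) for _ in range(M))
            ll -= total * dt / M
        if k is not None:
            ll += math.log(intensity(t)[k])  # log-intensity of the observed mark
        prev = t
    return ll

# Usage with a constant two-mark intensity, where the integral is exact.
ll_const = mc_log_likelihood([1.0, 2.5, 4.0], [0, 1, 1],
                             lambda t: [0.5, 1.5], T=5.0)
```

Larger `M` lowers the variance of the integral term at proportionally higher cost; for a constant intensity the estimate is exact for any `M`.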

Common regularization and optimization strategies:

For inference, parallel scan recurrences enable efficient evaluation even for very long event sequences.

6. Empirical Performance and Benchmarks

DLHP was benchmarked on eight real-world datasets: Amazon, Retweet, Taxi, Taobao, StackOverflow, Last.fm, MIMIC-II, and EHRSHOT, covering sequence counts from 1.4K to 7.5K, mark sizes $K$ from $3$ to $668$, and sequence lengths up to 4K events.

| Metric | DLHP Result | Context |
|---|---|---|
| Log-likelihood | Outperforms all baselines on every dataset (38% higher likelihood per dataset on average) | Nats per event; main metric |
| Inter-event times | Most of the performance gain attributed here | Compared to mark prediction |
| Next-mark accuracy | Competitive or superior to alternatives | Top-1/Top-10 (EHRSHOT) |
| Next-time RMSE | Competitive or superior to alternatives | |
| Runtime scaling | Near-constant wall-time for $N \lesssim 5$K, linear scaling for larger $N$ | JAX-JIT implementation; model scales where Transformers become infeasible |

Ablation studies showed that input-dependent dynamics (varying $\Lambda$ with $u(t_i)$) provide small, consistent improvements; the choice of backward vs. forward ZOH has minimal effect.

7. Practical Considerations and Extensions

  • Dimension choices: Hidden state size $P$ and output size $H$ typically range from $32$ to $256$; layer counts run from $L=1$ to $4$; the jump rank $R$ is usually similar to $H$ (and can be reduced for speed).
  • Stabilization: Eigenvalues of $A$ are constrained to have negative real parts.
  • Kernel design: The exponential decay family $e^{\Lambda \Delta t}$ forms a learned “kernel bank”; input-dependent $\Lambda$ allows dynamic forgetting.
  • Extensions:
    • Nonlinear kernels via state- or time-dependent SDE nonlinearity
    • Hybrid memory/attention (short attention + LLH for very long contexts)
    • Structured $A$ (block-diagonal, low-rank plus diagonal)
    • Variational dropout for state uncertainty
    • Intensity-free versions using flow-based time-change for explicit sampling
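The stabilization constraint listed above can be enforced by construction. One common choice in diagonal SSM implementations, used here as an assumption rather than as the source's stated method, is to parameterize each eigenvalue as $\lambda = -e^{\theta_{\mathrm{re}}} + i\,\theta_{\mathrm{im}}$ with unconstrained real parameters; the names `theta_re` and `theta_im` are hypothetical.

```python
import cmath
import math

def stable_eigenvalue(theta_re: float, theta_im: float) -> complex:
    """Map unconstrained parameters to an eigenvalue with Re(lam) < 0."""
    return complex(-math.exp(theta_re), theta_im)

def transition(lam: complex, dt: float) -> complex:
    """One entry of the decay kernel exp(lam * dt) for a diagonal Lambda."""
    return cmath.exp(lam * dt)

lam = stable_eigenvalue(theta_re=-0.3, theta_im=2.0)
decay = abs(transition(lam, dt=1.5))  # magnitude below 1 for any dt > 0
```

Because the real part is negative for any parameter value of moderate size, gradient updates cannot push a channel into an exponentially growing regime, and the imaginary part lets a channel oscillate while it decays.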

DLHP inaugurates a new direction in MTPP modeling, combining the analytic properties of Hawkes processes with the computational and representational benefits of deep SSMs, stacked nonlinear layers, and parallelizable recurrences, achieving top accuracy and linear scaling (Chang et al., 2024).
