Hyper Hawkes Process (HHP): Interpretable Event Modeling

Updated 9 November 2025
  • The Hyper Hawkes Process (HHP) is a marked temporal point process model that extends classical Hawkes processes with a latent-state formulation and a history-dependent hypernetwork.
  • It decouples the latent dimension from the number of marks, enabling either dimensional lifting for greater expressivity or dimensional compression, so as to capture complex temporal dependencies.
  • The model offers transparent event-level interpretability through conditionally linear recurrences while achieving efficient parameter usage and strong predictive performance.

The Hyper Hawkes Process (HHP) is a class of marked temporal point process (MTPP) models that simultaneously addresses the interpretability limitations of neural MTPPs and the rigidity of classical Hawkes processes. HHP achieves this by expanding the dynamics into a latent space and introducing a history-dependent hypernetwork, yielding models that are both highly expressive and amenable to rigorous, event-level interpretability. The model exhibits piecewise, conditionally linear recurrences in the latent state, enabling both transparent prediction mechanisms and high predictive performance characteristic of neural models.

1. Model Specification and Latent Dynamics

Let the event history be $\mathcal{H}_t = \{(t_i, k_i)\}_{i=1}^{N_t}$ with marks $k_i \in \{1, \dots, K\}$. HHP models a $d$-dimensional latent state $\mathbf{x}_t \in \mathbb{R}^d$, whose time evolution determines the vector of event intensities $\boldsymbol\lambda_t = [\lambda_t^1, \ldots, \lambda_t^K]^\top$. The coupled system is:

$$d\mathbf{x}_t = -\boldsymbol\beta_t\, \mathbf{x}_{t-}\, dt + \boldsymbol\alpha\, d\mathbf{N}_t, \qquad \boldsymbol\beta_t = f_\theta(\mathcal{H}_{t-}), \qquad \boldsymbol\lambda_t = \sigma\bigl(\boldsymbol\mu + W\,\mathbf{x}_{t-}\bigr)$$

where:

  • $\mathbf{N}_t$ is the $K$-dimensional counting process, so the increment $d\mathbf{N}_t \in \{0,1\}^K$ indicates which mark (if any) occurs at $t$;
  • $\boldsymbol\alpha \in \mathbb{R}^{d\times K}$ collects the mark-specific impulse vectors;
  • $W \in \mathbb{R}^{K \times d}$, $\boldsymbol\mu \in \mathbb{R}^K$, and the softplus $\sigma(z) = \log(1 + e^z)$ ensure nonnegative intensities;
  • $f_\theta$ is a hypernetwork that encodes the history and outputs the decay dynamics.

Between events, i.e. for $t_i < t < t_{i+1}$, $f_\theta(\mathcal{H}_{t_i})$ fixes $\boldsymbol\beta_t = \beta_i = -V_i D_i V_i^*$ (with $V_i$ unitary and $D_i$ diagonal with negative real part; see Section 3), permitting the closed-form state update:

$$\mathbf{x}_{t} = V_i\, e^{D_i (t-t_i)}\, V_i^*\, \mathbf{x}_{t_i}$$

At each event $i+1$ of type $k_{i+1}$, the latent state is updated by:

$$\mathbf{x}_{t_{i+1}} = \mathbf{x}_{t_{i+1}-} + \boldsymbol\alpha_{k_{i+1}}$$

Across the whole trajectory, the latent process is thus governed by a piecewise, conditionally linear recurrence.
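
The recurrence can be made concrete with a short simulation. The following is a minimal NumPy sketch, not the paper's implementation: the hypernetwork outputs $(V_i, D_i)$ are stubbed with random values, the toy event history is invented, and taking the real part of $W\mathbf{x}$ is an assumption standing in for the paper's handling of complex-valued states.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 8, 3                      # latent dimension, number of marks

def softplus(z):
    return np.log1p(np.exp(z))

def random_unitary(n):
    # QR of a complex Gaussian gives a random unitary matrix (stub for V_i).
    q, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
    return q

mu = rng.normal(size=K)          # background logits mu
W = rng.normal(size=(K, d))      # latent -> intensity projection
alpha = rng.normal(size=(d, K))  # columns are impulse vectors alpha_k

x = np.zeros(d, dtype=complex)
t_prev = 0.0
for t_i, k_i in [(0.5, 0), (1.2, 2), (2.0, 1)]:    # toy history (t_i, k_i)
    V = random_unitary(d)                          # stubbed f_theta output
    D = -(softplus(rng.normal(size=d)) + 1j * rng.normal(size=d))
    # Decay over (t_prev, t_i):  x <- V exp(D dt) V^* x
    x = V @ (np.exp(D * (t_i - t_prev)) * (V.conj().T @ x))
    lam = softplus(mu + (W @ x).real)              # intensity just before t_i
    x = x + alpha[:, k_i]                          # jump by alpha_{k_i}
    t_prev = t_i
print("intensities just before the last event:", lam)
```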

2. Latent Dimensional Lifting and Expressivity

In the classical linear Hawkes process, the latent and mark dimensions coincide ($d = K$), with parameters $\beta, \alpha \in \mathbb{R}^{K \times K}$. HHP lifts this rigidity, allowing $d \gg K$ for expressivity or $d < K$ for compression:

$$d\mathbf{x}_t = -\boldsymbol\beta_t\, \mathbf{x}_{t-}\, dt + \sum_{k=1}^K \boldsymbol\alpha_k\, dN_t^k, \qquad \boldsymbol\lambda_t = \sigma\bigl(\boldsymbol\mu + W\,\mathbf{x}_{t-}\bigr)$$

Each event of mark $k$ injects a vector $\boldsymbol\alpha_k$, and $W$ projects the $d$-dimensional latent state to the $K$-dimensional intensity. This decoupling enables HHP to model dependencies inaccessible to standard Hawkes models while retaining the analytic tractability of the latent process.
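
A quick shape check (toy dimensions, all names illustrative) makes the decoupling concrete: only $\boldsymbol\alpha$ and $W$ change size with $d$, the intensity stays $K$-dimensional, and the $2dK$ parameters of $(\boldsymbol\alpha, W)$ replace the $2K^2$ of classical Hawkes:

```python
import numpy as np

K = 10                              # number of marks
for d in (4, 64):                   # d < K: compression; d >> K: lifting
    alpha = np.random.randn(d, K)   # mark impulses live in R^d
    W = np.random.randn(K, d)       # projection back to K intensities
    x = np.random.randn(d)          # latent state
    lam = np.log1p(np.exp(W @ x))   # softplus; shape (K,) regardless of d
    print(f"d={d:>2}: lambda shape {lam.shape}, "
          f"alpha+W params {alpha.size + W.size} (vs 2*K^2 = {2*K*K})")
```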

3. Hypernetwork Dynamics and Architecture

The decay/control matrix $\boldsymbol\beta_t$ is made history- and time-adaptive via a neural hypernetwork built on a GRU. For each event index $i$, the hypernetwork maintains a hidden state $z_i \in \mathbb{R}^h$:

$$z_i = \mathrm{GRU}_\phi\bigl(z_{i-1},\, [\log(t_i - t_{i-1}),\, e_{k_i}]\bigr), \qquad z_0 = \mathbf{0},$$

where $e_{k_i}$ denotes a learned embedding of the mark $k_i$.

From $z_i$, the hypernetwork outputs:

  • $d_i = W_d z_i + b_d \in \mathbb{R}^d$;
  • $D_i = -\mathrm{diag}(\mathrm{softplus}(d_i) \odot u)$, so that $\Re(D_i) < 0$;
  • $v_i = W_v z_i + b_v \in \mathbb{R}^{2dr}$;
  • $V_i = \mathrm{unitary}(v_i)$, using a standard parameterization that maps $v_i$ to a unitary matrix (à la Jing et al., 2017).

Thus $\{V_i, D_i\} = f_\theta(\mathcal{H}_{t_i})$ diagonalizes the inter-event dynamics, fixing $\boldsymbol\beta_t = -V_i D_i V_i^*$ over $(t_i, t_{i+1}]$.
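
Below is a compact PyTorch sketch of such a hypernetwork; class and variable names are illustrative. For brevity, $D_i$ is kept real and negative (omitting the $\odot\, u$ factor and imaginary parts), and $V_i$ is produced as the matrix exponential of a skew-Hermitian matrix, which is exactly unitary but is a stand-in for the Jing et al. (2017) parameterization used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperNet(nn.Module):
    """Illustrative GRU hypernetwork: (z_{i-1}, dt_i, k_i) -> (z_i, D_i, V_i)."""

    def __init__(self, K, d, h):
        super().__init__()
        self.embed = nn.Embedding(K, h)        # mark embedding e_k
        self.gru = nn.GRUCell(h + 1, h)        # input: [log dt, e_k]
        self.head_d = nn.Linear(h, d)          # -> d_i in R^d
        self.head_v = nn.Linear(h, 2 * d * d)  # -> real/imag parts for V_i
        self.d = d

    def forward(self, z, log_dt, k):
        # z: (1, h); log_dt: scalar tensor; k: (1,) long tensor
        inp = torch.cat([log_dt.view(1, 1), self.embed(k)], dim=-1)
        z = self.gru(inp, z)
        # D_i = -softplus(d_i): diagonal with strictly negative real part
        D = -F.softplus(self.head_d(z)).squeeze(0)
        # V_i = exp(S) with S skew-Hermitian (S^* = -S), hence V_i unitary
        v = self.head_v(z).squeeze(0)
        A = torch.complex(v[: self.d * self.d],
                          v[self.d * self.d :]).reshape(self.d, self.d)
        S = A - A.conj().T
        V = torch.linalg.matrix_exp(S)
        return z, D, V

# Usage on a toy event (dt = 0.7, mark 2):
net = HyperNet(K=3, d=8, h=16)
z = torch.zeros(1, 16)
z, D, V = net(z, torch.log(torch.tensor(0.7)), torch.tensor([2]))
```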

4. Interpretability and Linear Attribution Mechanisms

The conditional linearity of the update law enables decomposition of the latent state into per-event "particles": for any $t \in (t_i, t_{i+1}]$ and any past event $j \leq i$,

$$\mathbf{x}_t^{(j)} = \left( \prod_{k=j}^{i} V_k\, e^{D_k(\min\{t,\, t_{k+1}\} - t_k)}\, V_k^* \right) \boldsymbol\alpha_{k_j},$$

with the non-commuting factors ordered from $k = i$ on the left down to $k = j$ on the right, so that

$$\boldsymbol\lambda_t = \sigma\left(\boldsymbol\mu + W \sum_{j=1}^{i} \mathbf{x}_t^{(j)}\right)$$

In the limit where $\boldsymbol\beta$ is constant (the classical Hawkes case), this reduces to the familiar exponential decay form:

$$\mathbf{x}_t^{(j)} = e^{-\beta (t-t_j)}\, \boldsymbol\alpha_{k_j}$$
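
To see the reduction, note that constant dynamics $V_k = V$, $D_k = D$ (so $\beta = -VDV^*$) make all factors share eigenvectors, and the exponents telescope across the inter-event intervals:

$$\prod_{k=j}^{i} V e^{D(\min\{t,\, t_{k+1}\} - t_k)} V^* = V\, e^{D \sum_{k=j}^{i} (\min\{t,\, t_{k+1}\} - t_k)}\, V^* = V e^{D(t - t_j)} V^* = e^{-\beta (t - t_j)}$$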

This structure permits precise attribution of instantaneous and cumulative influence for each event via leave-one-out probes:

$$\mathrm{DF}\lambda_t^{(j)} = \boldsymbol\lambda_t - \sigma\left( \boldsymbol\mu + W \sum_{m \neq j} \mathbf{x}_t^{(m)} \right)$$

$$\mathrm{DF}\Lambda_t^{(j)} = \int_0^t \mathrm{DF}\lambda_s^{(j)}\, ds$$

Such closed-form probes can determine the degree to which each past event excites or inhibits the process, generalizing the transparency of classical Hawkes models to the more expressive HHP framework.
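
A minimal NumPy sketch of these probes (names illustrative, not the paper's code), assuming the per-interval dynamics $(V_k, D_k)$ have already been produced by the hypernetwork and that every listed event precedes the query time $t$:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def particles(events, dyn, alpha, t):
    """Per-event particles x_t^{(j)} for a query time t after all events.

    events: list of (t_i, k_i); dyn: list of (V_i, D_i), one per interval
    opening at t_i (assumed precomputed by the hypernetwork).
    """
    xs = []
    for j, (_, k_j) in enumerate(events):
        x = alpha[:, k_j].astype(complex)
        for k in range(j, len(events)):
            V, D = dyn[k]
            t_k = events[k][0]
            t_end = events[k + 1][0] if k + 1 < len(events) else t
            # decay the particle across the interval (t_k, min(t, t_{k+1}))
            x = V @ (np.exp(D * (min(t, t_end) - t_k)) * (V.conj().T @ x))
        xs.append(x)
    return xs

def df_lambda(xs, mu, W, j):
    """Leave-one-out probe DF-lambda_t^{(j)}: intensity change without event j."""
    # .real stands in for the paper's real-output parameterization (assumption)
    full = softplus(mu + (W @ sum(xs)).real)
    loo = softplus(mu + (W @ (sum(xs) - xs[j])).real)
    return full - loo  # positive entries: event j excites; negative: inhibits
```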

5. Training Procedure and Inference Workflow

HHP is trained by maximizing the standard log-likelihood for MTPPs:

$$\mathcal{L}(\mathcal{H}_T) = \sum_{i=1}^{N_T} \log \lambda_{t_i}^{k_i} - \int_0^T \sum_{k=1}^K \lambda_s^k\, ds$$

The integral term is approximated by Monte Carlo with uniform samples in each inter-event interval, and the hypernetwork is re-evaluated only at event times. The parameter set $\theta = \{\phi, \boldsymbol\alpha, W, \boldsymbol\mu\}$ is optimized end-to-end with Adam; regularization is limited to early stopping and weight decay, with hyperparameters (latent dimension $d$, GRU hidden size, etc.) chosen by search.
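
A sketch of this Monte Carlo estimator, assuming the intensity evaluations have already been collected from the model (array names are illustrative):

```python
import numpy as np

def mc_loglik(log_lam_events, lam_total_samples, interval_lengths):
    """Monte Carlo estimate of the MTPP log-likelihood of Section 5.

    log_lam_events:    (N,)   log lambda_{t_i}^{k_i} at each observed event
    lam_total_samples: (M, S) sum_k lambda_s^k at S uniform samples drawn
                       from each of the M inter-event intervals
    interval_lengths:  (M,)   lengths of those intervals
    """
    event_term = np.sum(log_lam_events)
    # int_0^T sum_k lambda_s^k ds  ~  sum_i |I_i| * mean over samples in I_i
    integral_term = np.sum(interval_lengths * lam_total_samples.mean(axis=1))
    return event_term - integral_term
```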

6. Benchmarking and Empirical Performance

HHP was evaluated across diverse real-world datasets: Amazon reviews, Retweet cascades, NY Taxi pickups, Taobao purchases, StackOverflow posts, Last.fm listening logs, and MIMIC-II medical events. Metrics included per-event log-likelihood (time- and mark-decomposed), next-time RMSE, next-mark accuracy, and calibration (PCE for time, ECE for marks), aggregated by a composite rank.

The principal baselines were: RMTPP, NHP, SAHP, THP, IFTPP, AttNHP, and S2P2. HHP achieved a composite average rank of 2.6 (placing as best or second-best on 4 of 6 metrics), with particular strength in time RMSE (1.4) and mark accuracy (1.7), and matched state-of-the-art log-likelihood (rank 2.0 against S2P2’s 1.9). Notably, HHP required on average 54% fewer parameters than S2P2, while maintaining top-tier predictive performance.

| Dataset | Best/Second-Best Metrics | Parameter Efficiency |
|---|---|---|
| Amazon, Retweet, ... | 4/6 metrics (#1 or #2) | 54% fewer than S2P2 |

7. Synthesis: Flexibility, Interpretability, and Research Context

HHP bridges the dichotomy between classical and neural MTPP models. By maintaining a linear Hawkes-style recurrence, it preserves closed-form, per-event attribution, enabling rigorous interpretability probes that directly inspect model predictions at the event level. At the same time, the hypernetwork that generates piecewise-constant, history-conditioned decay dynamics $\boldsymbol\beta_t$, together with the enlarged latent state dimension, addresses the expressivity limitations of standard Hawkes frameworks. The model thus exhibits non-stationary, adaptive temporal memory, combining the transparent structure of Hawkes processes with the flexibility and performance previously characteristic only of neural MTPPs. The empirical results demonstrate that HHP's interpretability does not come at the expense of predictive power, offering a route toward interpretable, high-capacity event modeling in real-world temporal domains.
