
Sigmoid-Gaussian Hawkes Processes Explained

Updated 16 March 2026
  • Sigmoid-Gaussian Hawkes Processes are point process models that use a sigmoid function to introduce nonlinear history dependence and flexible GP-based kernel estimation.
  • They integrate excitatory and inhibitory effects by modeling both baseline and self-kernels nonparametrically, suitable for diverse fields like neuroscience and social networks.
  • Efficient inference via Pólya–Gamma augmentation and mean-field variational Bayes enables scalable analysis of complex, high-dimensional event data.

Sigmoid-Gaussian Hawkes Processes (SGHP) are a flexible class of point process models that generalize the classical linear Hawkes process by introducing nonlinear history dependence via a sigmoid link function and modeling both the baseline intensity and self-excitation (or inhibition) kernels nonparametrically using Gaussian Processes (GPs). These models have emerged as powerful tools for capturing complex temporal dependencies in multivariate event data, accommodating both excitatory and inhibitory effects, as well as nonlinearities and nonparametric temporal dynamics. SGHPs are at the forefront of modern stochastic modeling in neuroscience, criminology, social networks, and other fields with rich event-driven temporal data.

1. Mathematical Structure and Model Definition

Let $N(t)$ denote a univariate or multivariate counting process representing event times $\{t_n\}$ in $[0,T]$. In the univariate setting, the SGHP conditional intensity at time $t$ is expressed as
$$\Lambda(t \mid \mathcal{H}_t) = \lambda\,\sigma(\phi(t)), \qquad \sigma(u) = \frac{1}{1+e^{-u}},$$
where $\mathcal{H}_t = \{t_n : t_n < t\}$ is the event history, $\lambda > 0$ is an upper bound on the intensity (often given a Gamma prior), and $\phi(t)$ is the "linear predictor," given by

$$\phi(t) = s(t) + \sum_{n\,:\,t_n < t} g(t - t_n)\,e^{-\alpha (t-t_n)}.$$

Here, $s(t)$ encodes a nonparametric background rate and $g(\tau)$ is a nonparametric (possibly sign-changing) self-kernel. Both $s(\cdot)$ and $g(\cdot)$ are modeled as zero-mean GPs with RBF (squared-exponential) kernels controlling smoothness and amplitude; the exponential factor with $\alpha > 0$ imposes finite memory. The likelihood of the observed times is
$$p(\{t_n\} \mid \phi, \lambda) = \exp\left(-\int_0^T \lambda\,\sigma(\phi(t))\,dt\right)\;\prod_{n=1}^N \lambda\,\sigma(\phi(t_n)).$$
This formulation captures both excitation ($g(\tau) > 0$) and inhibition ($g(\tau) < 0$) within the same flexible framework (Malem-Shinitski et al., 2021).
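For concreteness, the intensity and likelihood above can be sketched directly in Python for fixed choices of $s$, $g$, $\lambda$, and $\alpha$. The functions `s_fn` and `g_fn` below are hypothetical closed-form stand-ins for GP draws, and the integral is approximated by a plain trapezoidal rule; this is an illustration of the formulas, not any published implementation.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def phi(t, events, s, g, alpha):
    """Linear predictor: background s(t) plus exponentially discounted kernel terms."""
    return s(t) + sum(g(t - tn) * math.exp(-alpha * (t - tn))
                      for tn in events if tn < t)

def intensity(t, events, s, g, lam, alpha):
    """Conditional intensity Lambda(t | H_t) = lam * sigmoid(phi(t))."""
    return lam * sigmoid(phi(t, events, s, g, alpha))

def log_likelihood(events, T, s, g, lam, alpha, n_grid=2000):
    """log p({t_n}) = sum_n log Lambda(t_n) - int_0^T Lambda(t) dt (trapezoidal quadrature)."""
    dt = T / n_grid
    vals = [intensity(i * dt, events, s, g, lam, alpha) for i in range(n_grid + 1)]
    integral = dt * (0.5 * vals[0] + sum(vals[1:-1]) + 0.5 * vals[-1])
    return sum(math.log(intensity(tn, events, s, g, lam, alpha))
               for tn in events) - integral

# Hypothetical fixed functions standing in for draws of the GPs s(.) and g(.):
s_fn = lambda t: 0.2
g_fn = lambda tau: 1.5 * math.exp(-tau)   # a purely excitatory kernel
events = [0.5, 1.1, 1.3, 2.7]
ll = log_likelihood(events, T=4.0, s=s_fn, g=g_fn, lam=2.0, alpha=1.0)
```

Note that the intensity is bounded above by $\lambda$ regardless of how large the history sum grows, which is the key difference from the linear Hawkes process.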

In multivariate and high-dimensional settings, each component's intensity aggregates the effects of $d$ interacting processes:
$$\lambda_i(t) = \sigma\left(\mu_i + \sum_{j=1}^d \int_0^t g_{ij}(t-s)\,dN_j(s)\right), \qquad 1 \leq i \leq d,$$
with Gaussian-shaped or nonparametric $g_{ij}(\tau)$, extending the model to complex networks and enabling both mutual excitation and inhibition (Chen et al., 2017).

2. Gaussian Process Priors and Nonparametric Modeling

SGHPs use GP priors to model both the baseline and self/interaction kernels nonparametrically, permitting adaptive sharing of statistical strength in low-data regimes. Specifically,

$$s \sim \mathcal{GP}(0, K^s), \qquad g \sim \mathcal{GP}(0, K^g)$$

with squared-exponential kernels

$$K^{\bullet}(u, v) = a_{\bullet}\exp\left(-\frac{(u - v)^2}{\sigma_{\bullet}^2}\right).$$

The resulting predictor $\phi(t)$ is itself a GP, with covariance incorporating historical contributions:
$$\widetilde K(t,t') = K^s(t, t') + \sum_{t_i < t}\sum_{t_j < t'} K^g(t - t_i,\, t' - t_j)\,e^{-\alpha[(t-t_i)+(t'-t_j)]}.$$
This structure allows SGHPs to capture nonstationary, oscillatory, and complex event-triggering phenomena. In sparse-data settings, the shared GP covariance regularizes the estimation problem, reducing overfitting; in high-event-count regimes, sparse inducing-point GP approximations are introduced for computational efficiency (Malem-Shinitski et al., 2021, Zhou et al., 2019, Sulem et al., 2022).
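The induced covariance $\widetilde K$ can be assembled directly from the two RBF kernels. The sketch below assumes scalar hyperparameters $a_\bullet$, $\sigma_\bullet$ and a fixed event history; it is a literal transcription of the formula above, not optimized code.

```python
import math

def k_rbf(u, v, a, sig):
    """Squared-exponential kernel a * exp(-(u - v)^2 / sig^2)."""
    return a * math.exp(-((u - v) ** 2) / sig ** 2)

def k_phi(t, tp, events, a_s, sig_s, a_g, sig_g, alpha):
    """Induced covariance of phi: K^s(t,t') plus discounted pairwise K^g
    contributions over the event history, as in the displayed formula."""
    total = k_rbf(t, tp, a_s, sig_s)
    for ti in events:
        if ti >= t:
            continue
        for tj in events:
            if tj >= tp:
                continue
            total += (k_rbf(t - ti, tp - tj, a_g, sig_g)
                      * math.exp(-alpha * ((t - ti) + (tp - tj))))
    return total
```

One sanity property worth noting: $\widetilde K$ inherits symmetry from the two RBF kernels, and with an empty history it reduces to $K^s$ alone.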

3. Conjugate Bayesian Inference via Pólya–Gamma and Poisson Augmentation

Inference in SGHPs is nontrivial due to the nonlinear, nonparametric, and non-conjugate likelihood. Analytical tractability is achieved through Pólya–Gamma (PG) augmentation for each sigmoid term, coupled with marked Poisson process augmentation (Campbell’s theorem) for the time-integrals:

  • For any $u$, $\sigma(u)$ is expressed as a mixture over auxiliary PG variables:

$$\sigma(u) = \int_0^\infty \exp\left(-\tfrac12 w u^2 + \tfrac12 u - \ln 2\right)\,\mathrm{PG}(w; 1, 0)\,dw.$$

  • The exponential survival term is rewritten as an expectation over an inhomogeneous marked Poisson process on $(t, w)$.

The joint posterior over intensity functions, auxiliary PG variables, and latent Poisson marks becomes amenable to efficient coordinate-ascent variational Bayes (VB) updates or closed-form EM updates. These strategies avoid the need for explicit event-by-event branching structure inference, critical for scalability (Zhou et al., 2019, Malem-Shinitski et al., 2021, Sulem et al., 2022).
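The PG mixture identity above can be checked numerically using the standard infinite-convolution-of-gammas representation of $\mathrm{PG}(1,0)$, truncated to finitely many terms (so the estimate carries a small truncation bias). This is a verification sketch only, unrelated to any published inference code.

```python
import math
import random

random.seed(0)

def sample_pg10(n_terms=200):
    """Approximate draw from PG(1, 0) via the infinite-sum representation
    w = (1 / (2 pi^2)) * sum_k g_k / (k - 1/2)^2, g_k ~ Exp(1), truncated."""
    s = sum(random.expovariate(1.0) / (k - 0.5) ** 2
            for k in range(1, n_terms + 1))
    return s / (2.0 * math.pi ** 2)

def sigmoid_via_pg(u, n_samples=4000):
    """Monte Carlo estimate of the mixture identity:
    sigma(u) = (e^{u/2} / 2) * E_{w ~ PG(1,0)}[exp(-w u^2 / 2)]."""
    est = sum(math.exp(-sample_pg10() * u ** 2 / 2.0)
              for _ in range(n_samples)) / n_samples
    return 0.5 * math.exp(u / 2.0) * est

u = 1.3
mc = sigmoid_via_pg(u)                     # Monte Carlo value of the integral
exact = 1.0 / (1.0 + math.exp(-u))         # sigma(u) computed directly
```

The two values agree to Monte Carlo precision, which is exactly why conditioning on $w$ turns each sigmoid factor into a Gaussian-form term in $\phi$.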

4. Mean-Field Variational Bayes and Adaptive Inference

Mean-field VB posits a factorized posterior over the intensity GP, the PG and Poisson latent variables, and hyperparameters:
$$q(\phi, \lambda, \{w_n\}, \{\hat t_m, \hat w_m\}) = q_1(\phi, \lambda)\; q_2(\{w_n\}, \{\hat t_m, \hat w_m\}).$$
Coordinate-wise optimization yields closed-form updates:

  • The GP posterior (on inducing points) remains Gaussian, and its mean/variance can be updated via standard sparse-GP formulas.
  • PG variables are updated with $q_2(w_n) = \mathrm{PG}\!\left(1, \sqrt{\mathbb{E}_{q_1}[\phi(t_n)^2]}\right)$.
  • Ghost point processes are updated as marked Poisson processes with intensities depending on expectations under the current variational distribution.
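To illustrate the closed-form flavor of these updates: what actually enters the Gaussian update for the GP is the posterior mean of each PG variable, which for $w \sim \mathrm{PG}(1, c)$ is $\mathbb{E}[w] = \tanh(c/2)/(2c)$, a standard Pólya–Gamma identity. A minimal helper (the limiting value $1/4$ at $c = 0$ is handled explicitly):

```python
import math

def pg_mean(c):
    """Posterior mean E[w] of w ~ PG(1, c), as used in the VB update
    q(w_n) = PG(1, c_n) with c_n = sqrt(E_q[phi(t_n)^2])."""
    if abs(c) < 1e-8:
        return 0.25               # limit of tanh(c/2) / (2c) as c -> 0
    return math.tanh(c / 2.0) / (2.0 * c)
```

Each coordinate-ascent sweep recomputes these scalar expectations and feeds them into the sparse-GP mean/variance formulas, which is why no sampling is needed inside the loop.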

Adaptive model selection—particularly for sparsity and dictionary-size in the basis expansion case—can be incorporated via spike-and-slab priors and two-stage selection on the graph and function class. Empirical ELBO maximization guides selection, and averaging over models is supported as well (Sulem et al., 2022).

5. Theoretical Properties: Stationarity, Weak Dependence, and Concentration Bounds

The sigmoid link and Gaussian kernels yield bounded intensities and finite memory, and accommodate interactions that need not be mutually excitatory. Under the Lipschitz assumption for the sigmoid ($L_h = 1/4$) and integrable transfer kernels, stationarity is guaranteed when the spectral radius satisfies $\rho(\Omega) < 1$, where $\Omega_{ij} = L_h \int |g_{ij}(s)|\,ds$.

Advanced dependence and concentration results include:

  • Weak $\tau$-dependence and explicit exponential tail bounds on covariance decay, even under inhibitory effects.
  • Bernstein-type concentration inequalities for second-order event statistics.
  • High-dimensional consistency guarantees for penalized moment-matching recovery of amplitude matrices $A = (\alpha_{ij})$, with error rates scaling as $\sqrt{(s \log d)/T^{(2r+1)/(5r+2)}}$ for $s$-sparse interaction graphs (Chen et al., 2017, Sulem et al., 2022).

6. Computational Complexity and Scalability

SGHP inference leverages scalable, closed-form updates and efficient numerical quadrature:

  • Per-iteration cost of VB or EM is $O(P^3 + P^2 N)$ for $P$ inducing points and $N$ events, dominated by kernel matrix operations; $P \ll N$ suffices in practice.
  • Updates for the branching structure, PG variables, and quadrature scale as $O(NL)$, where $L$ is the average nonzero history length.
  • Overall complexity is approximately linear in $N$, making high-dimensional models feasible (e.g., $K$ up to 64 processes with acceptable compute time) (Malem-Shinitski et al., 2021, Sulem et al., 2022, Zhou et al., 2019).

7. Empirical Results and Applications

Experimental validation covers synthetic, simulated, and real-world datasets:

  • On synthetic traces, both VB and EM efficiently recover true intensities and kernels, outperforming classical Wiener–Hopf and MMEL nonparametric methods in MSE, with VB orders of magnitude faster than MCMC.
  • On crime event data (Vancouver, NYPD), SGHP achieves held-out log-likelihood comparable to or exceeding earlier models, confirming flexibility in capturing real event dependencies.
  • In neuronal spiking, inhibitory effects are robustly detected; time-rescaling QQ-plots exhibit KS $p$-values supporting good fit.
  • For social media (retweet cascades), multivariate extensions yield higher predictive agreement and fit than univariate versions.
  • Scalability experiments confirm excellent graph/parameter recovery and robustness to mild kernel/link-function misspecification (Malem-Shinitski et al., 2021, Zhou et al., 2019, Sulem et al., 2022).
| Domain | Inference: Accuracy | Comments/Highlights |
|---|---|---|
| Synthetic | VB/EM accurately recover $\mu$/$\phi$ (low MSE) | Outperform classic nonparametrics; scalable |
| Crime/neural | Test log-likelihood and QQ goodness-of-fit | Captures excitation/inhibition, good statistical fit |
| Social network | Higher log-likelihood, $p$-value (multivariate) | Adapts to grouped user cascades |

All key algorithms are available in open-source Python/JAX codebases, with hyperparameters tuned via Adam, GP noise jitter set to $10^{-4}$, 100–300 inducing points, and support for both Gibbs sampling and variational Bayes (Malem-Shinitski et al., 2021).

