- The paper introduces a framework to infer stochastic differential equation (SDE) models from time-series data of jump processes, designed to detect nonlinear memory in systems like cell division.
- Applying the framework reveals significant nonlinear memory of mother size in species like E. coli and Dictyostelium discoideum, which goes beyond conventional linear-memory cell division models.
- This data-driven inference approach is broadly applicable to model diverse stochastic jump processes in various fields, demonstrated with examples from healthcare data and online activity.
This paper (2408.14564) introduces a practical framework for inferring stochastic differential equation (SDE) models with inhomogeneous Poisson noise directly from time-series data of discontinuous jump processes. The primary motivation is to analyze cell growth and division dynamics, which involve continuous growth interrupted by discrete division events. The framework is specifically designed to detect and quantify nonlinear memory effects, which are often not captured by conventional cell homeostasis models like "sizer", "adder", and "timer".
The core idea is to model the cell size dynamics $s_t$ using an SDE:

$$ds_t = g(s_t)\,dt - h(s_{t^-})\,dN(t)$$

where:
- $g(s_t)$ is the deterministic growth rate function.
- $h(s_{t^-})$ is the deterministic cut size upon division, representing the size decrease of the mother cell.
- $dN(t)$ is a Poisson counting process that signals division events. The rate of this process, $\lambda$, is history-dependent, $\lambda(s_t, s_{t^*}, s_{t^{**}}, \dots)$, where $s_t$ is the current size, $s_{t^*}$ is the mother cell size (the size at the last division), $s_{t^{**}}$ is the grandmother size, and so on, capturing potential memory effects across generations.
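As an illustration, this jump SDE can be simulated with a small Euler step and a Bernoulli draw for the Poisson increment. The growth law, cut size, and rate function below are hypothetical stand-ins chosen for the sketch, not the paper's fitted forms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ingredients (illustrative, not the paper's inferred functions):
g = lambda s: 0.03 * s              # exponential growth, g(s) = g1 * s
h = lambda s: 0.5 * s               # symmetric division: cut size is half the size
def lam(s, s_star):                 # toy division rate with mother-size memory
    return 0.5 * np.exp(2.0 * (s - 0.8 * s_star))

dt, T = 0.01, 500.0
s, s_star = 1.0, 1.0                # current size and last mother size
t = 0.0
ts, sizes = [0.0], [s]
while t < T:
    s += g(s) * dt                              # deterministic growth step
    if rng.random() < lam(s, s_star) * dt:      # division event ~ Poisson(lambda*dt)
        s_star = s                              # record mother size at division
        s -= h(s)                               # jump: subtract the cut size
    t += dt
    ts.append(t)
    sizes.append(s)
```

Sample paths from such a simulation are also how the learned models are validated against data later in the paper.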
The framework infers the functional forms of g, h, and λ from experimental cell-size trajectories.
- Inference of $g(s)$ and $h(s)$: These are inferred using standard linear regression on the continuous growth phases and on the size jumps at division, respectively. For the biological data, a linear growth rate $g(s) = g_0 + g_1 s$ and a linear cut size $h(s) = h_0 + h_1 s$ were found sufficient in many cases.
- Inference of $\lambda$: This is the central contribution. To model the potentially nonlinear and memory-dependent division rate $\lambda(s_t, s_{t^*})$, the paper proposes inferring $\ln\lambda$ by expanding it in a set of orthogonal basis functions $\theta_{ij}(s_t, s_{t^*}) = \phi_i(s_t)\,\psi_j(s_{t^*})$, where $\phi_i$ and $\psi_j$ are orthogonal polynomials constructed directly from the empirical distributions of $s_t$ and $s_{t^*}$. The coefficients $w_{ij}$ of this expansion:
$$\ln\lambda(s_t, s_{t^*}) = \sum_{i,j} w_{ij}\,\phi_i(s_t)\,\psi_j(s_{t^*})$$
are inferred using sparse Bayesian inference. This involves maximizing the posterior probability $P(w \mid \text{data})$, which combines a likelihood function derived from the Poisson process with a sparsity-promoting Gaussian prior on the weights $w_{ij}$. An Expectation-Maximization (EM) algorithm is used to iteratively estimate the weights and the prior variances.
- Model Selection: To avoid overfitting and select the most parsimonious model, a modified Bayesian Information Criterion (BIC) is used. Models with fewer, but more informative, terms in the basis expansion of lnλ are favored.
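One ingredient of this pipeline can be sketched compactly: building polynomials that are orthonormal under a dataset's empirical distribution, here via modified Gram-Schmidt on the monomials evaluated at the samples. This is a minimal sketch of the idea; the paper's exact construction may differ:

```python
import numpy as np

def empirical_orthopoly(samples, degree):
    """Evaluate polynomials phi_0..phi_degree at `samples`, made orthonormal
    under the empirical inner product <f, g> = mean(f(x) * g(x))."""
    V = np.vander(samples, degree + 1, increasing=True)  # columns x^0 .. x^degree
    basis = []
    for k in range(degree + 1):
        v = V[:, k].astype(float)
        for b in basis:
            v -= np.mean(v * b) * b          # remove empirical projection on earlier phi
        v /= np.sqrt(np.mean(v * v))         # normalize so <phi_k, phi_k> = 1
        basis.append(v)
    return np.stack(basis, axis=1)           # shape (n_samples, degree + 1)

rng = np.random.default_rng(1)
x = rng.lognormal(size=2000)                 # stand-in for observed cell sizes
Phi = empirical_orthopoly(x, 3)
G = Phi.T @ Phi / len(x)                     # empirical Gram matrix, ~ identity
```

Because orthogonality is taken with respect to the data's own distribution, the resulting basis concentrates its resolving power where observations actually lie.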
Applying this framework to mother-machine data for various species:
- Escherichia coli [10]: Found exponential growth and symmetric division. The inferred division rate $\lambda(s_t, s_{t^*})$ shows a significant nonlinear dependence on the mother size $s_{t^*}$. This indicates that cells with smaller mother sizes tend to divide faster relative to their size than linear-memory models predict, suggesting a mechanism that corrects size deviations more aggressively in smaller cells. Analysis of the joint distribution of mother and grandmother sizes confirms substantial memory beyond one generation, although adding two-generation memory did not significantly improve the model's BIC score for E. coli, suggesting one-generation memory is sufficient for the observed dynamics.
- Schizosaccharomyces pombe [13]: Found linear growth and symmetric division. The inferred $\lambda(s_t, s_{t^*})$ shows a weaker, more nearly linear dependence on $s_{t^*}$, consistent with previous studies suggesting a sizer-like mechanism with weak memory. Singular value decomposition of the mother-grandmother size distribution indicates much weaker memory than in E. coli.
- Dictyostelium discoideum [11]: Similar to E. coli, shows strong nonlinear memory of the mother size.
- Bacillus subtilis [12]: Exhibits weaker nonlinear memory, similar to S. pombe.
The paper proposes quantifying the degree of nonlinear memory by fitting a quadratic curve $s(s^*) \sim \alpha_1 s^* + \alpha_2 (s^*)^2$ to the boundary in the $(s, s^*)$ plane where the division rate $\lambda$ transitions from low (growth) to high (division). Linear-memory models correspond to $\alpha_2 = 0$. By plotting $(\alpha_1, \alpha_2)$ for different species, they show that E. coli and D. discoideum fall outside the region of conventional linear models (sizer, adder, timer), highlighting their substantial nonlinear memory.
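Such a boundary fit amounts to intercept-free least squares on the two features $s^*$ and $(s^*)^2$. A sketch with made-up boundary points (illustrative values, not the paper's data):

```python
import numpy as np

# Hypothetical boundary points: size at division s vs. mother size s*.
s_star = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
s_div  = np.array([1.9, 2.7, 3.4, 4.0, 4.5])   # ratio s/s* shrinks with s*

# Fit s(s*) = a1*s* + a2*(s*)^2 with no intercept, as in the quadratic ansatz.
A = np.column_stack([s_star, s_star**2])
(a1, a2), *_ = np.linalg.lstsq(A, s_div, rcond=None)
# A nonzero a2 signals nonlinear memory; linear-memory models have a2 = 0.
```

Here `a2` comes out negative, i.e. a sublinear dependence on the mother size, qualitatively like the strong-memory species in the paper.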
Practical Applications and Implementation:
- General Model Discovery: The core inference framework is generic and not limited to cell division. It can be applied to any system generating stochastic time series with discrete jump events where the rate of jumps may depend on past states.
- Examples Beyond Biology (SI Appendix): The framework is demonstrated on:
- Stack Overflow badge acquisition history (user activity data).
- Clinical visit history in an ICU (healthcare data).
- Earthquake occurrences (geoscience data).
- For these examples, the jump rate (badge acquisition rate, visit rate, earthquake rate) is modeled as depending on the waiting time since the last event and the waiting time between the two previous events. The framework successfully identifies sparse models that capture the statistics of these discrete events.
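The statistical core of these waiting-time models is the inhomogeneous-Poisson log-likelihood, $\sum_i \ln\lambda(\tau_i) - \sum_i \int_0^{\tau_i} \lambda(u)\,du$, summed over events. A minimal sketch with a simple exponential-affine rate in the waiting time (an assumed form for illustration, not the paper's inferred model):

```python
import numpy as np

def neg_log_lik(w, waits):
    """Negative log-likelihood of independent waiting times `waits` under a
    rate depending on time since the last event: lambda(tau) = exp(w0 + w1*tau)."""
    w0, w1 = w
    log_lam = w0 + w1 * waits                 # log rate at each event
    if abs(w1) < 1e-12:                       # constant-rate limit
        integral = np.exp(w0) * waits
    else:                                     # closed form of int_0^tau exp(w0 + w1*u) du
        integral = np.exp(w0) * (np.exp(w1 * waits) - 1.0) / w1
    return -(np.sum(log_lam) - np.sum(integral))

rng = np.random.default_rng(2)
waits = rng.exponential(scale=0.5, size=5000)   # synthetic data, true rate = 2
w0_hat = np.log(len(waits) / waits.sum())        # closed-form MLE when w1 = 0
```

In the paper's setting the exponent would instead be the sparse basis expansion in the waiting times, optimized with the Bayesian machinery described above.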
- Implementation Details:
- The approach relies on constructing orthogonal basis functions from the specific dataset's empirical distributions, which helps in capturing the relevant dynamics efficiently.
- Sparse Bayesian inference provides a principled way to handle noisy data and avoid overfitting by penalizing model complexity (number of terms in the basis expansion).
- The EM algorithm for hyperparameter inference and L-BFGS for optimization are standard numerical techniques.
- Model selection using a modified BIC allows for automated discovery of the best-fitting parsimonious model.
- The framework is shown to be robust to the choice of basis functions and regularization methods (Lasso, Ridge, Elastic Net).
- The paper mentions that the framework can be integrated with deep learning techniques (e.g., using neural networks to approximate lnλ and Adam optimizer), providing flexibility and the potential to leverage established machine learning libraries.
Implementation Considerations:
- Data Requirements: The framework requires high-resolution time-series data tracking the relevant state variables (e.g., cell size) and clearly identifiable jump events (e.g., divisions). Enough data is needed to reliably construct empirical distributions and orthogonal basis functions.
- Preprocessing: Data needs preprocessing to filter out anomalies (like chaining in cell data) and identify jump times and associated pre-jump states (mother size, etc.).
- Computational Cost: While sparse Bayesian inference and EM are generally efficient for moderate numbers of basis functions, the computational cost can increase with the complexity of the chosen basis and the size of the dataset. Scaling to very high-dimensional state spaces or extremely large datasets might require leveraging GPU acceleration or distributed computing, particularly if integrating with deep learning.
- Basis Function Choice: While the paper shows robustness to different basis types, selecting appropriate basis functions or their functional form (e.g., polynomial degree, kernel width) might require some domain knowledge or empirical tuning.
- Validation: The paper emphasizes validating the learned model by simulating it and comparing its statistics (e.g., division size distribution, generation time distribution, memory correlations) against the original experimental data.
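One simple way to score such a simulate-and-compare validation is a two-sample Kolmogorov-Smirnov distance between an observed statistic and its simulated counterpart. The distributions below are synthetic placeholders standing in for, e.g., division-size distributions:

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(3)
observed  = rng.lognormal(mean=0.0, sigma=0.2, size=4000)  # stand-in for data
simulated = rng.lognormal(mean=0.0, sigma=0.2, size=4000)  # a well-matched model
worse     = rng.lognormal(mean=0.5, sigma=0.2, size=4000)  # a mismatched model
d_same = ks_statistic(observed, simulated)
d_diff = ks_statistic(observed, worse)
```

A small distance for the fitted model and a large one for the mismatched alternative is the qualitative signature one looks for; in practice this would be repeated for each statistic (division sizes, generation times, memory correlations).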
In summary, this paper provides a powerful and flexible data-driven approach for discovering governing SDEs for stochastic jump processes. Its application to cell division data reveals previously underappreciated nonlinear memory effects across species, offering a richer understanding of cell size control. The generic formulation and demonstrated applicability to diverse real-world datasets highlight its broad potential beyond biological research.