
Deep Synthetic Cross-project SRGM

Updated 24 September 2025
  • The paper introduces a deep synthetic framework that integrates SRGMs, synthetic data generation, and deep learning to enhance defect prediction.
  • Methodology involves generating synthetic defect sequences, applying cross-correlation filtering, and training stacked LSTM models on similar data clusters.
  • Empirical evaluation on 60 datasets shows up to 23% improvement in MAPE and significant RMSE/MAE reductions, confirming the approach's practical efficacy.

Deep Synthetic Cross-project Software Reliability Growth Modeling (DSC-SRGM) is an advanced methodology for predicting software reliability growth in scenarios where defect discovery data are scarce or incomplete. By integrating traditional software reliability growth models (SRGMs), synthetic data generation, cross-correlation-based dataset selection, and deep learning, DSC-SRGM provides a principled, transferable framework for forecasting defect accumulation trends across diverse software projects. This paradigm is notable for its ability to circumvent common obstacles of proprietary or insufficient failure data by leveraging synthetic time series whose statistical properties have been constructed to closely emulate real-world defect discovery processes (Kim et al., 21 Sep 2025).

1. Theoretical Foundation

The core theoretical underpinning of DSC-SRGM is the synthesis of defect discovery curves via classical SRGM formulations, followed by selective deep learning model training using data filtered according to sequential correlation with the target project's observed trends. Traditional SRGMs such as Goel–Okumoto (GO), Yamada Delayed S-Shaped (YDSS), Inflection S-Shaped (ISS), and Generalized Goel (GG) models are employed to generate cumulative defect curves characterized by non-linear, monotonic growth, parameterized by interpretable software process metrics (e.g., expected total defect count $a$, detection rate $b$). For a given SRGM, the prototypical formula is:

$m(t) = a \cdot (1 - e^{-b t}) \tag{GO model}$

where $m(t)$ denotes cumulative defects at process time $t$, $a$ is the expected total defect count, and $b$ is the detection rate. Similar parametric forms are used for other SRGMs, with additional parameters (e.g., $r$ in ISS, $c$ in GG) sampled from plausible ranges to capture variability observed in empirical datasets.
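As a concrete illustration, the GO mean value function above can be evaluated directly; this is a minimal sketch (the default parameter values here are illustrative, not prescribed by the paper):

```python
import math

def go_model(t, a=100.0, b=0.05):
    """Goel-Okumoto mean value function: m(t) = a * (1 - exp(-b * t)),
    the expected cumulative defect count at process time t."""
    return a * (1.0 - math.exp(-b * t))

# A cumulative defect curve over 100 time steps: monotonically
# increasing and saturating toward the total defect count a.
curve = [go_model(t) for t in range(100)]
```

The curve starts at zero and approaches $a$ asymptotically, which is why the truncation rules described in Section 2 are needed to cut off the long tail.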

Synthetic defect sequences are additionally perturbed with low-variance Gaussian noise to mimic real-world recording uncertainty:

$T_{\text{noisy}}(t) = T(t) \cdot (1 + \epsilon(t)), \quad \epsilon(t) \sim \mathcal{N}(0, 0.001^2)$

with subsequent non-negativity and monotonicity enforcement:

$T_{\text{final}}(t) = \max\left(0,\, T_{\text{noisy}}(t),\, T_{\text{final}}(t-1)\right)$

to ensure admissibility as defect time series.
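The two post-processing steps above can be sketched as follows; this is a minimal implementation of the noise injection and monotonicity enforcement formulas (the `seed` parameter is added here only for reproducibility):

```python
import random

def perturb(curve, sigma=0.001, seed=0):
    """Apply multiplicative Gaussian noise N(0, sigma^2) to a clean SRGM
    curve, then enforce non-negativity and monotonicity by taking the
    running maximum max(0, T_noisy(t), T_final(t-1))."""
    rng = random.Random(seed)
    noisy = [v * (1.0 + rng.gauss(0.0, sigma)) for v in curve]
    final, prev = [], 0.0
    for v in noisy:
        prev = max(0.0, v, prev)
        final.append(prev)
    return final
```

Because the running maximum never decreases, the output is always an admissible cumulative defect sequence even when the noise momentarily pushes a value below its predecessor.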

2. Synthetic Dataset Generation

DSC-SRGM's synthetic data pipeline systematically generates a large population of defect trend time series reflecting diverse but realistic reliability trajectories. Major steps include:

  • Uniform random selection of an SRGM (GO, YDSS, ISS, GG) for each simulation instance.
  • Parameter initialization: cumulative defect parameter $a$ (commonly set to 100), detection rate $b$ drawn log-uniformly from $(0.0001, 1.0]$, and other SRGM-specific parameters from uniform distributions, chosen to span the space of plausible defect growth behaviors.
  • Sequence truncation using a 95% defect coverage threshold ($T(t) \geq 0.95a$) or a hard time cutoff ($t = 512$) to manage practical sequence lengths and eliminate prolonged asymptotic tails.
  • Noise injection as above, followed by non-negativity and monotonicity adjustments.

By executing hundreds or thousands of such runs, DSC-SRGM builds a repository of statistically diverse synthetic reliability growth curves, all subject to the structural constraints of established SRGM forms.
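The generation steps above can be sketched as a single sampling routine. This is a hedged illustration: the GO, YDSS, ISS, and GG mean value functions are standard, but the sampling range for the extra ISS/GG parameter (`p` below) is an assumption, as the paper only states that SRGM-specific parameters come from uniform distributions:

```python
import math
import random

# Mean value functions; p stands in for the model-specific extra
# parameter (r in ISS, c in GG) and is ignored by GO and YDSS.
MODELS = {
    "GO":   lambda t, a, b, p: a * (1 - math.exp(-b * t)),
    "YDSS": lambda t, a, b, p: a * (1 - (1 + b * t) * math.exp(-b * t)),
    "ISS":  lambda t, a, b, p: a * (1 - math.exp(-b * t)) / (1 + p * math.exp(-b * t)),
    "GG":   lambda t, a, b, p: a * (1 - math.exp(-b * (t ** p))),
}

def sample_curve(rng, a=100.0, t_max=512, coverage=0.95):
    """One synthetic run: pick an SRGM uniformly, sample b log-uniformly
    over (0.0001, 1.0], and truncate at 95% coverage or t = 512."""
    name = rng.choice(sorted(MODELS))
    b = 10 ** rng.uniform(-4, 0)      # log-uniform over (0.0001, 1.0]
    p = rng.uniform(0.5, 3.0)         # assumed range for the extra parameter
    f = MODELS[name]
    curve = []
    for t in range(1, t_max + 1):
        curve.append(f(t, a, b, p))
        if curve[-1] >= coverage * a:
            break
    return name, curve
```

Repeating `sample_curve` hundreds or thousands of times (before noise injection) yields the diverse repository described above.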

3. Similarity Filtering and Cluster-based Selection

To address cross-project differences and avoid negative transfer, a cross-correlation similarity analysis is conducted to filter synthetic datasets prior to model training. For time-aligned defect count sequences $x(t)$ and $y(t)$, the lagged cross-correlation is defined as:

$CC_{x,y}(\tau) = \frac{\sum_{t} (x(t) - \bar{x})\,(y(t+\tau) - \bar{y})}{\sqrt{\sum_{t} (x(t) - \bar{x})^2}\,\sqrt{\sum_{t} (y(t+\tau) - \bar{y})^2}}$

where $\tau$ is a temporal lag and $\bar{x}, \bar{y}$ are the sequence means. The maximum value over lags, $\max_{\tau} CC_{x,y}(\tau)$, is adopted as the overall sequence similarity score. A cross-correlation matrix is computed for all (synthetic, target project) pairs, and K-means clustering (with $K$ selected via the elbow method, e.g., $K=3$) partitions the dataset pool. Only synthetic time series grouped in the same cluster as the target project are retained for training, ensuring that deep models are exposed primarily to synthetic data with highly similar temporal properties.

Table 1. Cross-correlation Clustering Process

| Step | Method | Output |
| --- | --- | --- |
| Similarity computation | Max cross-correlation over lags | Similarity matrix |
| Clustering | K-means (typically $K=3$) | Partitioned synthetic and real datasets |
| Dataset selection | Matching the target's cluster | Synthetic datasets for training |
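The similarity score in the first step of Table 1 can be computed directly from the definition. This is a minimal sketch; the `max_lag` search range is an assumption, as the paper does not state which lags are scanned:

```python
import math

def max_cross_correlation(x, y, max_lag=5):
    """Similarity score between two defect-count sequences: the maximum
    normalized cross-correlation CC_{x,y}(tau) over lags in [-max_lag, max_lag]."""
    def cc(tau):
        # Pair x(t) with y(t + tau) over the overlapping index range.
        pairs = [(x[t], y[t + tau]) for t in range(len(x))
                 if 0 <= t + tau < len(y)]
        if len(pairs) < 2:
            return -1.0  # too little overlap to correlate
        xs = [a for a, _ in pairs]
        ys = [b for _, b in pairs]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((a - mx) * (b - my) for a, b in pairs)
        den = (math.sqrt(sum((a - mx) ** 2 for a in xs))
               * math.sqrt(sum((b - my) ** 2 for b in ys)))
        return num / den if den else 0.0
    return max(cc(tau) for tau in range(-max_lag, max_lag + 1))
```

These pairwise scores populate the similarity matrix that K-means then clusters.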

4. Deep Learning Model Training and Prediction

DSC-SRGM utilizes a stacked Long Short-Term Memory (LSTM) network for learning temporal dependencies in defect trends:

  • Input: sliding window of 8 previous time steps of cumulative defects (Min–Max normalized per instance) as model features.
  • Model architecture: multiple stacked LSTM layers (128 hidden units per layer); dropout applied for regularization.
  • Training: supervised learning on synthetic datasets selected as above, using next-step prediction as the target.
  • Inference: recursive forecasting, where predicted defects for the next time point are denormalized and appended, serving as future model input.

This setup enables multi-step extrapolation of cumulative defect discovery in the target project, relying on cross-project generalization from surrogate synthetic data.
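The windowing and recursive inference loop can be sketched independently of any particular network. In this illustration a stand-in `predict` callable takes the place of the stacked LSTM, and the per-instance Min–Max normalization/denormalization is omitted for brevity:

```python
WINDOW = 8  # eight previous cumulative-defect values per input, as in the paper

def make_windows(series):
    """Supervised (input window, next value) pairs for next-step training."""
    return [(series[i:i + WINDOW], series[i + WINDOW])
            for i in range(len(series) - WINDOW)]

def forecast(predict, history, steps):
    """Recursive multi-step inference: each prediction is appended to the
    window and fed back as input for the following step."""
    window = list(history[-WINDOW:])
    out = []
    for _ in range(steps):
        y = predict(window)         # the stacked LSTM in DSC-SRGM
        out.append(y)
        window = window[1:] + [y]   # slide the window forward
    return out
```

Because each prediction becomes part of the next input, forecast errors can compound over the horizon, which is one reason the similarity filtering of Section 3 matters.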

5. Empirical Evaluation and Performance

In evaluations on 60 real-world defect datasets (early-phase reliability prediction, i.e., training on first 50% of data), DSC-SRGM achieved substantial quantitative improvements. When compared to best-fit traditional SRGM (selected by minimal mean squared error) and a deep cross-project model trained on only real-world data (DC-SRGM), DSC-SRGM resulted in:

  • Up to 23.3% higher accuracy in Mean Absolute Percentage Error (MAPE)
  • RMSE and MAE reductions of approximately 13.7%–14.8% versus the best traditional SRGM
  • RMSE and MAE reductions by 32.1% and 32.2% compared to DC-SRGM

Hybrid approaches that naively combined synthetic and real data (Hybrid-SRGM) did not yield further improvements and could degrade predictive accuracy. Statistical tests (Wilcoxon Signed-Rank, Friedman) were employed to confirm the significance of observed advantages (Kim et al., 21 Sep 2025).

6. Limitations and Parameter Sensitivities

While DSC-SRGM exhibits strong performance in data-limited contexts, effectiveness hinges on several factors:

  • Synthetic dataset quantity: Excess generation may expand the similarity cluster beyond optimal bounds, introducing irrelevant or redundant information that impairs model fidelity.
  • Noise and distribution balance: Injection of excessive or insufficient stochasticity distorts trend realism; 0.1% Gaussian noise was empirically found optimal.
  • Termination criteria: sequence truncation choices (95% completion, $t = 512$) prevent unrealistic long tails and maintain dataset diversity.
  • Clustering granularity: the choice of $K$ in K-means clustering impacts both coverage (ensuring similar-enough training data) and overfitting risk.

A plausible implication is that nuanced tuning of these parameters is essential for generalization, particularly when transferring across heterogeneous project domains.

7. Synthesis, Impact, and Future Research

DSC-SRGM demonstrates that, under controlled synthetic generation and principled similarity-based selection, deep learning models can achieve reliable software defect prediction in environments where real-world defect data are insufficient or unavailable. This approach offers significant advantages in early-stage, cross-project, and data-protected settings.

Future work is suggested in the following directions (Kim et al., 21 Sep 2025):

  • Exploring advanced synthetic data generation with constraints or sophisticated parametric sampling.
  • Enhancing clustering and similarity metrics for even finer matching between synthetic and real project data.
  • Investigating hybrid strategies for merging synthetic and empirical data without introducing adverse interactions.
  • Extending input and model architectures to accommodate complex, multi-phase defect discovery patterns.

In summary, Deep Synthetic Cross-project SRGM constitutes a robust and extensible framework for software reliability growth modeling, blending established statistical techniques, synthetic data engineering, and deep temporal modeling to overcome longstanding practical barriers in reliability estimation.
