Deep Synthetic Cross-project SRGM
- The paper introduces a deep synthetic framework that integrates SRGMs, synthetic data generation, and deep learning to enhance defect prediction.
- Methodology involves generating synthetic defect sequences, applying cross-correlation filtering, and training stacked LSTM models on similar data clusters.
- Empirical evaluation on 60 datasets shows up to 23% improvement in MAPE and significant RMSE/MAE reductions, confirming the approach's practical efficacy.
Deep Synthetic Cross-project Software Reliability Growth Modeling (DSC-SRGM) is an advanced methodology for predicting software reliability growth in scenarios where defect discovery data are scarce or incomplete. By integrating traditional software reliability growth models (SRGMs), synthetic data generation, cross-correlation-based dataset selection, and deep learning, DSC-SRGM provides a principled, transferable framework for forecasting defect accumulation trends across diverse software projects. This paradigm is notable for its ability to circumvent common obstacles of proprietary or insufficient failure data by leveraging synthetic time series whose statistical properties have been constructed to closely emulate real-world defect discovery processes (Kim et al., 21 Sep 2025).
1. Theoretical Foundation
The core theoretical underpinning of DSC-SRGM is the synthesis of defect discovery curves via classical SRGM formulations, followed by selective deep learning model training using data filtered according to sequential correlation with the target project's observed trends. Traditional SRGMs such as Goel–Okumoto (GO), Yamada Delayed S-Shaped (YDSS), Inflection S-Shaped (ISS), and Generalized Goel (GG) models are employed to generate cumulative defect curves characterized by non-linear, monotonic growth, parameterized by interpretable software process metrics (e.g., the expected total defect count $a$ and the detection rate $b$). For a given SRGM, the prototypical formula is:
$m(t) = a \cdot (1 - e^{-b t}) \tag{GO model}$
where $m(t)$ denotes the cumulative number of defects at process time $t$, $a$ is the expected total defect count, and $b$ is the detection rate. Similar parametric forms are used for the other SRGMs, with additional parameters (e.g., the inflection factor in ISS, the shape exponent in GG) sampled from plausible ranges to capture variability observed in empirical datasets.
Synthetic defect sequences are additionally perturbed with low-variance Gaussian noise to mimic real-world recording uncertainty,

$\tilde{m}(t) = m(t) + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2),$

with subsequent monotonicity enforcement,

$\hat{m}(t) = \max\big(\hat{m}(t-1), \tilde{m}(t)\big),$

to ensure admissibility as cumulative defect time series.
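The generation of a single synthetic sequence can be sketched as follows. The GO parameters, noise scale, and horizon here are illustrative, and `synthesize_sequence` is a hypothetical helper, not the paper's implementation; the noise level of 0.1% of $a$ follows the value the paper reports as empirically optimal.

```python
import numpy as np

def go_curve(t, a=100.0, b=0.05):
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b t))."""
    return a * (1.0 - np.exp(-b * t))

def synthesize_sequence(a=100.0, b=0.05, sigma=0.001, horizon=200, seed=0):
    """Generate one synthetic cumulative-defect sequence:
    GO curve + low-variance Gaussian noise, then admissibility enforcement."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, horizon + 1)
    m = go_curve(t, a, b)
    noisy = m + rng.normal(0.0, sigma * a, size=m.shape)  # noise scaled to 0.1% of a
    # Clip to non-negative, then take a running maximum to restore monotonicity.
    admissible = np.maximum.accumulate(np.clip(noisy, 0.0, None))
    return t, admissible
```

The running maximum (`np.maximum.accumulate`) is one simple way to realize the monotonicity enforcement step: any noise-induced dip is replaced by the largest value seen so far, so the sequence remains a valid cumulative defect count.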
2. Synthetic Dataset Generation
DSC-SRGM's synthetic data pipeline systematically generates a large population of defect trend time series reflecting diverse but realistic reliability trajectories. Major steps include:
- Uniform random selection of an SRGM (GO, YDSS, ISS, GG) for each simulation instance.
- Parameter initialization: total defect count $a$ (commonly set to 100), detection rate $b$ drawn from a log-uniform distribution, and other SRGM-specific parameters from uniform distributions, chosen to span the space of plausible defect growth behaviors.
- Sequence truncation using a 95% defect coverage threshold ($m(t) \geq 0.95\,a$) or a hard time cutoff to manage practical sequence lengths and eliminate prolonged asymptotic tails.
- Noise injection as above, followed by non-negativity and monotonicity adjustments.
By executing hundreds or thousands of such runs, DSC-SRGM builds a repository of statistically diverse synthetic reliability growth curves, all subject to the structural constraints of established SRGM forms.
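The pipeline above can be sketched as follows. The sampling ranges, the extra shape/inflection parameter `c`, and the helper `generate_dataset` are illustrative assumptions; the paper's exact parameter distributions are not reproduced here.

```python
import numpy as np

# The four SRGM mean value functions; c is an extra shape/inflection
# parameter that GO and YDSS simply ignore.
MODELS = {
    "GO":   lambda t, a, b, c: a * (1 - np.exp(-b * t)),
    "YDSS": lambda t, a, b, c: a * (1 - (1 + b * t) * np.exp(-b * t)),
    "ISS":  lambda t, a, b, c: a * (1 - np.exp(-b * t)) / (1 + c * np.exp(-b * t)),
    "GG":   lambda t, a, b, c: a * (1 - np.exp(-b * t ** c)),
}

def generate_dataset(n_curves=500, a=100.0, t_max=500, sigma=0.001, seed=42):
    """Build a pool of synthetic reliability growth curves (illustrative ranges)."""
    rng = np.random.default_rng(seed)
    curves = []
    for _ in range(n_curves):
        name = rng.choice(list(MODELS))            # uniform SRGM selection
        b = 10 ** rng.uniform(-2.5, -0.5)          # log-uniform detection rate
        c = rng.uniform(0.5, 3.0)                  # SRGM-specific parameter
        t = np.arange(1, t_max + 1)
        m = MODELS[name](t, a, b, c)
        cut = np.searchsorted(m, 0.95 * a)         # 95% coverage threshold,
        m = m[: max(cut, 10)]                      # else the t_max hard cutoff
        m = m + rng.normal(0, sigma * a, size=m.shape)
        m = np.maximum.accumulate(np.clip(m, 0, None))  # admissibility
        curves.append(m)
    return curves
```

Because every curve is truncated at 95% coverage (or at the hard time cutoff when the growth is too slow to reach it), the pool avoids the long asymptotic tails that would otherwise dominate training.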
3. Similarity Filtering and Cluster-based Selection
To address cross-project differences and avoid negative transfer, a cross-correlation similarity analysis is conducted to filter synthetic datasets prior to model training. For time-aligned defect count sequences $x$ and $y$, the lagged cross-correlation is defined as:

$\rho_{xy}(\tau) = \dfrac{\sum_{t} (x_t - \bar{x})(y_{t+\tau} - \bar{y})}{\sqrt{\sum_{t} (x_t - \bar{x})^2}\,\sqrt{\sum_{t} (y_{t+\tau} - \bar{y})^2}}$

where $\tau$ is a temporal lag and $\bar{x}$, $\bar{y}$ are the sequence means. The maximum value over lags, $\max_{\tau} \rho_{xy}(\tau)$, is adopted as the overall sequence similarity score. A cross-correlation matrix is computed for all (synthetic, target project) pairs, and K-means clustering (with the number of clusters $k$ selected via the elbow method) partitions the dataset pool. Only synthetic time series grouped in the same cluster as the target project are retained for training, ensuring that deep models are exposed primarily to synthetic data with highly similar temporal properties.
Table 1. Cross-correlation Clustering Process
| Step | Method | Output |
|---|---|---|
| Similarity computation | Max cross-correlation over lags | Similarity matrix |
| Clustering | K-means ($k$ via elbow method) | Partitioned synthetic and real datasets |
| Dataset selection | Matching target’s cluster | Synthetic datasets for training |
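The scoring and selection steps above can be sketched as follows, assuming a standard normalized cross-correlation and scikit-learn's K-means; `max_cross_correlation` and `select_similar` are hypothetical helpers, not the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def max_cross_correlation(x, y, max_lag=5):
    """Maximum normalized cross-correlation of x and y over lags in [-max_lag, max_lag]."""
    best = -1.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            xs, ys = x[: len(x) - lag], y[lag:]
        else:
            xs, ys = x[-lag:], y[: len(y) + lag]
        n = min(len(xs), len(ys))
        xs, ys = xs[:n] - xs[:n].mean(), ys[:n] - ys[:n].mean()
        denom = np.sqrt((xs ** 2).sum() * (ys ** 2).sum())
        if denom > 0:
            best = max(best, float((xs * ys).sum() / denom))
    return best

def select_similar(synthetic, target, k=3, max_lag=5):
    """Keep only synthetic curves in the same K-means cluster as the target."""
    pool = synthetic + [target]
    sim = np.array([[max_cross_correlation(p, q, max_lag) for q in pool] for p in pool])
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(sim)
    return [s for s, lab in zip(synthetic, labels[:-1]) if lab == labels[-1]]
```

Clustering the rows of the similarity matrix, rather than the raw sequences, groups curves by how they correlate with every other curve in the pool, so the target's cluster contains exactly the synthetic data with the most similar temporal behavior.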
4. Deep Learning Model Training and Prediction
DSC-SRGM utilizes a stacked Long Short-Term Memory (LSTM) network for learning temporal dependencies in defect trends:
- Input: sliding window of 8 previous time steps of cumulative defects (Min–Max normalized per instance) as model features.
- Model architecture: multiple stacked LSTM layers (128 hidden units per layer); dropout applied for regularization.
- Training: supervised learning on synthetic datasets selected as above, using next-step prediction as the target.
- Inference: recursive forecasting, where predicted defects for the next time point are denormalized and appended, serving as future model input.
This setup enables multi-step extrapolation of cumulative defect discovery in the target project, relying on cross-project generalization from surrogate synthetic data.
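The input windowing and recursive forecasting loop can be sketched as follows. A trivial persistence predictor stands in for the trained stacked LSTM, so this shows only the data flow (per-instance Min–Max normalization, 8-step windows, denormalize-and-append inference), not the learned model; the helper names are hypothetical.

```python
import numpy as np

WINDOW = 8  # sliding window of 8 previous time steps, per the paper

def make_windows(seq):
    """Per-instance Min-Max normalization, then (window, next-step) training pairs."""
    lo, hi = seq.min(), seq.max()
    norm = (seq - lo) / (hi - lo)
    X = np.stack([norm[i : i + WINDOW] for i in range(len(norm) - WINDOW)])
    y = norm[WINDOW:]
    return X, y, (lo, hi)

def recursive_forecast(model, seq, steps):
    """Forecast `steps` future cumulative-defect values, feeding each
    normalized prediction back in as input and denormalizing for output."""
    lo, hi = seq.min(), seq.max()
    window = list((seq[-WINDOW:] - lo) / (hi - lo))
    out = []
    for _ in range(steps):
        pred = model(np.array(window[-WINDOW:]))
        window.append(pred)                 # recursive input
        out.append(pred * (hi - lo) + lo)   # denormalized output
    return np.array(out)

# Placeholder predictor; the actual method trains a stacked LSTM
# (128 units per layer, dropout) on the selected synthetic windows.
naive_model = lambda w: float(w[-1])
```

In the full method, `model` would be the stacked LSTM's one-step prediction; the surrounding normalization and recursion are unchanged.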
5. Empirical Evaluation and Performance
In evaluations on 60 real-world defect datasets (early-phase reliability prediction, i.e., training on first 50% of data), DSC-SRGM achieved substantial quantitative improvements. When compared to best-fit traditional SRGM (selected by minimal mean squared error) and a deep cross-project model trained on only real-world data (DC-SRGM), DSC-SRGM resulted in:
- Up to 23.3% improvement in Mean Absolute Percentage Error (MAPE)
- RMSE and MAE reductions of approximately 13.7%–14.8% versus the best traditional SRGM
- RMSE and MAE reductions by 32.1% and 32.2% compared to DC-SRGM
Hybrid approaches that naively combined synthetic and real data (Hybrid-SRGM) did not yield further improvements and could degrade predictive accuracy. Statistical tests (Wilcoxon Signed-Rank, Friedman) were employed to confirm the significance of observed advantages (Kim et al., 21 Sep 2025).
6. Limitations and Parameter Sensitivities
While DSC-SRGM exhibits strong performance in data-limited contexts, effectiveness hinges on several factors:
- Synthetic dataset quantity: Excess generation may expand the similarity cluster beyond optimal bounds, introducing irrelevant or redundant information that impairs model fidelity.
- Noise and distribution balance: Injection of excessive or insufficient stochasticity distorts trend realism; 0.1% Gaussian noise was empirically found optimal.
- Termination criteria: Choices for sequence truncation (95% defect coverage or a hard time cutoff) prevent unrealistic long tails and maintain dataset diversity.
- Clustering granularity: The choice of the cluster count $k$ in K-means clustering impacts both coverage (ensuring similar-enough training data) and overfitting risk.
A plausible implication is that nuanced tuning of these parameters is essential for generalization, particularly when transferring across heterogeneous project domains.
7. Synthesis, Impact, and Future Research
DSC-SRGM demonstrates that, under controlled synthetic generation and principled similarity-based selection, deep learning models can achieve reliable software defect prediction in environments where real-world defect data are insufficient or unavailable. This approach offers significant advantages in early-stage, cross-project, and data-protected settings.
Future work is suggested in the following directions (Kim et al., 21 Sep 2025):
- Exploring advanced synthetic data generation with constraints or sophisticated parametric sampling.
- Enhancing clustering and similarity metrics for even finer matching between synthetic and real project data.
- Investigating hybrid strategies for merging synthetic and empirical data without introducing adverse interactions.
- Extending input and model architectures to accommodate complex, multi-phase defect discovery patterns.
In summary, Deep Synthetic Cross-project SRGM constitutes a robust and extensible framework for software reliability growth modeling, blending established statistical techniques, synthetic data engineering, and deep temporal modeling to overcome longstanding practical barriers in reliability estimation.