Just-In-Time Software Defect Prediction
- JIT-SDP is a predictive analytics discipline that identifies bug-inducing commits using features like code churn, process metrics, and developer history.
- Its methods range from hand-crafted statistical models to state-of-the-art deep and transformer-based architectures for accurate defect prediction.
- The approach addresses challenges such as class imbalance, concept drift, and label noise through robust dataset construction and continuous model retraining.
Just-In-Time Software Defect Prediction (JIT-SDP) is a predictive analytics discipline focused on identifying software changes—typically at the granularity of individual commits—that are likely to introduce bugs, at or immediately after their integration into a codebase. This fine-grained, temporally immediate approach leverages features derived from code churn, process metrics, developer history, and, increasingly, rich semantic representations learned from code and text. Research in JIT-SDP has advanced from hand-crafted feature engineering to sophisticated deep and hybrid architectures, as well as robust statistical methods for handling class imbalance, concept drift, and data labeling noise.
1. Problem Formulation and Core Concepts
JIT-SDP models operate at the change (commit) level, predicting whether a single revision will introduce a bug. This sharply contrasts with traditional defect-prediction models, which assign risk to files or modules over longer release cycles (Keshavarz et al., 2022). Mathematically, given a stream of commits $c_1, c_2, \ldots$, where each $c_i$ is characterized by a multidimensional feature vector $x_i$ (capturing code, process, social, and contextual attributes), the goal is to learn a function $f(x_i) = P(y_i = 1 \mid x_i)$ that yields the probability that $c_i$ is bug-inducing ($y_i = 1$) or not ($y_i = 0$).
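The function $f$ can be anything from logistic regression to a transformer; as a minimal stdlib-only sketch, a hand-rolled logistic scorer over three illustrative features (the feature values, weights, and bias below are invented for illustration, not learned from data):

```python
import math

def predict_defect_probability(features, weights, bias):
    """Logistic model: P(y = 1 | x) = sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative feature vector: log-scaled lines added, lines deleted,
# and developer experience (values invented, not from a real commit).
x = [4.2, 1.1, 0.3]
# Hypothetical weights: churn raises risk, experience lowers it.
w = [0.8, 0.5, -0.6]
p = predict_defect_probability(x, w, bias=-3.0)
label = 1 if p >= 0.5 else 0   # classify at the 0.5 threshold
```

In practice the threshold is tuned (e.g., for effort-aware recall) rather than fixed at 0.5.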
Canonical metrics include:
- Code change metrics: Lines added (LA), lines deleted (LD), number of files (NF), directories (ND), subsystems (NS) touched, change entropy $H = -\sum_k p_k \log_2 p_k$, where $p_k$ is the proportion of the commit's changed lines falling in file $k$.
- Process and historical metrics: Age, number of unique prior changes (NUC), prior developer experience (EXP, REXP, SEXP), and prior distinct developers (NDev).
- Effort and diffusion metrics: Change dispersion across files and codebase structure.
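The change entropy metric above is the Shannon entropy of how a commit's churn is spread across files; a minimal stdlib-only sketch:

```python
import math

def change_entropy(lines_changed_per_file):
    """Shannon entropy of a commit's churn distribution across files;
    higher values mean the change is scattered (empirically riskier)."""
    total = sum(lines_changed_per_file)
    if total == 0:
        return 0.0
    probs = [n / total for n in lines_changed_per_file if n > 0]
    return -sum(p * math.log2(p) for p in probs)

h = change_entropy([10, 10, 20])   # churn scattered across 3 files
h_focused = change_entropy([40])   # all churn in one file -> entropy 0
```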
Labels are typically generated via the SZZ algorithm, which traces fixing commits back to defect-inducing changes through annotation and diff analysis. This introduces inherent label noise, as SZZ is imperfect in practice (Keshavarz et al., 2022, Ng et al., 2021, Nam et al., 11 Sep 2025).
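The core SZZ tracing step can be illustrated with a toy in-memory blame map standing in for `git annotate` output (real implementations run blame per deleted line and layer temporal and semantic filters on top; all commit IDs and paths below are hypothetical):

```python
def szz_label(fixing_commits, blame):
    """Toy SZZ: every line deleted by a fixing commit is traced, via a
    (file, line) -> commit blame map, back to the commit that last
    touched it; those origin commits are flagged bug-inducing."""
    bug_inducing = set()
    for fix in fixing_commits:
        for file_line in fix["deleted_lines"]:
            origin = blame.get(file_line)
            if origin is not None:
                bug_inducing.add(origin)
    return bug_inducing

# Hypothetical history: commit c1 wrote both lines later removed by fix f1.
blame = {("util.c", 10): "c1", ("util.c", 11): "c1", ("main.c", 3): "c2"}
fixes = [{"id": "f1", "deleted_lines": [("util.c", 10), ("util.c", 11)]}]
inducers = szz_label(fixes, blame)
```

The imprecision of this line-to-origin heuristic (e.g., for cosmetic or dormant changes) is exactly the label noise the cited work seeks to filter.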
2. Dataset Construction and Preprocessing
Large-scale, rigorously processed datasets are essential for effective JIT-SDP model development and evaluation. ApacheJIT (Keshavarz et al., 2022) is exemplary: it comprises 106,674 commits (28,239 bug-inducing, 78,435 clean) from 14 Apache projects. Construction involves:
- Bug- and clean-commit mining: Map Jira "Bug-Fixed" issues to commits by parsing commit messages for issue references. Use the latest reference as the fixing commit.
- SZZ-based bug-inducing commit labeling: For each fixing commit, use `git annotate` on all deleted lines to link them to the historical commits that introduced them, then apply robust temporal, statistical, and semantic filters (e.g., outlier commit size, trivial changes via AST diffing).
- Class imbalance: Only 26% of commits are labeled bug-inducing, demanding careful handling in model selection and training.
- Feature extraction and cleaning: Codify all metrics for each commit; apply log transforms and correlation-based feature-pruning where appropriate (Ng et al., 2021). Outliers and trivial code changes are filtered to improve label precision.
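The log-transform and correlation-based pruning steps can be sketched in stdlib-only code (the feature values and the 0.8 threshold are illustrative, not from the cited datasets):

```python
import math

def log_transform(values):
    """log1p tames heavy-tailed churn metrics such as LA and LD."""
    return [math.log1p(v) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def prune_correlated(features, threshold=0.8):
    """Greedy pruning: keep a feature only if it is not highly
    correlated with any feature already kept."""
    kept = []
    for name, col in features.items():
        if all(abs(pearson(col, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

feats = {
    "LA": [10, 200, 35, 500, 3],   # lines added
    "LD": [8, 150, 30, 480, 2],    # lines deleted; nearly collinear with LA
    "NF": [4, 1, 5, 2, 3],         # files touched
}
kept = prune_correlated({k: log_transform(v) for k, v in feats.items()})
```

Here LD is dropped because it tracks LA almost perfectly after the transform, while NF survives.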
Key datasets: ApacheJIT (Keshavarz et al., 2022; Java, multi-project); ReDef (Nam et al., 11 Sep 2025; C/C++, revert-based, function-level, GPT-assisted triage, over 90% label precision); Kamei et al. (2013); Trac JIT (Ng et al., 2021).
3. Modeling Approaches and Learning Architectures
Hand-crafted and Machine Learning Models
Traditional methods build random forests, logistic regression, or SVMs over hand-crafted metrics (Keshavarz et al., 2022, Sahar et al., 2022, Bryan et al., 2021, Jahanshahi et al., 2021). Such models are robust, interpretable, and competitive—especially for small-to-medium datasets or weakly informative features. Feature importance can be surfaced using Gini importance, permutation importance, integrated gradients, and SHAP values (Haldar et al., 7 Nov 2024).
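Permutation importance is model-agnostic and easy to sketch; below, a trivial churn-threshold rule stands in for a trained random forest, and the rows and labels are invented:

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, n_features,
                           n_repeats=20, seed=0):
    """Importance of feature j = mean accuracy drop when column j
    is randomly shuffled across rows (model-agnostic)."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            col = [r[j] for r in rows]
            rng.shuffle(col)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
            drops.append(base - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Stand-in "model": flags a commit risky when lines added (feature 0) > 100.
model = lambda row: int(row[0] > 100)
rows = [[10, 1], [500, 2], [30, 1], [800, 3], [20, 2], [300, 1]]
labels = [0, 1, 0, 1, 0, 1]
imps = permutation_importance(model, rows, labels, n_features=2)
```

Shuffling the churn column hurts accuracy, while shuffling the unused second feature changes nothing, mirroring how churn dominates importance rankings in the literature.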
Deep and Hybrid Methods
DeepJIT and successors leverage CNN/LSTM architectures over tokenized code and commit message sequences, automatically learning latent semantic features (Ng et al., 2021, Zhou et al., 17 Mar 2024). Transformer-based architectures (CodeBERT, CodeT5+, UniXCoder) have become state-of-the-art, offering end-to-end representation learning from code diffs and natural language (Jiang et al., 15 Oct 2024, Guo et al., 2023). Bi-modal and multi-modal transformers—such as BiCC-BERT, SimCom++, MMTrans-JIT—integrate code diffs, commit messages, and tabular metrics to enhance predictive power and semantic coverage (Zhou et al., 17 Mar 2024, Jiang et al., 15 Oct 2024, Mohammad et al., 28 Feb 2025).
Example model comparison (test F1, AUC; (Jiang et al., 15 Oct 2024, Guo et al., 2023)):
| Model | F1 | AUC |
|---|---|---|
| DeepJIT | 0.293 | 0.775 |
| CodeBERTJIT | 0.42 | 0.80 |
| JIT-Fine | 0.431 | 0.881 |
| JIT-BiCC | 0.478 | 0.887 |
Concept Drift, Class Imbalance, and Incremental Learning
Modern JIT-SDP frameworks must address non-stationary data. Concept drift—nontrivial changes in the joint distribution $P(x, y)$ of features and labels—degrades static model accuracy over the course of long-lived software projects (Chitsazian et al., 2023, Zhao et al., 2023, Nam et al., 11 Sep 2025). Advanced methods include:
- Instance interpretation drift detection: Unsupervised windowed analysis of model explanation vectors to trigger retraining or adaptation (Chitsazian et al., 2023).
- Chronologically aware and incremental learning: Regular retraining with sliding windows, weighted sampling favoring recent data, and forecast+classification deep architectures (e.g., CPI-JIT/DeepICP: LSTM-based, with temporal sequence encoding and SMOTE-PC for concept-preserving class balancing) (Zhao et al., 2023, Jahanshahi et al., 2021).
- Continual fine-tuning of LLMs: Approaches such as CodeFlowLM employ continual fine-tuning, maintaining model relevance without scratch retraining (Monteiro et al., 28 Nov 2025).
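The sliding-window retraining idea above can be sketched as a chronological stream loop; the majority-class `train` stand-in and the synthetic drift at commit 6 are purely illustrative:

```python
from collections import deque

def sliding_window_stream(commits, window_size, retrain_every, train):
    """Chronology-aware loop: score each arriving commit with the current
    model, keep only the most recent `window_size` labeled commits, and
    refit every `retrain_every` arrivals."""
    window = deque(maxlen=window_size)
    model, preds = None, []
    for i, commit in enumerate(commits):
        if model is not None:
            preds.append(model(commit["features"]))
        window.append(commit)                 # oldest commits fall out
        if (i + 1) % retrain_every == 0:
            model = train(list(window))       # fit on recent history only
    return preds

# Stand-in trainer: predicts the window's majority label.
def train(window):
    majority = sum(c["label"] for c in window) > len(window) / 2
    return lambda features: int(majority)

# Synthetic drift: commits become bug-inducing from index 6 onward.
commits = [{"features": [i], "label": int(i >= 6)} for i in range(16)]
preds = sliding_window_stream(commits, window_size=4, retrain_every=4,
                              train=train)
```

Because old commits age out of the window, the model flips to the post-drift regime after a delay of roughly one window, which is the trade-off that weighted sampling and drift detectors aim to shorten.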
4. Evaluation Protocols, Metrics, and Calibration
Experimental setups typically include:
- Validation schemes: Within-project cross-validation (chronologically split), cross-project evaluation, and time-wise train-past/test-future splits.
- Performance metrics:
- Precision $= TP/(TP+FP)$
- Recall $= TP/(TP+FN)$
- $F_1$ score $= 2PR/(P+R)$, the harmonic mean of precision ($P$) and recall ($R$)
- ROC-AUC, PR-AUC, MCC, Brier score
- Effort-aware metrics (Recall@20% LOC, Initial False Alarms)
- Calibration assessment: Reliable probability calibration (ECE, MCE, reliability diagrams) is critical to avoid over-/under-confident signals and allow reliable defect-risk ranking for triage (Shahini et al., 16 Apr 2025).
| Model | ECE (OS/QT) | MCE (OS/QT) | Brier (OS/QT) |
|---|---|---|---|
| DeepJIT | 35/33% | 66/67% | 24/19% |
| CodeBERT4JIT | 12/8% | 70/74% | 12/8% |
| LApredict | 9/3% | 21/99% | 15/12% |
Post-hoc calibration (Platt scaling, Temperature scaling) can substantially reduce miscalibration for deep models.
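Temperature scaling divides validation logits by a scalar $T$ chosen to minimize negative log-likelihood; a grid-search sketch on an invented, overconfident validation set (production code would optimize $T$ on a held-out split):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(logits, labels, T):
    """Mean negative log-likelihood under temperature-scaled logits."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / T)
        total -= math.log(p + eps) if y == 1 else math.log(1.0 - p + eps)
    return total / len(logits)

def fit_temperature(logits, labels, grid=None):
    """Pick T > 0 minimizing validation NLL (simple grid search)."""
    grid = grid or [0.25 * k for k in range(1, 41)]   # 0.25 .. 10.0
    return min(grid, key=lambda T: nll(logits, labels, T))

# Invented, overconfident validation set: large |logits|, only ~70% correct.
logits = [4.0, 5.0, 4.5, -4.0, 4.2, -3.8, 4.8, 4.1, -4.4, 4.6]
labels = [1,   1,   0,    0,   1,    1,   1,   0,    0,   1]
T = fit_temperature(logits, labels)   # T > 1 softens the probabilities
```

Since $T$ rescales but never reorders scores, ranking metrics such as AUC are unchanged; only the probability estimates improve.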
5. Recent Advances: Multimodality, Graph Representation, and Online/HITL Systems
Multimodal and Hybrid Learning
State-of-the-art JIT-SDP models fuse information from code tokens, commit messages, and structured tabular features (Jiang et al., 15 Oct 2024, Mohammad et al., 28 Feb 2025). For instance, BiCC-BERT and MMTrans-JIT incorporate both commit message and code diff tokens, contextual tabular features (e.g., developer identifiers), and process them via transformer encoders and cross-modal attention or gating.
Graph-Based Methods
Contribution-graph-based ML reframes JIT-SDP as a problem of edge (change) classification in the developer–file network. Features based on node centrality, community labels, and node2vec embeddings yield remarkably higher F1 and MCC compared to classic tabular approaches, achieving F1 up to 77.55% (+152% relative over baselines) (Bryan et al., 2021).
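The edge-classification framing can be illustrated with degree features on a toy developer–file graph; plain degrees stand in here for the centrality, community, and node2vec features used in the cited work, and all names are hypothetical:

```python
from collections import defaultdict

def edge_features(changes):
    """Build a developer-file contribution graph from (dev, file) change
    records and derive simple per-edge features: the degrees of both
    endpoints of each change edge."""
    dev_deg, file_deg = defaultdict(int), defaultdict(int)
    for dev, file in changes:
        dev_deg[dev] += 1
        file_deg[file] += 1
    return [
        {"edge": (dev, file),
         "dev_degree": dev_deg[dev],       # how active the developer is
         "file_degree": file_deg[file]}    # how contested the file is
        for dev, file in changes
    ]

changes = [("alice", "core.c"), ("alice", "util.c"),
           ("bob", "core.c"), ("carol", "core.c")]
feats = edge_features(changes)
```

Each change edge then gets a feature vector for a downstream classifier (XGBoost in the cited study), in place of purely commit-local tabular metrics.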
Online, Human-in-the-Loop, and Information Retrieval Approaches
- Online and HITL JIT-SDP: Deploy prequential (test-then-train) learning with real-time QA feedback. HITL O-JIT-SDP integrates SQA staff validation for high-confidence labeling and supports continuous statistical testing with k-fold distributed bootstrap and Wilcoxon tests; this improves average G-mean by ~8% (Liu et al., 2023).
- IR-based JIT-SDP: IRJIT leverages BM25 similarity between new and historical code changes, predicting defectiveness via $k$-NN majority vote. This model is orders of magnitude faster than deep learners and provides line-level risk rankings for explainability, without any need for periodic retraining (Sahar et al., 2022).
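The prequential (test-then-train) protocol underlying online JIT-SDP can be sketched with a stand-in online learner; a running majority-class counter replaces a real incremental model, and the HITL confirmation step is abstracted into the label arriving with each commit:

```python
def prequential_eval(stream, model, update):
    """Test-then-train: each labeled commit is first scored by the
    current model, then used to update it."""
    correct = 0
    for features, label in stream:
        if model(features) == label:      # test on the incoming example...
            correct += 1
        update(features, label)           # ...then train on it
    return correct / len(stream)

# Stand-in online learner: predicts the majority class seen so far.
state = {"pos": 0, "neg": 0}
model = lambda features: int(state["pos"] > state["neg"])
def update(features, label):
    state["pos" if label == 1 else "neg"] += 1

stream = [([0.1], 0), ([0.9], 1), ([0.2], 0), ([0.8], 0), ([0.7], 1)]
acc = prequential_eval(stream, model, update)
```

Because every example is scored before it trains the model, prequential accuracy never leaks future information, which is why online JIT-SDP work reports it alongside G-mean.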
6. Granularity, Feature Engineering, and Limitations
JIT-SDP remains an inherently coarse-grained task in most deployed pipelines—defect labels are assigned per-commit, not per-file or per-hunk. This leads to high inspection effort, as flagged commits often span numerous unrelated files (Ng et al., 2021). Calls for fine-grained models advocate post-hoc feature extraction from full diffs and hierarchical prediction architectures able to rank files or hunks within commits.
Feature engineering continues to evolve. State-of-the-art models benefit from combining size, diffusion, workflow, AST-change, and review process metrics, with statistically significant boosts in MCC and effort-aware recall when using extended feature sets (Bludau et al., 2022). Feature importance analyses (integrated gradients, SHAP) consistently highlight code churn (LA, LD), entropy, and developer experience as dominant predictors (Haldar et al., 7 Nov 2024).
Label noise, stemming from SZZ-reliant labeling and poor handling of dormant or cross-commit bugs, remains a major challenge. Recent datasets such as ReDef employ revert-based anchors and multi-round LLM triage to achieve >90% label precision (Nam et al., 11 Sep 2025).
7. SOTA Benchmarks, Practical Implications, and Future Directions
Table: Selected SOTA Quantitative Results (F1/AUC, classification mode)
| Approach | F1 | AUC | Notes |
|---|---|---|---|
| JIT-BiCC | .478 | .887 | Bi-modal, code+msg+features (Jiang et al., 15 Oct 2024) |
| DeepJIT | .293 | .775 | CNN/LSTM, code+msg (Jiang et al., 15 Oct 2024) |
| CodeBERTJIT | .42 | .81 | Transformer, code+msg (Guo et al., 2023) |
| Graph-Based-XGB | .78 | — | Developer–file graph (Bryan et al., 2021) |
| IRJIT | .60 | — | IR, commit-level (Sahar et al., 2022) |
JIT-SDP systems should:
- Regularly retrain or adapt models to account for drift in project/team structure and feature importances (Jahanshahi et al., 2021, Zhao et al., 2023, Chitsazian et al., 2023).
- Fuse semantic and expert-crafted features for maximal recall, precision, and robustness (Zhou et al., 17 Mar 2024, Jiang et al., 15 Oct 2024, Mohammad et al., 28 Feb 2025).
- Employ highly curated datasets (e.g., ReDef) or robust labeling/triage to mitigate SZZ noise (Nam et al., 11 Sep 2025).
- Integrate model outputs into real-time code review and triage workflows, providing both risk scores and interpretable explanations to developers (Zhou et al., 17 Mar 2024, Sahar et al., 2022, Liu et al., 2023).
Open research questions include edit-semantic representation learning (to move beyond shallow diff cues), task generalization to new domains (e.g., feature-level, AST/hunk-level, cross-language), automated and explainable calibration, and self-supervised, multimodal architectures for extreme scarcity or online adaptation scenarios.
JIT-SDP constitutes a rapidly evolving research field at the intersection of empirical software engineering and representation learning, with increasing scope for deep learning, explainability, online adaptation, and human-centric improvement.