Generalized Motifs-Based Naïve Bayes Model
- The paper introduces a generalized Naïve Bayes framework that leverages multiple motif structures for accurate link sign prediction in signed graphs.
- It employs dual architectures—GMMNB and FGMNB—using role-based log-likelihood estimators to achieve superior AUC and accuracy over embedding-based methods.
- Motif coverage analysis highlights the practical significance of 3- and 4-node motifs in applications such as fraud detection and trust assessment.
A generalized multiple motifs-based Naïve Bayes model is a theoretically grounded probabilistic framework for predicting properties of complex networks—most notably, for link sign prediction in signed graphs. This approach systematically incorporates heterogeneous influences from local motif structures by quantifying differentiated roles of neighboring nodes or edges and aggregating information across multiple motif instances. Two principal architectures are used: a linear Naïve Bayes combination (GMMNB) treating motif-derived scores as independent evidence, and a feature-driven ensemble method (FGMNB) leveraging machine learning to integrate high-dimensional motif features for enhanced predictive performance. The methodology provides both interpretable motif-level statistics and robust, empirically validated predictive accuracy, surpassing established embedding-based baselines in benchmark evaluations (Ran et al., 28 Dec 2025).
1. Motif Structures and Role Functions in Signed Networks
In an undirected signed graph $G=(V,E,\sigma)$, where $V$ is the node set, $E$ is the set of links, and $\sigma: E \to \{+1,-1\}$ provides edge labels, a motif is defined as a small, connected subgraph whose arrangement of positive and negative edges is statistically overrepresented. For each candidate edge $(u,v)$ whose sign is to be inferred, the algorithm identifies all motif instances within a local window (typically covering 3- and 4-node configurations) that incorporate $(u,v)$. This captures balance-theoretic and status-theoretic structural regularities.
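The sketch below (not from the paper; the toy graph and the helper `three_node_motifs` are illustrative) shows one way to enumerate 3-node motif instances containing a candidate edge, using an adjacency-dictionary representation of a signed graph:

```python
# Minimal sketch: enumerate 3-node motif instances (signed triangles) that
# contain a candidate edge (u, v). The graph is a dict mapping each node to
# {neighbor: sign}, with signs in {+1, -1}.

signed_adj = {
    "a": {"b": +1, "c": +1, "d": -1},
    "b": {"a": +1, "c": -1},
    "c": {"a": +1, "b": -1, "d": +1},
    "d": {"a": -1, "c": +1},
}

def three_node_motifs(adj, u, v):
    """Return (w, sign(u,w), sign(v,w)) for every common neighbor w of u and v."""
    common = (set(adj[u]) & set(adj[v])) - {u, v}
    return [(w, adj[u][w], adj[v][w]) for w in sorted(common)]

print(three_node_motifs(signed_adj, "a", "c"))  # [('b', 1, -1), ('d', -1, 1)]
```

4-node motif instances would be gathered analogously by extending the search one hop further around the candidate edge.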
The classical single-motif Naïve Bayes (SMNB) approach assumes uniform influence across all neighboring nodes in a motif. The generalized model instead distinguishes two roles:
- Common Link (CL): The neighbor is directly linked to either $u$ or $v$.
- Common Node (CN): The neighbor is present in the motif but not directly linked to $u$ or $v$.
For each motif type $M_k$ and neighbor $w$ occupying a consistent structural role, the algorithm computes separate role-based likelihood estimators:

$$\hat{p}^{+}_{\mathrm{CL}}(w)=\frac{n^{+}_{\mathrm{CL}}(w)}{n^{+}_{\mathrm{CL}}(w)+n^{-}_{\mathrm{CL}}(w)},\qquad \hat{p}^{-}_{\mathrm{CL}}(w)=\frac{n^{-}_{\mathrm{CL}}(w)}{n^{+}_{\mathrm{CL}}(w)+n^{-}_{\mathrm{CL}}(w)},$$

where $n^{+}_{\mathrm{CL}}(w)$ and $n^{-}_{\mathrm{CL}}(w)$ are counts of positive and negative labelings of $w$ when $w$ is in the CL role, and similarly for CN (Ran et al., 28 Dec 2025).
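A minimal sketch of such role-based estimators, assuming simple frequency estimates with add-one smoothing (the smoothing choice and the helper name `role_likelihoods` are assumptions, not taken from the source):

```python
def role_likelihoods(pos_count, neg_count, smoothing=1.0):
    """Estimate the probability of a positive and a negative label for a neighbor
    occupying a given role (CL or CN), from its positive/negative label counts."""
    total = pos_count + neg_count + 2 * smoothing
    p_pos = (pos_count + smoothing) / total
    p_neg = (neg_count + smoothing) / total
    return p_pos, p_neg

# Example: neighbor w observed in the CL role with 8 positive and 2 negative labelings.
p_pos_cl, p_neg_cl = role_likelihoods(8, 2)
print(round(p_pos_cl, 3), round(p_neg_cl, 3))  # 0.75 0.25
```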
2. Single-Motif Naïve Bayes Prediction
Given the set $\Omega_k(u,v)$ of all motif instances around $(u,v)$ for predictor $M_k$, the model forms the posterior-odds score as a product of the role-based likelihood ratios:

$$R_k(u,v)=\prod_{w\in\Omega_k(u,v)}\frac{\hat{p}^{+}_{r(w)}(w)}{\hat{p}^{-}_{r(w)}(w)},\qquad s_k(u,v)=\log R_k(u,v)=\sum_{w\in\Omega_k(u,v)}\log\frac{\hat{p}^{+}_{r(w)}(w)}{\hat{p}^{-}_{r(w)}(w)},$$

where $r(w)\in\{\mathrm{CL},\mathrm{CN}\}$ is the role of neighbor $w$. This additive log-likelihood form encodes the cumulative evidence from all eligible motifs involving $(u,v)$ and its neighbors, modulated by their structural role within each motif (Ran et al., 28 Dec 2025).
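As an illustration of the additive form, the following sketch sums role-based log-likelihood ratios over neighbor occurrences; the prior-odds argument and the helper name `single_motif_score` are assumptions made for exposition:

```python
import math

def single_motif_score(neighbor_likelihoods, prior_pos=0.5, prior_neg=0.5):
    """Log posterior-odds score for a candidate edge under one motif type.
    `neighbor_likelihoods` is a list of (p_pos, p_neg) pairs, one per neighbor
    occurrence, using the role-based estimate for that neighbor's role."""
    score = math.log(prior_pos / prior_neg)
    for p_pos, p_neg in neighbor_likelihoods:
        score += math.log(p_pos / p_neg)
    return score

# Two mostly-positive CL neighbors and one mostly-negative CN neighbor.
print(round(single_motif_score([(0.8, 0.2), (0.7, 0.3), (0.4, 0.6)]), 3))  # 1.828
```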
3. Extension to Multiple Motifs: GMMNB and FGMNB
The model integrates features from multiple motifs via two main strategies:
- GMMNB (Generalized Multiple Motifs-based Naïve Bayes): Each of the $m$ distinct motif-derived predictors produces an individual log-likelihood score $s_k(u,v)$, and these are linearly combined:

  $$s_{\mathrm{GMMNB}}(u,v)=\sum_{k=1}^{m}s_k(u,v)+\log\frac{\pi^{+}}{\pi^{-}}$$

  Here, the prior-odds term $\log(\pi^{+}/\pi^{-})$ corrects for class imbalance, with $\pi^{+}$ and $\pi^{-}$ the fractions of positive and negative links (Ran et al., 28 Dec 2025).
- FGMNB (Feature-driven Generalized Motif-based Naïve Bayes): A 9-dimensional vector $\mathbf{x}(u,v)=\bigl(s_1(u,v),\dots,s_9(u,v)\bigr)$ (one score per motif, $M_1$ to $M_9$) is constructed for each candidate edge and passed into a machine-learning classifier (e.g., XGBoost) that can learn nonlinear feature interactions (see the sketch after this list):

  $$\hat{y}(u,v)=\frac{1}{1+e^{-F(\mathbf{x}(u,v))}},\qquad F(\mathbf{x})=\sum_{t=1}^{T}f_t(\mathbf{x}),$$

  where $\{f_t\}_{t=1}^{T}$ represents an ensemble of regression trees and the training objective is the logistic loss (Ran et al., 28 Dec 2025).
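A combined sketch of both strategies, following the formulas reconstructed above; the synthetic data, hyperparameters, and use of the `xgboost` scikit-learn wrapper are illustrative assumptions rather than the paper's exact configuration:

```python
import math
import numpy as np
from xgboost import XGBClassifier  # assumption: xgboost's scikit-learn wrapper is available

def gmmnb_score(motif_scores, frac_pos, frac_neg):
    """GMMNB: sum the per-motif log-likelihood scores and add a prior-odds term
    that corrects for class imbalance."""
    return sum(motif_scores) + math.log(frac_pos / frac_neg)

print(round(gmmnb_score([1.2, -0.3, 0.5], frac_pos=0.9, frac_neg=0.1), 3))  # 3.597

# FGMNB: treat the nine per-motif scores as a feature vector and fit a tree ensemble.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))               # stand-in for 9-dimensional motif-score features
y = (X[:, :3].sum(axis=1) > 0).astype(int)  # stand-in labels, for illustration only
clf = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, eval_metric="logloss")
clf.fit(X, y)
print(clf.predict_proba(X[:2])[:, 1])       # predicted probabilities of a positive sign
```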
4. Algorithmic Workflow and Pseudocode
The FGMNB procedure applies the following protocol for sign prediction:
- Preprocess the signed network (removing ambiguous edges, ensuring undirectedness).
- For each train/test split:
- Construct balanced train/test sets by edge sampling.
- Extract candidate $(u,v)$ pairs and enumerate all motif instances they participate in.
- Compute role-based log-likelihood ratios for each motif instance, forming the feature vector $\mathbf{x}(u,v)$.
- Train XGBoost on the resulting feature-label pairs.
- Apply the trained model to the test set and record performance.
- Repeat across multiple random splits to estimate mean and variance of metrics.
At inference, the same motif scoring and feature construction is applied, followed by prediction via the XGBoost ensemble (Ran et al., 28 Dec 2025).
| Step | Operation | Output |
|---|---|---|
| Preprocessing | Remove ambiguous/bidirectional edges, binarize labels | Cleaned graph |
| Motif extraction | Find all 3-/4-node motifs containing each candidate edge $(u,v)$ | Motif instances |
| Feature gen. | Compute log-likelihood ratios from role-based counts | Feature vectors $\mathbf{x}(u,v)$ |
| Training | XGBoost on $(\mathbf{x}(u,v), y(u,v))$ pairs | Trained classifier |
| Prediction | Apply classifier to test instances | Sign prediction |
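A sketch of the repeated-split evaluation loop, assuming the motif-score feature matrix `X` and balanced labels `y` have already been produced by the preceding steps; the split ratio, number of splits, and XGBoost hyperparameters here are illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def evaluate_fgmnb(X, y, n_splits=100, seed=0):
    """Repeat random train/test splits, train XGBoost on motif features, and
    report mean/std of AUC and Accuracy over all splits."""
    aucs, accs = [], []
    for i in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed + i)
        clf = XGBClassifier(n_estimators=100, max_depth=3, eval_metric="logloss")
        clf.fit(X_tr, y_tr)
        p = clf.predict_proba(X_te)[:, 1]
        aucs.append(roc_auc_score(y_te, p))
        accs.append(accuracy_score(y_te, (p >= 0.5).astype(int)))
    return (np.mean(aucs), np.std(aucs)), (np.mean(accs), np.std(accs))

# Toy usage with synthetic features; real use would pass the motif-score matrix.
X_demo = np.random.default_rng(1).normal(size=(300, 9))
y_demo = (X_demo[:, 0] > 0).astype(int)
print(evaluate_fgmnb(X_demo, y_demo, n_splits=5))
```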
5. Experimental Validation and Quantitative Performance
The approach was empirically evaluated on four large signed networks:
- BitcoinAlpha ($3,783$ nodes, $14,124$ edges, predominantly positive)
- BitcoinOTC ($5,881$ nodes, $21,492$ edges, predominantly positive)
- Wiki-RfA ($11,221$ nodes, $171,761$ edges, predominantly positive)
- Slashdot ($82,140$ nodes, $500,481$ edges, predominantly positive)
Balanced sampling (equal numbers of positive and negative links) ensures robust evaluation. Across 100 random splits, GMMNB outperforms all competing embedding-based baselines, and FGMNB further improves AUC and Accuracy. For instance, on BitcoinAlpha, FGMNB achieves AUC $0.851$, outperforming DNE-SBP ($0.722$) and SGCN. Notably, FGMNB's margin over the DNE-SBP baseline exceeds $0.12$ AUC on BitcoinAlpha and $0.08$ AUC on BitcoinOTC (see the table below). Only on Slashdot does SE-SGformer achieve a higher AUC ($0.974$), but FGMNB generalizes better across networks (Ran et al., 28 Dec 2025).
| Method | BitcoinAlpha AUC | BitcoinOTC AUC | Wiki-RfA AUC | Slashdot AUC |
|---|---|---|---|---|
| DNE-SBP | 0.722 ± 0.02 | 0.836 ± 0.01 | 0.760 ± 0.03 | 0.812 ± 0.02 |
| GMMNB | 0.802 ± 0.02 | 0.903 ± 0.03 | 0.819 ± 0.04 | 0.866 ± 0.02 |
| FGMNB | 0.851 ± 0.02 | 0.920 ± 0.01 | 0.853 ± 0.02 | 0.894 ± 0.01 |
6. Motif Selection and Coverage Analysis
"Motif coverage" is defined as the fraction of all links in that participate in at least one instance of motif :
Empirically, AUC under GSMNB-CL increases monotonically with motif coverage: motifs with the highest coverage are typically the most predictive. On BitcoinAlpha, a 4-node motif achieves AUC $0.814$; on BitcoinOTC, Wiki-RfA, and Slashdot, a bridge-plus-balanced-triangle motif is dominant. A plausible implication is that practical feature engineering for sign prediction should prioritize motif types that maximize both coverage and discriminative AUC. The results further support the sufficiency of 3-node motifs in trust-dense environments, while 4-node motifs are essential in balanced community structures (Ran et al., 28 Dec 2025).
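A minimal sketch of the coverage computation defined above, using a toy edge list and a single triangle motif instance (the function name `motif_coverage` is illustrative):

```python
def motif_coverage(edges, motif_instances):
    """Fraction of distinct edges that participate in at least one motif instance.
    `edges` is an iterable of (u, v) pairs; `motif_instances` is an iterable of
    node tuples, one per instance of the motif under consideration. Here an edge
    'participates' if both of its endpoints lie in an instance's node set."""
    instance_nodesets = [frozenset(inst) for inst in motif_instances]
    distinct_edges = {frozenset(e) for e in edges}
    covered = {e for e in distinct_edges
               if any(e <= nodes for nodes in instance_nodesets)}
    return len(covered) / len(distinct_edges)

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
triangles = [("a", "b", "c")]
print(motif_coverage(edges, triangles))  # 0.75: three of the four edges lie in the triangle
```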
7. Implications and Practical Significance
The generalized multiple motifs-based Naïve Bayes model delivers four major contributions:
- A Naïve Bayes formulation with explicit modeling of heterogeneous neighbor influences via structurally-defined role functions.
- Integration of multiple motif-derived log-likelihood scores—either linearly (GMMNB) or via nonlinear feature-driven ensembles (FGMNB).
- Empirically validated superiority over embedding-based methods for sign prediction across large real-world signed graphs, with robust gains in both AUC and Accuracy.
- Motif coverage and discriminative power provide actionable metrics for motif selection—informing scalable feature engineering in network analysis and sign prediction tasks.
These findings provide a systematic and interpretable framework applicable to trust assessment, fraud detection, and other relational inference scenarios in complex networks (Ran et al., 28 Dec 2025).