
Feature-driven Generalized Motif-based NB (FGMNB)

Updated 4 January 2026
  • The paper introduces FGMNB, which extends traditional motif-based Naïve Bayes with feature-driven ensemble learning to achieve state-of-the-art graph classification.
  • It quantifies motif capability via intrinsic AUC upper bounds and selects high-performing motif features to improve scalability and efficiency.
  • Experimental results on social bot detection and signed network prediction demonstrate superior accuracy and robustness compared to baseline methods.

Feature-driven Generalized Motif-based Naïve Bayes (FGMNB) is a theoretically grounded, high-dimensional motif-based discriminative learning framework for graph-structured classification tasks, notably social bot identification and link sign prediction. By extending classic motif-based Naïve Bayes models, FGMNB incorporates both structural heterogeneity and feature-driven ensemble learning, achieving state-of-the-art results across critical network benchmarks (Ran et al., 28 Dec 2025, Ran et al., 28 Dec 2025).

1. Motif Definitions and Feature Construction

FGMNB operates on graphs $G(N, L, W)$, with node set $N$, directed or undirected link set $L$, and a node-label or link-sign function $W$. For social bot detection, $W: N \rightarrow \{h, b\}$ indicates human or bot nodes. For signed network inference, $W: L \rightarrow \{\pm 1\}$ records positive or negative edge labels.

Motifs are defined as fixed-size subgraph patterns:

  • Homogeneous motifs ($M_i$): Ignore node labels, considering only topology. For bot detection, 30 possible 3-node motifs are itemized as first-order, second-order, and closed structures, with one node marked as the target.
  • Heterogeneous motifs ($Y_j$): Refine $M_i$ by incorporating the assigned labels of neighbors, yielding 114 distinct labeled 3-node motif types. Each motif encodes a topological pattern plus node-label context.
  • For sign prediction (Ran et al., 28 Dec 2025): 3- and 4-node motifs are constructed around target links $(A, B)$ whose sign is unlabeled; neighbor roles are explicitly distinguished.

Each instance of $Y_j$ is mapped to a binary feature (present/absent for a node or link), allowing construction of high-dimensional motif feature vectors ($\mathbb{R}^m$ with $m = 114$ for bot detection, $m = 9$ for signed link prediction).
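As a minimal sketch of this mapping, the snippet below classifies each 3-node triad around a target node by topology (open wedge vs. closed triangle) plus the neighbors' labels, then emits a binary feature vector. The motif inventory and helper names are illustrative assumptions; the paper's full set of 114 heterogeneous motif types is not reproduced here.

```python
# Sketch: map a node's labeled 3-node motif occurrences to a binary
# feature vector. The motif inventory below is illustrative, not the
# paper's full set of 114 heterogeneous motif types.

def motif_feature_vector(node, graph, labels, motif_ids):
    """Return a 0/1 vector: entry j is 1 iff motif type j occurs around `node`."""
    present = set()
    neighbors = graph[node]
    for b in neighbors:
        for c in neighbors:
            if b >= c:
                continue  # visit each unordered neighbor pair once
            # Topology: closed triangle if the neighbors are also linked,
            # open wedge otherwise; refine with the neighbors' labels.
            closed = c in graph[b]
            motif = ("closed" if closed else "open",) + tuple(
                sorted((labels[b], labels[c]))
            )
            present.add(motif)
    return [1 if m in present else 0 for m in motif_ids]

# Toy graph as adjacency sets; labels 'h' (human) / 'b' (bot).
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
labels = {0: "h", 1: "b", 2: "h", 3: "b"}
motif_ids = [
    ("open", "b", "b"),
    ("open", "b", "h"),
    ("closed", "b", "h"),
    ("closed", "h", "h"),
]
vec = motif_feature_vector(0, graph, labels, motif_ids)  # [1, 1, 1, 0]
```

In the full framework this binarization is applied per node (or per target link), producing the $\mathbb{R}^{114}$ (or $\mathbb{R}^9$) feature vectors described above.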

2. Generalized Naïve Bayes Likelihood Estimation

FGMNB formalizes the probability of a class assignment (bot/human, sign $\pm$) via motif-structured likelihood ratios for motif types $Y_i$ and node $A$ (or link $\ell = (A, B)$).

Bayesian updating is applied:

$$P(\text{bot} \mid S_i(A)) = \frac{P(\text{bot})\, P(S_i(A) \mid \text{bot})}{P(S_i(A))}, \qquad P(\text{human} \mid S_i(A)) = \frac{P(\text{human})\, P(S_i(A) \mid \text{human})}{P(S_i(A))}$$

Assuming conditional independence:

$$r_A = \frac{P(\text{bot} \mid S_i(A))}{P(\text{human} \mid S_i(A))} = \frac{P(\text{bot})}{P(\text{human})} \prod_{(B,C) \in S_i(A)} \frac{P((B,C) \mid \text{bot})}{P((B,C) \mid \text{human})}$$

For signed networks, role functions $r_1(A, B; M, S_i)$ and $r_2(A, B; M, S_i)$ differentiate direct and indirect neighbor impacts by counting motif instances directly connected to the target link versus those not touching $(A, B)$. Laplace (+1) smoothing mitigates zero-count artifacts.
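The ratio $r_A$ above can be sketched directly from empirical counts with Laplace (+1) smoothing. The count tables and pattern names below are hypothetical placeholders, not values from the paper:

```python
def likelihood_ratio(pairs, counts_bot, counts_human, n_bot, n_human):
    """Naive-Bayes ratio r_A for a node A with neighbor pairs S_i(A).

    counts_bot[p] / counts_human[p]: how often pair-pattern p occurs
    around known bots / humans. Laplace (+1) smoothing keeps every
    conditional probability strictly positive.
    """
    ratio = n_bot / n_human  # prior odds P(bot) / P(human)
    for p in pairs:
        p_bot = (counts_bot.get(p, 0) + 1) / (n_bot + 2)
        p_human = (counts_human.get(p, 0) + 1) / (n_human + 2)
        ratio *= p_bot / p_human
    return ratio

# Hypothetical counts: the pattern "wedge_bb" is seen around 30 of 40
# bots but only 5 of 60 humans, so observing it pushes r_A above 1.
r = likelihood_ratio(["wedge_bb"], {"wedge_bb": 30}, {"wedge_bb": 5}, 40, 60)
# r > 1  ->  classify A as bot
```

A decision rule then thresholds at $r_A = 1$ (or $\log r_A = 0$); in FGMNB these per-motif log-ratios become inputs to the ensemble rather than the final verdict.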

3. Motif Capability Quantification and Feature Selection

To theoretically bound the discriminative power of each motif feature, the maximum achievable AUC is established:

$$\text{AUC}'_{\text{upper}} = p_1 + (1 - p_1) p_2 + \frac{1}{2}(1 - p_1)(1 - p_2) = \frac{1}{2} + \frac{p_1 + p_2 - p_1 p_2}{2}$$

where $p_1$ is the fraction of bots (or positive links) possessing the motif and $p_2$ is the corresponding fraction for non-bots (or negative links). This upper bound measures the intrinsic separability induced by the feature, guiding selection of a motif subset with maximal classification utility.

Motif selection proceeds by sorting motifs by $\text{AUC}'_{\text{upper}}$ and retaining those above a chosen threshold (e.g., $\tau = 0.7$). Computational costs scale as $O(m)$ for coverage estimation and $O(m \log m)$ for sorting, where $m$ is the motif type count.
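The scoring-and-selection step can be sketched in a few lines; the coverage dictionary below uses made-up $(p_1, p_2)$ values for illustration:

```python
def auc_upper(p1, p2):
    """Intrinsic AUC upper bound of a binary motif feature.

    p1: fraction of positives (bots) possessing the motif,
    p2: corresponding fraction for negatives.
    Implements AUC'_upper = 1/2 + (p1 + p2 - p1*p2) / 2.
    """
    return 0.5 + (p1 + p2 - p1 * p2) / 2

def select_motifs(coverage, tau=0.7):
    """Rank motifs by AUC upper bound and keep those above tau.

    coverage: {motif_id: (p1, p2)}. Scoring is O(m), sorting O(m log m).
    """
    scored = sorted(coverage.items(), key=lambda kv: -auc_upper(*kv[1]))
    return [m for m, (p1, p2) in scored if auc_upper(p1, p2) > tau]

# Illustrative coverage values: "m1" is highly discriminative, "m2" is not.
kept = select_motifs({"m1": (0.9, 0.1), "m2": (0.2, 0.1)})  # ["m1"]
```

With $\tau = 0.7$, motifs whose intrinsic separability cannot exceed 0.7 AUC are discarded before ensemble training, which is the feature-reduction step evaluated in Section 5.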

4. Feature-driven Ensemble Learning Framework

FGMNB departs from traditional Naïve Bayes by leveraging high-dimensional motif features and data-driven weighting. The workflow:

  1. For each instance (node or link), compute motif occurrence statistics and log-likelihood ratios per motif type, forming a feature vector.
  2. Train an XGBoost classifier (or, alternatively, Random Forest or Gradient Boosting) using cross-validated logistic loss, with grid search for hyperparameter optimization (learning rate, tree depth, regularization).
  3. Class balance and early stopping are emphasized for robust convergence.

FGMNB enables arbitrary nonlinear interactions among features, in contrast with the additive structure of multivariate Naïve Bayes (GMMNB), substantially improving discriminative accuracy.
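The workflow above can be sketched with scikit-learn, using `GradientBoostingClassifier` as a stand-in for XGBoost (named in the paper as an alternative ensemble) and a synthetic feature matrix in place of the real motif log-likelihood-ratio features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in for per-instance motif log-likelihood-ratio features
# (9 columns, mirroring the signed-link feature dimension m = 9).
X = rng.normal(size=(400, 9))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Cross-validated grid search over the hyperparameters the text lists
# (learning rate, tree depth); AUC is the model-selection criterion.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1, 0.2], "max_depth": [3, 4]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best_auc = grid.best_score_  # cross-validated AUC of the best configuration
```

This is a sketch, not the authors' released pipeline; in practice the class-balanced splits and early stopping mentioned in step 3 would be added via the estimator's validation options.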

5. Experimental Architectures and Performance Benchmarking

Datasets

FGMNB is validated on multiple large-scale network datasets:

Social Bot Detection (Ran et al., 28 Dec 2025):

| Dataset | Nodes | Links | Bots |
|---|---|---|---|
| Cresci-15 | 1,741 | 6,214 | 636 |
| MGTAB | 9,443 | 425,863 | 2,475 |
| TwiBot-20 | 205,730 | 227,477 | 6,589 |
| TwiBot-22 | 693,761 | 3,711,903 | 81,432 |

Signed Network Prediction (Ran et al., 28 Dec 2025):

| Dataset | Nodes | Links | Pos/Neg (%) |
|---|---|---|---|
| BitcoinAlpha | 3,783 | 14,124 | 90/10 |
| BitcoinOTC | 5,881 | 21,492 | 86/14 |
| Wiki-RfA | 11,221 | 171,761 | 77/23 |
| Slashdot | 82,140 | 500,481 | 76/24 |

Results

FGMNB outperforms baseline models on nearly every benchmark—including unsupervised (Ising, SybilWalk, SybilSCAR), feature-based (Botometer, FP, ARG), and deep methods (DeeProBot, T5, BotRGCN, RGT for bot detection; DNE-SBP, SGCN, SiGAT, SDGNN, SE-SGformer for sign prediction):

Bot detection (Ran et al., 28 Dec 2025):

| Dataset | Accuracy | Precision | Recall | F₁ | AUC |
|---|---|---|---|---|---|
| Cresci-15 | 0.987±0.01 | 0.976±0.02 | 1.000 | 0.988 | 0.992±0.01 |
| MGTAB | 0.826±0.02 | 0.805±0.02 | 0.860 | 0.831 | 0.907±0.02 |
| TwiBot-20 | 0.827±0.01 | 0.812±0.01 | 0.921 | 0.863 | 0.914±0.01 |
| TwiBot-22 | 0.874±0.00 | 0.870±0.00 | 0.873 | 0.872 | 0.942±0.00 |

Sign prediction (Ran et al., 28 Dec 2025):

| Signed Data | Best Baseline AUC | GMMNB AUC | FGMNB AUC |
|---|---|---|---|
| BitcoinAlpha | 0.722±0.02 | 0.802±0.02 | 0.851±0.02 |
| BitcoinOTC | 0.836±0.01 | 0.903±0.03 | 0.920±0.01 |
| Wiki-RfA | 0.800±0.10 | 0.819±0.04 | 0.853±0.02 |
| Slashdot | 0.974±0.01 | 0.866±0.02 | 0.894±0.01 |

Selecting only high-capability motifs ($\text{AUC}'_{\text{upper}} > 0.7$) yields performance nearly identical to the full motif set, demonstrating efficient feature reduction.

6. Theoretical Insights and Consistency Properties

FGMNB achieves optimality under motif conditional independence, with ratio estimators converging to true class likelihoods as sample size increases. The framework generalizes Naïve Bayes at two levels: motif heterogeneity and nonlinear feature integration. The relaxation of additivity via ensemble methods such as XGBoost increases predictive accuracy, at the cost of interpretability.

Role function decomposition (direct vs. indirect neighbor impact) further corrects the homogeneity assumption of previous motif-based classifiers, enabling finer discrimination of local network structure (Ran et al., 28 Dec 2025).

7. Practical Implementation Guidelines

Motif enumeration is the principal computational bottleneck, with complexity scaling as $O(|N| \cdot k!)$ for discovery of $k$-node motifs and $O(m)$ for feature extraction, where $m$ is the number of motif types considered.
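As a sketch of the enumeration step for the 3-node case, the function below walks each node's neighbor pairs and reports every connected triad once, distinguishing wedges from triangles. The adjacency-set representation and function name are illustrative assumptions:

```python
from itertools import combinations

def enumerate_triads(graph):
    """Enumerate connected 3-node subgraphs (wedges and triangles).

    `graph` is an adjacency-set dict {node: set(neighbors)}. Each triad is
    keyed by its sorted node tuple so it is reported once; this local
    neighbor-pair scan is the dominant cost of motif discovery.
    """
    triads = {}
    for a, nbrs in graph.items():
        for b, c in combinations(sorted(nbrs), 2):
            key = tuple(sorted((a, b, c)))
            # Closed triangle if the outer pair is also linked, else wedge.
            triads[key] = "triangle" if c in graph[b] else "wedge"
    return triads

# Toy undirected graph: nodes 0-1-2 form a triangle, node 3 hangs off 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
triads = enumerate_triads(graph)
# {(0, 1, 2): "triangle", (0, 2, 3): "wedge", (1, 2, 3): "wedge"}
```

Precomputing these triads once per graph, then labeling them with $W$ to obtain heterogeneous motif counts, is what makes the offline/parallel precomputation advice below worthwhile.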

For training:

  • Use 10-fold cross-validation with balanced class splits.
  • Apply grid search for hyperparameter selection (learning rate 0.05–0.2, tree depth 3–6, regularization $\lambda = 1$–$10$).
  • Precompute motif features in parallel or offline for scalability.
  • Feature selection guided by motif coverage and capability.

FGMNB extends naturally to accommodate auxiliary node-level or global features via feature concatenation, with downstream classification remaining efficient and scalable.


FGMNB synthesizes motif-based topological context with statistical learning for node- and link-level classification. By quantifying heterogeneous neighborhood preference and integrating theoretically bounded feature selection, it enables robust, scalable, and generalizable discrimination in complex networks (Ran et al., 28 Dec 2025, Ran et al., 28 Dec 2025).
