
Feature-driven Generalized Motif-based NB (FGMNB)

Updated 4 January 2026
  • The paper introduces FGMNB, which extends traditional motif-based Naïve Bayes with feature-driven ensemble learning to achieve state-of-the-art graph classification.
  • It quantifies motif capability via intrinsic AUC upper bounds and selects high-performing motif features to improve scalability and efficiency.
  • Experimental results on social bot detection and signed network prediction demonstrate superior accuracy and robustness compared to baseline methods.

Feature-driven Generalized Motif-based Naïve Bayes (FGMNB) is a theoretically grounded, high-dimensional motif-based discriminative learning framework for graph-structured classification tasks, notably social bot identification and link sign prediction. By extending classic motif-based Naïve Bayes models, FGMNB incorporates both structural heterogeneity and feature-driven ensemble learning, achieving state-of-the-art results across critical network benchmarks (Ran et al., 28 Dec 2025, Ran et al., 28 Dec 2025).

1. Motif Definitions and Feature Construction

FGMNB operates on graphs $G(N, L, W)$, with node set $N$, directed or undirected link set $L$, and a node-label or link-sign function $W$. For social bot detection, $W: N \rightarrow \{h, b\}$ indicates human or bot nodes. For signed network inference, $W: L \rightarrow \{\pm 1\}$ records positive or negative edge labels.

Motifs are defined as fixed-size subgraph patterns:

  • Homogeneous motifs ($M_i$): Ignore node labels, considering only topology. For bot detection, 30 possible 3-node motifs are itemized as first-order, second-order, and closed structures, with one node marked as the target.
  • Heterogeneous motifs ($Y_j$): Refine $M_i$ by incorporating the assigned labels of neighbors, yielding 114 distinct labeled 3-node motif types. Each motif encodes a topological pattern plus node-label context.
  • For sign prediction (Ran et al., 28 Dec 2025): 3- and 4-node motifs are constructed around target links $(A, B)$ whose sign is unlabeled; neighbor roles are explicitly distinguished.

Each instance of $Y_j$ is mapped to a binary feature (present/absent for a node or link), allowing construction of high-dimensional motif feature vectors ($\mathbb{R}^m$ with $m = 114$ for bot detection, $m = 9$ for signed link prediction).
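As a minimal sketch of this mapping, the snippet below classifies each 3-node triad around a target node by topology (open wedge vs. closed triangle) plus the neighbors' labels, then emits a binary feature vector. The motif inventory and helper names are illustrative assumptions; the paper's full set of 114 heterogeneous motif types is not reproduced here.

```python
# Sketch: map a node's labeled 3-node motif occurrences to a binary
# feature vector. The motif inventory below is illustrative, not the
# paper's full set of 114 heterogeneous motif types.

def motif_feature_vector(node, graph, labels, motif_ids):
    """Return a 0/1 vector: entry j is 1 iff motif type j occurs around `node`."""
    present = set()
    neighbors = graph[node]
    for b in neighbors:
        for c in neighbors:
            if b >= c:
                continue  # visit each unordered neighbor pair once
            # Topology: closed triangle if the neighbors are also linked,
            # open wedge otherwise; refine with the neighbors' labels.
            closed = c in graph[b]
            motif = ("closed" if closed else "open",) + tuple(
                sorted((labels[b], labels[c]))
            )
            present.add(motif)
    return [1 if m in present else 0 for m in motif_ids]

# Toy graph as adjacency sets; labels 'h' (human) / 'b' (bot).
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
labels = {0: "h", 1: "b", 2: "h", 3: "b"}
motif_ids = [
    ("open", "b", "b"),
    ("open", "b", "h"),
    ("closed", "b", "h"),
    ("closed", "h", "h"),
]
vec = motif_feature_vector(0, graph, labels, motif_ids)  # [1, 1, 1, 0]
```

In the full framework this binarization is applied per node (or per target link), producing the $\mathbb{R}^{114}$ (or $\mathbb{R}^9$) feature vectors described above.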

2. Generalized Naïve Bayes Likelihood Estimation

FGMNB formalizes the probability of a class assignment (bot/human, sign $\pm$) via motif-structured likelihood ratios for motif types $Y_i$ and node $A$ (or link $\ell = (A, B)$).

Bayesian updating is applied:

$$P(\text{bot} \mid S_i(A)) = \frac{P(\text{bot})\, P(S_i(A) \mid \text{bot})}{P(S_i(A))}, \qquad P(\text{human} \mid S_i(A)) = \frac{P(\text{human})\, P(S_i(A) \mid \text{human})}{P(S_i(A))}$$

Assuming conditional independence:

$$r_A = \frac{P(\text{bot} \mid S_i(A))}{P(\text{human} \mid S_i(A))} = \frac{P(\text{bot})}{P(\text{human})} \prod_{(B,C) \in S_i(A)} \frac{P((B,C) \mid \text{bot})}{P((B,C) \mid \text{human})}$$

For signed networks, role functions $r_1(A, B; M, S_i)$ and $r_2(A, B; M, S_i)$ differentiate direct and indirect neighbor impacts by counting motif instances directly connected to the target link versus those not touching $(A, B)$. Laplace (+1) smoothing mitigates zero-count artifacts.
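The ratio $r_A$ above can be sketched directly from empirical counts with Laplace (+1) smoothing. The count tables and pattern names below are hypothetical placeholders, not values from the paper:

```python
def likelihood_ratio(pairs, counts_bot, counts_human, n_bot, n_human):
    """Naive-Bayes ratio r_A for a node A with neighbor pairs S_i(A).

    counts_bot[p] / counts_human[p]: how often pair-pattern p occurs
    around known bots / humans. Laplace (+1) smoothing keeps every
    conditional probability strictly positive.
    """
    ratio = n_bot / n_human  # prior odds P(bot) / P(human)
    for p in pairs:
        p_bot = (counts_bot.get(p, 0) + 1) / (n_bot + 2)
        p_human = (counts_human.get(p, 0) + 1) / (n_human + 2)
        ratio *= p_bot / p_human
    return ratio

# Hypothetical counts: the pattern "wedge_bb" is seen around 30 of 40
# bots but only 5 of 60 humans, so observing it pushes r_A above 1.
r = likelihood_ratio(["wedge_bb"], {"wedge_bb": 30}, {"wedge_bb": 5}, 40, 60)
# r > 1  ->  classify A as bot
```

A decision rule then thresholds at $r_A = 1$ (or $\log r_A = 0$); in FGMNB these per-motif log-ratios become inputs to the ensemble rather than the final verdict.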

3. Motif Capability Quantification and Feature Selection

To theoretically bound the discriminative power of each motif feature, the maximum achievable AUC is established:

$$\text{AUC}'_{\text{upper}} = p_1 + (1 - p_1) p_2 + \frac{1}{2}(1 - p_1)(1 - p_2) = \frac{1}{2} + \frac{p_1 + p_2 - p_1 p_2}{2}$$

where $p_1$ is the fraction of bots (or positive links) possessing the motif and $p_2$ is the corresponding fraction for non-bots (or negative links). This upper bound measures the intrinsic separability induced by the feature, guiding selection of a motif subset with maximal classification utility.

Motif selection proceeds by sorting motifs by $\text{AUC}'_{\text{upper}}$ and retaining those above a chosen threshold (e.g., $\tau = 0.7$). Computational costs scale as $O(m)$ for coverage estimation and $O(m \log m)$ for sorting, where $m$ is the motif type count.
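The scoring-and-selection step can be sketched in a few lines; the coverage dictionary below uses made-up $(p_1, p_2)$ values for illustration:

```python
def auc_upper(p1, p2):
    """Intrinsic AUC upper bound of a binary motif feature.

    p1: fraction of positives (bots) possessing the motif,
    p2: corresponding fraction for negatives.
    Implements AUC'_upper = 1/2 + (p1 + p2 - p1*p2) / 2.
    """
    return 0.5 + (p1 + p2 - p1 * p2) / 2

def select_motifs(coverage, tau=0.7):
    """Rank motifs by AUC upper bound and keep those above tau.

    coverage: {motif_id: (p1, p2)}. Scoring is O(m), sorting O(m log m).
    """
    scored = sorted(coverage.items(), key=lambda kv: -auc_upper(*kv[1]))
    return [m for m, (p1, p2) in scored if auc_upper(p1, p2) > tau]

# Illustrative coverage values: "m1" is highly discriminative, "m2" is not.
kept = select_motifs({"m1": (0.9, 0.1), "m2": (0.2, 0.1)})  # ["m1"]
```

With $\tau = 0.7$, motifs whose intrinsic separability cannot exceed 0.7 AUC are discarded before ensemble training, which is the feature-reduction step evaluated in Section 5.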

4. Feature-driven Ensemble Learning Framework

FGMNB departs from traditional Naïve Bayes by leveraging high-dimensional motif features and data-driven weighting. The workflow:

  1. For each instance (node or link), compute motif occurrence statistics and log-likelihood ratios per motif type, forming a feature vector.
  2. Train an XGBoost classifier (or, alternatively, Random Forest or Gradient Boosting) using cross-validated logistic loss, with grid search for hyperparameter optimization (learning rate, tree depth, regularization).
  3. Class balance and early stopping are emphasized for robust convergence.

FGMNB enables arbitrary nonlinear interactions among features, in contrast with the additive structure of multivariate Naïve Bayes (GMMNB), substantially improving discriminative accuracy.
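The workflow above can be sketched with scikit-learn, using `GradientBoostingClassifier` as a stand-in for XGBoost (named in the paper as an alternative ensemble) and a synthetic feature matrix in place of the real motif log-likelihood-ratio features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic stand-in for per-instance motif log-likelihood-ratio features
# (9 columns, mirroring the signed-link feature dimension m = 9).
X = rng.normal(size=(400, 9))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Cross-validated grid search over the hyperparameters the text lists
# (learning rate, tree depth); AUC is the model-selection criterion.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1, 0.2], "max_depth": [3, 4]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)
best_auc = grid.best_score_  # cross-validated AUC of the best configuration
```

This is a sketch, not the authors' released pipeline; in practice the class-balanced splits and early stopping mentioned in step 3 would be added via the estimator's validation options.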

5. Experimental Architectures and Performance Benchmarking

Datasets

FGMNB is validated on multiple large-scale network datasets:

Social Bot Detection (Ran et al., 28 Dec 2025):

| Dataset | Nodes | Links | Bots |
|---|---|---|---|
| Cresci-15 | 1,741 | 6,214 | 636 |
| MGTAB | 9,443 | 425,863 | 2,475 |
| TwiBot-20 | 205,730 | 227,477 | 6,589 |
| TwiBot-22 | 693,761 | 3,711,903 | 81,432 |

Signed Network Prediction (Ran et al., 28 Dec 2025):

| Dataset | Nodes | Links | Pos/Neg (%) |
|---|---|---|---|
| BitcoinAlpha | 3,783 | 14,124 | 90/10 |
| BitcoinOTC | 5,881 | 21,492 | 86/14 |
| Wiki-RfA | 11,221 | 171,761 | 77/23 |
| Slashdot | 82,140 | 500,481 | 76/24 |

Results

FGMNB outperforms baseline models on nearly every benchmark—including unsupervised (Ising, SybilWalk, SybilSCAR), feature-based (Botometer, FP, ARG), and deep methods (DeeProBot, T5, BotRGCN, RGT for bot detection; DNE-SBP, SGCN, SiGAT, SDGNN, SE-SGformer for sign prediction):

Bot detection (Ran et al., 28 Dec 2025):

| Dataset | Accuracy | Precision | Recall | F₁ | AUC |
|---|---|---|---|---|---|
| Cresci-15 | 0.987±0.01 | 0.976±0.02 | 1.000 | 0.988 | 0.992±0.01 |
| MGTAB | 0.826±0.02 | 0.805±0.02 | 0.860 | 0.831 | 0.907±0.02 |
| TwiBot-20 | 0.827±0.01 | 0.812±0.01 | 0.921 | 0.863 | 0.914±0.01 |
| TwiBot-22 | 0.874±0.00 | 0.870±0.00 | 0.873 | 0.872 | 0.942±0.00 |

Sign prediction (Ran et al., 28 Dec 2025):

| Signed Data | Best Baseline AUC | GMMNB AUC | FGMNB AUC |
|---|---|---|---|
| BitcoinAlpha | 0.722±0.02 | 0.802±0.02 | 0.851±0.02 |
| BitcoinOTC | 0.836±0.01 | 0.903±0.03 | 0.920±0.01 |
| Wiki-RfA | 0.800±0.10 | 0.819±0.04 | 0.853±0.02 |
| Slashdot | 0.974±0.01 | 0.866±0.02 | 0.894±0.01 |

Selecting only high-capability motifs ($\text{AUC}'_{\text{upper}} > 0.7$) yields performance nearly identical to the full motif set, demonstrating efficient feature reduction.

6. Theoretical Insights and Consistency Properties

FGMNB achieves optimality under motif conditional independence, with ratio estimators converging to true class likelihoods as sample size increases. The framework generalizes Naïve Bayes at two levels: motif heterogeneity and nonlinear feature integration. The relaxation of additivity via ensemble methods such as XGBoost increases predictive accuracy, at the cost of interpretability.

Role function decomposition (direct vs. indirect neighbor impact) further corrects the homogeneity assumption of previous motif-based classifiers, enabling finer discrimination of local network structure (Ran et al., 28 Dec 2025).

7. Practical Implementation Guidelines

Motif enumeration is the principal computational bottleneck, with complexity scaling as $O(|N| \cdot k!)$ for discovery of $k$-node motifs and $O(m)$ for feature extraction, where $m$ is the number of motif types considered.
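As a sketch of the enumeration step for the 3-node case, the function below walks each node's neighbor pairs and reports every connected triad once, distinguishing wedges from triangles. The adjacency-set representation and function name are illustrative assumptions:

```python
from itertools import combinations

def enumerate_triads(graph):
    """Enumerate connected 3-node subgraphs (wedges and triangles).

    `graph` is an adjacency-set dict {node: set(neighbors)}. Each triad is
    keyed by its sorted node tuple so it is reported once; this local
    neighbor-pair scan is the dominant cost of motif discovery.
    """
    triads = {}
    for a, nbrs in graph.items():
        for b, c in combinations(sorted(nbrs), 2):
            key = tuple(sorted((a, b, c)))
            # Closed triangle if the outer pair is also linked, else wedge.
            triads[key] = "triangle" if c in graph[b] else "wedge"
    return triads

# Toy undirected graph: nodes 0-1-2 form a triangle, node 3 hangs off 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
triads = enumerate_triads(graph)
# {(0, 1, 2): "triangle", (0, 2, 3): "wedge", (1, 2, 3): "wedge"}
```

Precomputing these triads once per graph, then labeling them with $W$ to obtain heterogeneous motif counts, is what makes the offline/parallel precomputation advice below worthwhile.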

For training:

  • Use 10-fold cross-validation with balanced class splits.
  • Apply grid search for hyperparameter selection (learning rate 0.05–0.2, tree depth 3–6, regularization $\lambda = 1$–$10$).
  • Precompute motif features in parallel or offline for scalability.
  • Feature selection guided by motif coverage and capability.

FGMNB extends naturally to accommodate auxiliary node-level or global features via feature concatenation, with downstream classification remaining efficient and scalable.


FGMNB synthesizes motif-based topological context with statistical learning for node- and link-level classification. By quantifying heterogeneous neighborhood preference and integrating theoretically bounded feature selection, it enables robust, scalable, and generalizable discrimination in complex networks (Ran et al., 28 Dec 2025, Ran et al., 28 Dec 2025).
