
Granule Density Outlier Factor (GDOF)

Updated 28 December 2025
  • GDOF is a density-based outlier detection framework that uses fuzzy granulation and multiscale analysis to identify anomalies in both homogeneous and mixed datasets.
  • It combines attribute-level fuzzy similarity with density estimates to compute an interpretable outlier score sensitive to local and global sparsity.
  • The method supports unsupervised and semi-supervised regimes and achieves state-of-the-art performance on diverse benchmark datasets.

The Granule Density-based Outlier Factor (GDOF) is a flexible and theoretically grounded framework for outlier detection that integrates fuzzy set-based granulation, density estimation, and multiscale ensemble strategies. GDOF systematically combines attribute-level fuzzy granules to identify samples in locally or globally sparse regions of the data, thereby flagging them as potential outliers. The method supports both unsupervised and semi-supervised regimes, natively handles heterogeneous and mixed-type attributes, and achieves state-of-the-art accuracy across a variety of domains (Gao et al., 6 Jan 2025, Chen et al., 21 Dec 2025).

1. Mathematical Foundations of GDOF

Consider an information system or dataset $U$ with $n$ samples and attribute set $A$. GDOF builds on fuzzy rough set theory, representing each sample by a vector of fuzzy similarities and estimating its density relative to the remainder of the data.

Fuzzy Similarity:

Given an attribute $a \in A$ and normalized values $f_i^a$, the fuzzy similarity $R_a(x_i, x_j)$ is typically defined as:

  • For numerical $a$: $R_a(x_i, x_j) = 1 - |f_i^a - f_j^a|/\varepsilon$ if $|f_i^a - f_j^a| \leq \varepsilon$, otherwise $0$, where $\varepsilon = \mathrm{std}(a)/\delta$.
  • For categorical $a$: $R_a(x_i, x_j) = 1$ if $f_i^a = f_j^a$, $0$ otherwise.
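
A minimal NumPy sketch of these per-attribute similarity matrices (the function names and the default $\delta$ are illustrative, not taken from the papers):

```python
import numpy as np

def numerical_similarity(values: np.ndarray, delta: float = 2.0) -> np.ndarray:
    """Fuzzy similarity R_a for one numerical attribute; epsilon = std(a) / delta."""
    eps = values.std() / delta
    diff = np.abs(values[:, None] - values[None, :])
    # Linear similarity inside the epsilon window, truncated to 0 outside it
    return np.where(diff <= eps, 1.0 - diff / eps, 0.0)

def categorical_similarity(values: np.ndarray) -> np.ndarray:
    """Fuzzy similarity R_a for one categorical attribute: 1 on exact match, 0 otherwise."""
    return (values[:, None] == values[None, :]).astype(float)
```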

Fuzzy Granule and Density:

The fuzzy granule of $x$ under $a$ is the $n$-vector $[x]_a = \langle R_a(x, x_1), \dots, R_a(x, x_n)\rangle$, with granule cardinality $|[x]_a| = \sum_{j} R_a(x, x_j)$ and normalized density $\operatorname{Den}_a(x) = |[x]_a| / n$.
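
The granule cardinality and density are then just normalized row sums of the similarity matrix; a small sketch building on the functions above:

```python
import numpy as np

def granule_density(sim: np.ndarray) -> np.ndarray:
    """Den_a(x) = |[x]_a| / n: row sums of R_a, normalized by the sample count."""
    return sim.sum(axis=1) / sim.shape[0]
```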

Relative Density Adjustment:

For local density adaptation, $\mathrm{Rel\_Den}_a(x_i, x_j) = \exp\big(-\lambda\,(\operatorname{Den}_a(x_i) - \operatorname{Den}_a(x_j))^2\big)$ with $\lambda > 0$. The adjusted similarity is $\widetilde{R}_a(x_i, x_j) = R_a(x_i, x_j) \times \mathrm{Rel\_Den}_a(x_i, x_j)$.
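
The density modulation can be expressed as an element-wise reweighting of the similarity matrix; a sketch following the formula above (the default $\lambda$ is a placeholder):

```python
import numpy as np

def adjusted_similarity(sim: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """R~_a(x_i, x_j) = R_a(x_i, x_j) * exp(-lam * (Den_a(x_i) - Den_a(x_j))^2)."""
    den = granule_density(sim)
    rel_den = np.exp(-lam * (den[:, None] - den[None, :]) ** 2)
    return sim * rel_den
```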

Attribute Set Conjunction and Significance:

For $B \subseteq A$, define the combined similarity via conjunction, $\widetilde{R}_B(x_i, x_j) = \min_{a \in B} \widetilde{R}_a(x_i, x_j)$, with cardinality $|\widetilde{[x_i]}_B| = \sum_j \widetilde{R}_B(x_i, x_j)$. The granulation significance is $\operatorname{Sig}(B) = -\log\big(\sum_{x \in U} |\widetilde{[x]}_B| / n\big)$.
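
A sketch of the conjunction and significance computation, following the formulas exactly as stated above (helper names are illustrative):

```python
import numpy as np

def combined_similarity(adjusted_sims: list[np.ndarray]) -> np.ndarray:
    """R~_B: element-wise minimum over the adjusted similarity matrices of the attributes in B."""
    return np.minimum.reduce(adjusted_sims)

def significance(adjusted_sims: list[np.ndarray]) -> float:
    """Sig(B) = -log( sum_x |~[x]_B| / n ), as stated in the text."""
    sim_B = combined_similarity(adjusted_sims)
    return float(-np.log(sim_B.sum() / sim_B.shape[0]))
```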

GDOF Outlier Score:

Given a chain $A_1 \subset A_2 \subset \dots \subset A_m = A$ sorted by descending $\operatorname{Sig}$, define the Granule Density-based Outlier Factor ("GDOF"):

$$S(x) = 1 - \frac{1}{m} \sum_{i=1}^m \operatorname{Sig}(A_i) \cdot \frac{|\widetilde{[x]}_{A_i}|}{n}$$

A higher $S(x) \in [0,1]$ indicates a higher likelihood of being an outlier (Gao et al., 6 Jan 2025).
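
For concreteness, a minimal sketch of this scoring over a nested attribute chain, building on the helpers sketched above; the chain construction itself (ordering subsets by descending significance) is assumed to be given, and the parameter names are illustrative:

```python
import numpy as np

def gdof_scores(adj_sims: dict[str, np.ndarray], chain: list[list[str]]) -> np.ndarray:
    """S(x) = 1 - (1/m) * sum_i Sig(A_i) * |~[x]_{A_i}| / n over the chain A_1 ⊂ ... ⊂ A_m."""
    n = next(iter(adj_sims.values())).shape[0]
    m = len(chain)
    scores = np.ones(n)
    for subset in chain:
        sim_B = np.minimum.reduce([adj_sims[a] for a in subset])
        card = sim_B.sum(axis=1)              # |~[x]_B| for every sample x
        sig = -np.log(card.sum() / n)         # Sig(B) per the formula above
        scores -= sig * card / (n * m)
    return scores
```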

2. Algorithmic Workflow and Multiscale Integration

GDOF’s core is extensible, supporting multiscale, ensemble-based outlier detection via granular-ball decomposition and view fusion. The generalized algorithm involves:

  1. Multi-Scale View Generation:
    • Start from the finest partition (each point is a granular-ball).
    • Iteratively merge balls based on fuzzy similarity until a single ball remains.
    • Each partition at a granularity level forms a scale $k$, denoted $\mathrm{GBS}_k$.
  2. Within-Scale Scoring:
    • Treat each granular-ball as a super-sample; compute GDOF scores $S_k(x)$ for all constituent samples.
    • Map scores to probabilities $P_k(x)$ via a two-sided linear transform.
  3. Ensemble Fusion and Thresholding:
    • Fuse view-specific probabilities: $P(x) = \frac{\sum_k \nu_k P_k(x)}{\sum_k \nu_k}$, with view weights $\nu_k = 1 - \operatorname{avg}_x H(P_k(x))$, where $H$ is the binary entropy.
    • Three-way decision partition:
      • $\text{POS} = \{x \mid P(x) \geq \alpha\}$
      • $\text{NEG} = \{x \mid P(x) \leq \beta\}$
      • $\text{BND} =$ remainder
  4. SVM-Based Refinement:
    • Train a weighted SVM on POS (outlier) and NEG (inlier) with sample weights $\mu(x) = 1 - \operatorname{avg}_k \nu_k H(P_k(x))$.
    • Platt-scale SVM outputs for BND to deliver final outlier probabilities (Gao et al., 6 Jan 2025).

Pseudocode encapsulating this workflow:

Input: U={x₁…xₙ}, A, λ, δ, t
1. GBSV = generateGranularBalls(U, A, δ)
2. For each scale k:
      Compute Sₖ(x) for all x using density-enhanced granules
      Map Sₖ to Pₖ(x)
      νₖ = 1 – (1/n) ∑ₓ H(Pₖ(x))
3. Fuse: P(x) = ∑ₖ νₖPₖ(x)/∑ₖνₖ
   Compute thresholds α, β; assign POS, NEG, BND
4. Train SVM on POS(+1), NEG(–1), sample weights μ(x)
5. For x in BND, Platt-scale SVM outputs → P̂(x)
Output: final outlier probabilities P̂(x)
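
For concreteness, a NumPy sketch of the fusion and three-way partition of step 3; the values of α and β here are placeholders, since the papers derive the thresholds from the parameters $t$ and $\Delta$:

```python
import numpy as np

def binary_entropy(p: np.ndarray) -> np.ndarray:
    """Binary entropy H(p), clipped away from 0 and 1 for numerical stability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def fuse_and_partition(P_k: np.ndarray, alpha: float = 0.7, beta: float = 0.3):
    """P_k has shape (K, n): one row of outlier probabilities per scale.
    View weight = 1 minus the average binary entropy of that view's probabilities;
    the fused probability is the weight-normalized sum, then thresholded into POS/NEG/BND."""
    nu = 1.0 - binary_entropy(P_k).mean(axis=1)
    P = (nu[:, None] * P_k).sum(axis=0) / nu.sum()
    pos = np.where(P >= alpha)[0]
    neg = np.where(P <= beta)[0]
    bnd = np.setdiff1d(np.arange(P.size), np.concatenate([pos, neg]))
    return P, pos, neg, bnd
```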

A more attribute-centric, label-informed GDOF is developed in (Chen et al., 21 Dec 2025), optimizing per-attribute fuzzy radii for discrimination between (few) labeled outliers and (sampled) inliers, and forming an outlier score as a weighted sum over attributes’ granule densities.

3. Computational Complexity and Parameterization

The time and space complexity are governed by pairwise operations and the number of attributes:

  • Single-view FRS+GDOF: $O(|A|\,n^2)$ (mainly from similarity and granule construction)
  • Multi-scale granular-ball generation: $O(n (\log n)^2)$; the number of scales is $K = O(\log n)$ empirically
  • SVM refinement: $O(|A|\,n^2)$ (for SMO-like solvers)
  • Overall: $O(|A|\,n^2 \log n)$ time, $O(n^2)$ memory

Parameterization:

  • $\delta$ regulates the neighborhood window $\varepsilon_a = \mathrm{std}(a)/\delta$
  • $\lambda \geq 0$ tunes the impact of local density contrast in similarity modulation
  • The threshold $t$ and margin parameter $\Delta$ define the three-way division
  • For label-informed GDOF, per-attribute radii $\lambda^k$ are optimized to enhance density separation between outliers and inliers (Gao et al., 6 Jan 2025, Chen et al., 21 Dec 2025)

In practice, sparsity in the similarity matrices and small $\lambda$ offer further computational savings.

4. Illustrative Example

Consider five 1D samples normalized to $[0,1]$: $x = \{0.1, 0.15, 0.2, 0.8, 0.85\}$. Choose $\varepsilon = 0.3$ (so $\delta$ is tuned accordingly):

  • Compute $R_a(x_i, x_j)$:

Fill the $5 \times 5$ similarity matrix, e.g., $R_a(0.1, 0.15) = 1 - 0.05/0.3 \approx 0.833$.

  • Granule cardinality:

$|[x_1]_a| \approx 1 + 0.833 + 0.667 + 0 + 0 = 2.5$, then $\operatorname{Den}_a(x_1) = 2.5/5 = 0.5$.

  • Relative density:

E.g., $\mathrm{Rel\_Den}_a(x_1, x_3) = \exp(-\lambda(0.5 - 0.5)^2) = 1$, since $x_1$ and $x_3$ have the same granule density.

  • Adjusted similarity and GDOF score:

$\widetilde{R}_a(x_i, x_j) = R_a(x_i, x_j) \cdot \mathrm{Rel\_Den}_a(x_i, x_j)$; single-attribute significance $\operatorname{Sig}(\{a\}) = -\log\big(\sum_i \operatorname{Den}_a(x_i)\big)$; final score, e.g., $S(x_i) = 1 - \operatorname{Sig}(\{a\}) \cdot \operatorname{Den}_a(x_i)$.
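
The intermediate quantities of this toy example can be checked with a few lines of NumPy ($\lambda = 1$ is a placeholder); the single-attribute significance and score then follow the formulas quoted above:

```python
import numpy as np

x = np.array([0.1, 0.15, 0.2, 0.8, 0.85])
eps = 0.3
lam = 1.0                                           # placeholder value for lambda

diff = np.abs(x[:, None] - x[None, :])
R = np.where(diff <= eps, 1.0 - diff / eps, 0.0)    # 5x5 fuzzy similarity matrix
den = R.sum(axis=1) / x.size                        # Den_a ≈ [0.5, 0.533, 0.5, 0.367, 0.367]

rel_den = np.exp(-lam * (den[:, None] - den[None, :]) ** 2)
R_adj = R * rel_den                                 # density-adjusted similarity

print(np.round(R, 3))
print(np.round(den, 3))
```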

5. Theoretical Properties and Interpretability

GDOF provides principled density estimates both globally and locally:

  • In dense clusters, for all $x_i, x_j$ with $|f_i^a - f_j^a| < \delta$, $GD_a(x) \in \big((1-\delta)^2,\ 1/(1-\delta)\big)$ (Prop. 1, Chen et al., 21 Dec 2025).
  • Adding a distant point to a neighborhood reduces $GD_a(x)$ for points in the cluster (Prop. 2), so GDOF is sensitive to increases in sparsity.

GDOF is thus well-aligned with classical notions of density-based outliers, with the added advantage of attribute-wise decomposition and the ability to natively process heterogeneity and fuzziness.

A plausible implication is that GDOF enables interpretability: attributes with high discriminatory power for outliers are explicitly weighted, and outlier scores are directly connected to fuzzy local densities. Multiscale and ensemble aspects further mitigate sensitivity to scale and cluster structure.

6. Empirical Performance

On 20 benchmark datasets from UCI, MVTec-AD, and OD-bench, GDOF and its multiscale variants consistently achieve state-of-the-art performance:

  • Unsupervised/multiscale GDOF (Gao et al., 6 Jan 2025):
    • Average AUROC $\approx 0.873$
    • Outperforms single-view FRS ($\approx 0.75$), LOF ($\approx 0.72$), kNN ($\approx 0.79$), CBLOF ($\approx 0.79$), and isolation forest ($\approx 0.80$)
    • Minimum 8.5% AUROC gain over the best non-ensemble baseline (Friedman+Nemenyi, $p < 0.05$)
  • Label-informed GDOF (Chen et al., 21 Dec 2025):
    • Mean AUC $= 0.849$ versus $0.789$ for the best competitor
    • On mixed/categorical datasets: +10–15% AUC versus the state of the art
    • AP rises from $0.542$ (next best) to $0.594$
    • Performance is robust as the number of pseudo-inliers $N_-$ varies (50–500); label efficiency is high, with gains saturating after 5–30 labeled outliers

These results confirm GDOF’s strong empirical validity for both classical and challenging mixed-type datasets.

7. Extensions: Heterogeneous and Label-Informed GDOF

GDOF extends to heterogeneous data by granulating each attribute according to its type and optimizing the fuzzy radius $\lambda^k$ for the best separation between a small set of labeled outliers and putative or given inliers. The final outlier score is a weighted sum of attribute densities, where each attribute's weight is tied to its relevance (the difference between average inlier density and average outlier density) (Chen et al., 21 Dec 2025).
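
A hedged sketch of one plausible reading of this attribute-weighted scoring, assuming per-attribute granule densities have already been computed; using 1 minus density so that sparser samples score higher is an illustrative choice, and the exact combination in Chen et al. (21 Dec 2025) may differ:

```python
import numpy as np

def attribute_weighted_score(den_by_attr: np.ndarray,
                             outlier_idx: np.ndarray,
                             inlier_idx: np.ndarray) -> np.ndarray:
    """den_by_attr: (|A|, n) matrix of per-attribute granule densities Den_a(x).
    Attribute weight = average inlier density minus average outlier density, so
    attributes on which labeled outliers look sparse count more; scores use
    1 - density so that sparser samples receive higher values."""
    w = den_by_attr[:, inlier_idx].mean(axis=1) - den_by_attr[:, outlier_idx].mean(axis=1)
    w = np.clip(w, 0.0, None)
    w = w / (w.sum() + 1e-12)                        # normalize weights to sum to 1
    return (w[:, None] * (1.0 - den_by_attr)).sum(axis=0)
```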

A plausible implication is that GDOF adapts gracefully to domains with mixed numerical, ordinal, and categorical data, and leverages modest amounts of labeled anomaly data to prioritize informative features. Negative sampling strategies allow usage in settings where inlier labels are scarce or absent.

