
Granule Density Outlier Factor (GDOF)

Updated 28 December 2025
  • GDOF is a density-based outlier detection framework that uses fuzzy granulation and multiscale analysis to identify anomalies in both homogeneous and mixed datasets.
  • It combines attribute-level fuzzy similarity with density estimates to compute an interpretable outlier score sensitive to local and global sparsity.
  • The method supports unsupervised and semi-supervised regimes and achieves state-of-the-art performance on diverse benchmark datasets.

The Granule Density-based Outlier Factor (GDOF) is a flexible and theoretically grounded framework for outlier detection that integrates fuzzy set-based granulation, density estimation, and multiscale ensemble strategies. GDOF systematically combines attribute-level fuzzy granules to identify samples in locally or globally sparse regions of the data, thereby flagging them as potential outliers. The method supports both unsupervised and semi-supervised regimes, natively handles heterogeneous and mixed-type attributes, and achieves state-of-the-art accuracy across a variety of domains (Gao et al., 6 Jan 2025, Chen et al., 21 Dec 2025).

1. Mathematical Foundations of GDOF

Consider an information system or dataset $U$ with $n$ samples and attribute set $A$. GDOF builds on fuzzy rough set theory, representing each sample by a vector of fuzzy similarities and estimating its density relative to the remainder of the data.

Fuzzy Similarity:

Given an attribute $a \in A$ and normalized values $f_i^a$, the fuzzy similarity $R_a(x_i, x_j)$ is typically defined as:

  • For numerical $a$: $R_a(x_i, x_j) = 1 - |f_i^a - f_j^a|/\varepsilon$ if $|f_i^a - f_j^a| \leq \varepsilon$, otherwise $0$, where $\varepsilon = \mathrm{std}(a)/\delta$.
  • For categorical $a$: $R_a(x_i, x_j) = 1$ if $f_i^a = f_j^a$, $0$ otherwise.
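
A minimal NumPy sketch of these per-attribute similarity matrices (the function names and the default $\delta$ are illustrative, not taken from the papers):

```python
import numpy as np

def numerical_similarity(values: np.ndarray, delta: float = 2.0) -> np.ndarray:
    """Fuzzy similarity R_a for one numerical attribute; epsilon = std(a) / delta."""
    eps = values.std() / delta
    diff = np.abs(values[:, None] - values[None, :])
    # Linear similarity inside the epsilon window, truncated to 0 outside it
    return np.where(diff <= eps, 1.0 - diff / eps, 0.0)

def categorical_similarity(values: np.ndarray) -> np.ndarray:
    """Fuzzy similarity R_a for one categorical attribute: 1 on exact match, 0 otherwise."""
    return (values[:, None] == values[None, :]).astype(float)
```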

Fuzzy Granule and Density:

The fuzzy granule of $x$ under $a$ is the $n$-vector $[x]_a = \langle R_a(x, x_1), \dots, R_a(x, x_n)\rangle$, with granule cardinality $|[x]_a| = \sum_{j} R_a(x, x_j)$ and normalized density $\operatorname{Den}_a(x) = |[x]_a| / n$.
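
The granule cardinality and density are then just normalized row sums of the similarity matrix; a small sketch building on the functions above:

```python
import numpy as np

def granule_density(sim: np.ndarray) -> np.ndarray:
    """Den_a(x) = |[x]_a| / n: row sums of R_a, normalized by the sample count."""
    return sim.sum(axis=1) / sim.shape[0]
```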

Relative Density Adjustment:

For local density adaptation, $\mathrm{Rel\_Den}_a(x_i, x_j) = \exp\big(-\lambda\,(\operatorname{Den}_a(x_i) - \operatorname{Den}_a(x_j))^2\big)$ with $\lambda > 0$. The adjusted similarity is $\widetilde{R}_a(x_i, x_j) = R_a(x_i, x_j) \times \mathrm{Rel\_Den}_a(x_i, x_j)$.
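
The density modulation can be expressed as an element-wise reweighting of the similarity matrix; a sketch following the formula above (the default $\lambda$ is a placeholder):

```python
import numpy as np

def adjusted_similarity(sim: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """R~_a(x_i, x_j) = R_a(x_i, x_j) * exp(-lam * (Den_a(x_i) - Den_a(x_j))^2)."""
    den = granule_density(sim)
    rel_den = np.exp(-lam * (den[:, None] - den[None, :]) ** 2)
    return sim * rel_den
```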

Attribute Set Conjunction and Significance:

For $B \subseteq A$, define the combined similarity via conjunction, $\widetilde{R}_B(x_i, x_j) = \min_{a \in B} \widetilde{R}_a(x_i, x_j)$, with cardinality $|\widetilde{[x_i]}_B| = \sum_j \widetilde{R}_B(x_i, x_j)$. The granulation significance is $\operatorname{Sig}(B) = -\log\big(\sum_{x \in U} |\widetilde{[x]}_B| / n\big)$.
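
A sketch of the conjunction and significance computation, following the formulas exactly as stated above (helper names are illustrative):

```python
import numpy as np

def combined_similarity(adjusted_sims: list[np.ndarray]) -> np.ndarray:
    """R~_B: element-wise minimum over the adjusted similarity matrices of the attributes in B."""
    return np.minimum.reduce(adjusted_sims)

def significance(adjusted_sims: list[np.ndarray]) -> float:
    """Sig(B) = -log( sum_x |~[x]_B| / n ), as stated in the text."""
    sim_B = combined_similarity(adjusted_sims)
    return float(-np.log(sim_B.sum() / sim_B.shape[0]))
```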

GDOF Outlier Score:

Given a chain $A_1 \subset A_2 \subset \dots \subset A_m = A$ sorted by descending $\operatorname{Sig}$, define the Granule Density-based Outlier Factor ("GDOF"):

$$S(x) = 1 - \frac{1}{m} \sum_{i=1}^m \operatorname{Sig}(A_i) \cdot \frac{|\widetilde{[x]}_{A_i}|}{n}$$

A higher $S(x) \in [0,1]$ indicates a higher likelihood of being an outlier (Gao et al., 6 Jan 2025).
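
For concreteness, a minimal sketch of this scoring over a nested attribute chain, building on the helpers sketched above; the chain construction itself (ordering subsets by descending significance) is assumed to be given, and the parameter names are illustrative:

```python
import numpy as np

def gdof_scores(adj_sims: dict[str, np.ndarray], chain: list[list[str]]) -> np.ndarray:
    """S(x) = 1 - (1/m) * sum_i Sig(A_i) * |~[x]_{A_i}| / n over the chain A_1 ⊂ ... ⊂ A_m."""
    n = next(iter(adj_sims.values())).shape[0]
    m = len(chain)
    scores = np.ones(n)
    for subset in chain:
        sim_B = np.minimum.reduce([adj_sims[a] for a in subset])
        card = sim_B.sum(axis=1)              # |~[x]_B| for every sample x
        sig = -np.log(card.sum() / n)         # Sig(B) per the formula above
        scores -= sig * card / (n * m)
    return scores
```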

2. Algorithmic Workflow and Multiscale Integration

GDOF’s core is extensible, supporting multiscale, ensemble-based outlier detection via granular-ball decomposition and view fusion. The generalized algorithm involves:

  1. Multi-Scale View Generation:
    • Start from the finest partition (each point is a granular-ball).
    • Iteratively merge balls based on fuzzy similarity until a single ball remains.
    • Each partition at a granularity level forms a scale $k$, denoted $\mathrm{GBS}_k$.
  2. Within-Scale Scoring:
    • Treat each granular-ball as a super-sample; compute GDOF scores $S_k(x)$ for all constituent samples.
    • Map scores to probabilities $P_k(x)$ via a two-sided linear transform.
  3. Ensemble Fusion and Thresholding:
    • Fuse view-specific probabilities: $P(x) = \frac{\sum_k \nu_k P_k(x)}{\sum_k \nu_k}$, with view weights $\nu_k = 1 - \operatorname{avg}_x H(P_k(x))$, where $H$ is the binary entropy.
    • Three-way decision partition:
      • $\text{POS} = \{x \mid P(x) \geq \alpha\}$
      • $\text{NEG} = \{x \mid P(x) \leq \beta\}$
      • $\text{BND} =$ remainder
  4. SVM-Based Refinement:
    • Train a weighted SVM on POS (outlier) and NEG (inlier) with sample weights $\mu(x) = 1 - \operatorname{avg}_k \nu_k H(P_k(x))$.
    • Platt-scale SVM outputs for BND to deliver final outlier probabilities (Gao et al., 6 Jan 2025).

Pseudocode encapsulating this workflow:

Input: U={x₁…xₙ}, A, λ, δ, t
1. GBSV = generateGranularBalls(U, A, δ)
2. For each scale k:
      Compute Sₖ(x) for all x using density-enhanced granules
      Map Sₖ to Pₖ(x)
      νₖ = 1 – (1/n) ∑ₓ H(Pₖ(x))
3. Fuse: P(x) = ∑ₖ νₖPₖ(x)/∑ₖνₖ
   Compute thresholds α, β; assign POS, NEG, BND
4. Train SVM on POS(+1), NEG(–1), sample weights μ(x)
5. For x in BND, Platt-scale SVM outputs → P̂(x)
Output: final outlier probabilities P̂(x)
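
For concreteness, a NumPy sketch of the fusion and three-way partition of step 3; the values of α and β here are placeholders, since the papers derive the thresholds from the parameters $t$ and $\Delta$:

```python
import numpy as np

def binary_entropy(p: np.ndarray) -> np.ndarray:
    """Binary entropy H(p), clipped away from 0 and 1 for numerical stability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def fuse_and_partition(P_k: np.ndarray, alpha: float = 0.7, beta: float = 0.3):
    """P_k has shape (K, n): one row of outlier probabilities per scale.
    View weight = 1 minus the average binary entropy of that view's probabilities;
    the fused probability is the weight-normalized sum, then thresholded into POS/NEG/BND."""
    nu = 1.0 - binary_entropy(P_k).mean(axis=1)
    P = (nu[:, None] * P_k).sum(axis=0) / nu.sum()
    pos = np.where(P >= alpha)[0]
    neg = np.where(P <= beta)[0]
    bnd = np.setdiff1d(np.arange(P.size), np.concatenate([pos, neg]))
    return P, pos, neg, bnd
```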

A more attribute-centric, label-informed GDOF is developed in (Chen et al., 21 Dec 2025), optimizing per-attribute fuzzy radii for discrimination between (few) labeled outliers and (sampled) inliers, and forming an outlier score as a weighted sum over attributes’ granule densities.

3. Computational Complexity and Parameterization

The time and space complexity are governed by pairwise operations and the number of attributes:

  • Single-view FRS+GDOF: $O(|A|\,n^2)$ (mainly from similarity and granule construction)
  • Multi-scale granular-ball generation: $O(n (\log n)^2)$; the number of scales is $K = O(\log n)$ empirically
  • SVM refinement: $O(|A|\,n^2)$ (for SMO-like solvers)
  • Overall: $O(|A|\,n^2 \log n)$ time, $O(n^2)$ memory

Parameterization:

  • $\delta$ regulates the neighborhood window $\varepsilon_a = \mathrm{std}(a)/\delta$
  • $\lambda \geq 0$ tunes the impact of local density contrast in similarity modulation
  • The threshold $t$ and margin parameter $\Delta$ define the three-way division
  • For label-informed GDOF, per-attribute radii $\lambda^k$ are optimized to enhance density separation between outliers and inliers (Gao et al., 6 Jan 2025, Chen et al., 21 Dec 2025)

In practice, sparsity in the similarity matrices and small $\lambda$ offer further computational savings.

4. Illustrative Example

Consider five 1D samples normalized to $[0,1]$: $x = \{0.1, 0.15, 0.2, 0.8, 0.85\}$. Choose $\varepsilon = 0.3$ (so $\delta$ is tuned accordingly):

  • Compute $R_a(x_i, x_j)$:

Fill the $5 \times 5$ similarity matrix, e.g., $R_a(0.1, 0.15) = 1 - 0.05/0.3 \approx 0.833$.

  • Granule cardinality:

$|[x_1]_a| \approx 1 + 0.833 + 0.667 + 0 + 0 = 2.5$, then $\operatorname{Den}_a(x_1) = 2.5/5 = 0.5$.

  • Relative density:

E.g., $\mathrm{Rel\_Den}_a(x_1, x_3) = \exp(-\lambda(0.5 - 0.5)^2) = 1$, since $x_1$ and $x_3$ have the same granule density.

  • Adjusted similarity and GDOF score:

$\widetilde{R}_a(x_i, x_j) = R_a(x_i, x_j) \cdot \mathrm{Rel\_Den}_a(x_i, x_j)$; single-attribute significance $\operatorname{Sig}(\{a\}) = -\log\big(\sum_i \operatorname{Den}_a(x_i)\big)$; final score, e.g., $S(x_i) = 1 - \operatorname{Sig}(\{a\}) \cdot \operatorname{Den}_a(x_i)$.
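
The intermediate quantities of this toy example can be checked with a few lines of NumPy ($\lambda = 1$ is a placeholder); the single-attribute significance and score then follow the formulas quoted above:

```python
import numpy as np

x = np.array([0.1, 0.15, 0.2, 0.8, 0.85])
eps = 0.3
lam = 1.0                                           # placeholder value for lambda

diff = np.abs(x[:, None] - x[None, :])
R = np.where(diff <= eps, 1.0 - diff / eps, 0.0)    # 5x5 fuzzy similarity matrix
den = R.sum(axis=1) / x.size                        # Den_a ≈ [0.5, 0.533, 0.5, 0.367, 0.367]

rel_den = np.exp(-lam * (den[:, None] - den[None, :]) ** 2)
R_adj = R * rel_den                                 # density-adjusted similarity

print(np.round(R, 3))
print(np.round(den, 3))
```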

5. Theoretical Properties and Interpretability

GDOF provides principled density estimates both globally and locally:

  • In dense clusters, for all $x_i, x_j$ with $|f_i^a - f_j^a| < \delta$, $GD_a(x) \in \big((1-\delta)^2,\ 1/(1-\delta)\big)$ (Prop. 1, Chen et al., 21 Dec 2025).
  • Adding a distant point to a neighborhood reduces $GD_a(x)$ for points in the cluster (Prop. 2), so GDOF is sensitive to increases in sparsity.

GDOF is thus well-aligned with classical notions of density-based outliers, with the added advantage of attribute-wise decomposition and the ability to natively process heterogeneity and fuzziness.

A plausible implication is that GDOF enables interpretability: attributes with high discriminatory power for outliers are explicitly weighted, and outlier scores are directly connected to fuzzy local densities. Multiscale and ensemble aspects further mitigate sensitivity to scale and cluster structure.

6. Empirical Performance

On 20 benchmark datasets from UCI, MVTec-AD, and OD-bench, GDOF and its multiscale variants consistently achieve state-of-the-art performance:

  • Unsupervised/multiscale GDOF (Gao et al., 6 Jan 2025):
    • Average AUROC $\approx 0.873$
    • Outperforms single-view FRS ($\approx 0.75$), LOF ($\approx 0.72$), kNN ($\approx 0.79$), CBLOF ($\approx 0.79$), and isolation forest ($\approx 0.80$)
    • Minimum 8.5% AUROC gain over the best non-ensemble baseline (Friedman+Nemenyi, $p < 0.05$)
  • Label-informed GDOF (Chen et al., 21 Dec 2025):
    • Mean AUC $= 0.849$ versus $0.789$ for the best competitor
    • On mixed/categorical datasets: +10–15% AUC versus the state of the art
    • AP rises from $0.542$ (next best) to $0.594$
    • Performance is robust as the number of pseudo-inliers $N_-$ varies (50–500); label efficiency is high, with gains saturating after 5–30 labeled outliers

These results confirm GDOF’s strong empirical validity for both classical and challenging mixed-type datasets.

7. Extensions: Heterogeneous and Label-Informed GDOF

GDOF extends to heterogeneous data by granulating each attribute according to its type and optimizing the fuzzy radius $\lambda^k$ for the best separation between a small set of labeled outliers and putative or given inliers. The final outlier score is a weighted sum of attribute densities, where each attribute's weight is tied to its relevance (the difference between average inlier density and average outlier density) (Chen et al., 21 Dec 2025).
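
A hedged sketch of one plausible reading of this attribute-weighted scoring, assuming per-attribute granule densities have already been computed; using 1 minus density so that sparser samples score higher is an illustrative choice, and the exact combination in Chen et al. (21 Dec 2025) may differ:

```python
import numpy as np

def attribute_weighted_score(den_by_attr: np.ndarray,
                             outlier_idx: np.ndarray,
                             inlier_idx: np.ndarray) -> np.ndarray:
    """den_by_attr: (|A|, n) matrix of per-attribute granule densities Den_a(x).
    Attribute weight = average inlier density minus average outlier density, so
    attributes on which labeled outliers look sparse count more; scores use
    1 - density so that sparser samples receive higher values."""
    w = den_by_attr[:, inlier_idx].mean(axis=1) - den_by_attr[:, outlier_idx].mean(axis=1)
    w = np.clip(w, 0.0, None)
    w = w / (w.sum() + 1e-12)                        # normalize weights to sum to 1
    return (w[:, None] * (1.0 - den_by_attr)).sum(axis=0)
```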

A plausible implication is that GDOF adapts gracefully to domains with mixed numerical, ordinal, and categorical data, and leverages modest amounts of labeled anomaly data to prioritize informative features. Negative sampling strategies allow usage in settings where inlier labels are scarce or absent.

