G-OSR Benchmark
- G-OSR Benchmark is a comprehensive framework that defines evaluation protocols, datasets, and methods for detecting unseen classes and managing uncertainty.
- It integrates techniques such as post-hoc scoring, training-time regularization, and generative methods to optimize both known-class accuracy and unknown-class rejection.
- Empirical findings, reported with metrics such as AUROC and FPR95 on image and graph benchmarks, demonstrate improved unknown detection and highlight the framework's robustness in real-world applications.
The G-OSR Benchmark refers to a family of evaluation protocols, datasets, and methodologies for assessing open-set recognition (OSR), open-world and out-of-distribution (OOD) detection, often under the "generalized" open-set setting. The G-OSR paradigm has evolved to incorporate recognition, detection, and uncertainty estimation under both standard and specialized setups, including images, graphs, and multimodal tasks. This article provides a comprehensive synthesis of representative G-OSR benchmarks with emphasis on their formal definitions, protocols, methods, and key findings as established in the literature.
1. Formal Problem Definitions
G-OSR tasks generalize classical closed-set classification by introducing "unseen" or "unknown" categories at test time. Let $\mathcal{Y}_{\text{seen}}$ denote the set of labeled training classes and $\mathcal{Y}_{\text{unseen}}$ the set of classes exclusive to test time, with $\mathcal{Y}_{\text{seen}} \cap \mathcal{Y}_{\text{unseen}} = \emptyset$. The learning objective is twofold:
- Known-class decision: Maximize classification accuracy on samples whose labels lie in $\mathcal{Y}_{\text{seen}}$.
- Unknown-class detection: Detect and reject samples from classes in $\mathcal{Y}_{\text{unseen}}$ with high precision.
For a model outputting a softmax vector $p(x) = (p_1(x), \dots, p_K(x))$ over the $K$ seen classes and an "unknown" confidence score $s(x)$ (e.g., $s(x) = \max_k p_k(x)$ or an analogous quantity), the decision rule involves a threshold $\tau$:

$$\hat{y}(x) = \begin{cases} \arg\max_k p_k(x), & s(x) \ge \tau, \\ \text{unknown}, & s(x) < \tau. \end{cases}$$
This formulation is canonical for both tabular/image (Wang et al., 29 Aug 2024, Yang et al., 2022, Engelbrecht et al., 2023, Geng et al., 2019) and graph-structured data (Dong et al., 1 Mar 2025).
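To make the decision rule concrete, the following is a minimal sketch in Python, assuming the unknown score is the maximum softmax probability (MSP); the function name, threshold value, and the use of MSP as $s(x)$ are illustrative assumptions rather than any particular benchmark's implementation.

```python
import numpy as np

def open_set_predict(probs, tau=0.5, unknown_label=-1):
    """Thresholded open-set decision rule (minimal sketch).

    probs: (N, K) softmax probabilities over the K seen classes.
    tau:   rejection threshold; benchmarks tune this per dataset, 0.5 is illustrative.
    Returns an (N,) array of predicted class indices, with `unknown_label` for rejections.
    """
    probs = np.asarray(probs)
    confidence = probs.max(axis=1)            # s(x) = max_k p_k(x), the MSP score
    preds = probs.argmax(axis=1)              # closed-set decision: argmax_k p_k(x)
    preds[confidence < tau] = unknown_label   # reject low-confidence samples as "unknown"
    return preds

# Two confident seen-class samples and one ambiguous sample that gets rejected.
p = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.40, 0.35, 0.25]])
print(open_set_predict(p, tau=0.5))           # -> [ 0  1 -1]
```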
2. Benchmark Design and Dataset Suites
G-OSR benchmarks are instantiated on a range of domains and data types:
- Images: Standard splits distinguish between in-distribution (ID) and OOD or unseen categories, e.g.,
- SVHN, CIFAR-10/100, ImageNet-1K/21K, CUB, AWA2, SUN, aPY (Wang et al., 29 Aug 2024, Engelbrecht et al., 2023, Yang et al., 2022, Geng et al., 2019).
- Fine-grained semantic shift (CUB-SSB, SCARS, FGVC-Aircraft, ImageNet-SSB).
- Graphs: Diverse node- and graph-level datasets, including Cora, Citeseer, Pubmed, Flickr, Amazon-Photo (node-level), and MUTAG, PROTEINS, NCI1, D&D, ENZYMES (graph-level) (Dong et al., 1 Mar 2025).
The benchmarks typically enforce:
- Training exclusively on samples from the seen classes $\mathcal{Y}_{\text{seen}}$.
- Test-time exposure to both seen and unseen classes, with the model tasked to both assign correct labels and detect/reject unknowns.
Semi-supervised and adversarial sampling regimes are explored, including synthetic "bad-looking" sample generation via GANs (Engelbrecht et al., 2023).
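As an illustration of the split protocol above, the sketch below constructs a random seen/unseen class partition from a labeled pool; the helper name and the choice of a uniformly random class partition are assumptions for exposition (actual benchmarks fix specific splits).

```python
import numpy as np

def make_open_set_split(labels, n_seen, seed=0):
    """Randomly partition classes into seen/unseen (minimal sketch of the protocol).

    labels: (N,) integer class labels of the full labeled pool.
    n_seen: number of classes treated as seen/known; the rest become unseen.
    Returns a boolean mask over samples eligible for training (seen classes only)
    plus the seen and unseen class-id sets.
    """
    rng = np.random.default_rng(seed)
    classes = rng.permutation(np.unique(labels))
    seen, unseen = classes[:n_seen], classes[n_seen:]
    train_mask = np.isin(labels, seen)        # training uses only these samples
    return train_mask, set(seen.tolist()), set(unseen.tolist())

# Toy example: 6 classes, 4 kept as seen. Train on labels[train_mask] only;
# at test time, samples from `unseen` must be rejected rather than labeled.
labels = np.repeat(np.arange(6), 3)
train_mask, seen, unseen = make_open_set_split(labels, n_seen=4)
```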
3. Evaluation Protocols and Metrics
Assessment is conducted on both classification and rejection performance using established metrics:
- AUROC: Area under the receiver operating characteristic curve, obtained by sweeping the rejection threshold and plotting the true positive rate on known classes against the false positive rate on unseen classes (Wang et al., 29 Aug 2024, Yang et al., 2022, Dong et al., 1 Mar 2025).
- FPR@95%TPR (FPR95): The FPR measured at the operating point where TPR is 95% (Yang et al., 2022, Dong et al., 1 Mar 2025).
- AUPR: Area under the precision-recall curve (unknown as positive) (Dong et al., 1 Mar 2025).
- Open-Set Classification Rate (OSCR): Integrates correct classification and rejection across thresholds (Wang et al., 29 Aug 2024).
- Harmonic Mean ($H$): Combines seen-class accuracy $A_{\text{seen}}$ and unknown-class (rejection) accuracy $A_{\text{unseen}}$ as $H = \frac{2\,A_{\text{seen}} A_{\text{unseen}}}{A_{\text{seen}} + A_{\text{unseen}}}$ (Geng et al., 2019).
- F1 for OSR: F1-score with "unknown" as positive (Dong et al., 1 Mar 2025).
Test splits are constructed to provide a balanced and rigorous quantification of both ID and OOD/unknown detection capability. Some benchmarks average over multiple random splits (Yang et al., 2022, Dong et al., 1 Mar 2025).
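For reference, the following sketch computes AUROC, AUPR, and FPR95 from per-sample unknown scores using scikit-learn, treating "unknown" as the positive class (the AUPR convention above); papers that instead treat known samples as positive define FPR95 over the opposite roles, so the convention should always be stated.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def osr_detection_metrics(unknown_scores, is_unknown):
    """AUROC, AUPR, and FPR@95%TPR for unknown detection (minimal sketch).

    unknown_scores: (N,) scores, higher = more likely unknown.
    is_unknown:     (N,) binary labels, 1 for unseen-class samples, 0 for seen.
    Here "unknown" is the positive class; some benchmarks use the reverse.
    """
    auroc = roc_auc_score(is_unknown, unknown_scores)
    aupr = average_precision_score(is_unknown, unknown_scores)
    fpr, tpr, _ = roc_curve(is_unknown, unknown_scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]   # FPR at the first threshold reaching 95% TPR
    return {"AUROC": auroc, "AUPR": aupr, "FPR95": fpr95}

# Synthetic example: unknown samples tend to receive higher scores.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 500)])
labels = np.concatenate([np.zeros(500), np.ones(500)])
print(osr_detection_metrics(scores, labels))
```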
4. Methodological Taxonomy
Methods benchmarked under the G-OSR paradigm fall into distinct categories:
- Closed-Set Baselines: Classical classifiers (e.g., ResNet, GNN variants) with open-set decision via calibrated thresholding (Dong et al., 1 Mar 2025).
- Post-hoc Inference: Scoring methods such as MSP (max softmax probability), ODIN, Mahalanobis distance, Energy, KNN, MLS, and VIM; often requiring no retraining (Yang et al., 2022, Wang et al., 29 Aug 2024).
- Training-Time Regularization:
- Outlier Exposure (OE): Trains with auxiliary outlier samples whose softmax outputs are driven toward high-entropy (uniform) targets (Wang et al., 29 Aug 2024, Yang et al., 2022, Dong et al., 1 Mar 2025); a loss sketch follows this list.
- Virtual Outlier Synthesis (VOS), ARPL (adversarial reciprocal points), OpenMax, OpenGAN (Yang et al., 2022, Wang et al., 29 Aug 2024).
- Generative Semi-supervised and Adversarial Methods:
- FM-GAN, ARP-GAN, Margin-GAN, and Triple-GAN combine semi-supervised learning (SSL) and OSR under principled "bad-looking" sample regularization schemes (Engelbrecht et al., 2023).
- Graph-Specific GOSR: GraphDE (VAE-based), GSM (uncertainty/message-passing), OpenWGL (virtual outlier synthesis), GKDE, GPN (Dong et al., 1 Mar 2025).
- Semantic Prototype Guidance: VSG-CNN augments the decision boundary for OSR by leveraging semantic prototypes of seen classes (Geng et al., 2019).
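As a concrete example of the training-time regularization family, the sketch below implements an Outlier Exposure style objective in PyTorch: standard cross-entropy on seen classes plus a term that pushes auxiliary-outlier predictions toward the uniform distribution. The function name and weighting are illustrative, not the exact published formulation.

```python
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_in, targets_in, logits_out, lam=0.5):
    """Outlier Exposure style objective (minimal sketch).

    logits_in:  (B, K) logits for in-distribution (seen-class) samples.
    targets_in: (B,)   ground-truth seen-class labels.
    logits_out: (M, K) logits for auxiliary outlier samples.
    lam:        weight of the outlier term; 0.5 is an illustrative default.
    """
    # Standard cross-entropy on the seen classes.
    ce_in = F.cross_entropy(logits_in, targets_in)
    # Push outlier predictions toward the uniform distribution over the K classes:
    # the mean negative log-softmax equals the cross-entropy to a uniform target.
    ce_out = -F.log_softmax(logits_out, dim=1).mean()
    return ce_in + lam * ce_out

# Example with random tensors standing in for network outputs.
loss = outlier_exposure_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)), torch.randn(8, 5))
```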
A representative table summarizes the taxonomy (a minimal post-hoc scoring sketch follows the table):
| Method Family | Example Methods | Remarks |
|---|---|---|
| Post-hoc Scoring | MSP, KNN, Mahalanobis, Energy, VIM | Inference-only |
| Training-time Regularizer | OE, ARPL, OpenGAN, VOS | Real or synthesized outliers |
| Generative | FM-GAN, ARP-GAN, Triple-GAN, Margin-GAN | Adversarial, SSL |
| Graph-specific | GraphDE, GSM, OpenWGL | Structured data |
| Semantic Prototypes | VSG-CNN, CPL | Attribute-based, ZSL hybrid |
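The post-hoc scoring entries in the table can be illustrated with a short sketch deriving MSP, MLS, and Energy scores from classifier logits; the sign convention (higher = more likely unknown) and the temperature default are assumptions for exposition, and feature-based scores (Mahalanobis, KNN, VIM) are omitted because they additionally require training-set features.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def posthoc_unknown_scores(logits, temperature=1.0):
    """MSP, MLS, and Energy unknown scores from classifier logits (minimal sketch).

    Sign convention: higher = more likely unknown.
    """
    logits = np.asarray(logits, dtype=np.float64)
    probs = softmax(logits, axis=1)
    msp = -probs.max(axis=1)                   # negated maximum softmax probability
    mls = -logits.max(axis=1)                  # negated maximum logit score
    energy = -temperature * logsumexp(logits / temperature, axis=1)  # E(x) = -T*logsumexp(f/T)
    return {"MSP": msp, "MLS": mls, "Energy": energy}

# A confident (likely seen) sample vs. a flat, low-confidence one:
# every score comes out higher for the second sample.
print(posthoc_unknown_scores(np.array([[8.0, 1.0, 0.5], [1.1, 1.0, 0.9]])))
```

Any of these scores can be plugged into the thresholded decision rule of Section 1 by setting $s(x)$ to the negated unknown score.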
5. Representative Results and Empirical Findings
Empirical analysis reveals nuanced behavior depending on data scale, domain, and task split.
- Image Benchmarks:
- OE and MLS/Energy scoring excel on small datasets (CIFAR-10/100), with AUROC >99% in favorable settings (Wang et al., 29 Aug 2024, Yang et al., 2022).
- On ImageNet-scale and fine-grained splits (SSB, CUB-SSB), magnitude-aware scoring rules (MLS, Energy) show greater stability and effectiveness than OE, with cross-entropy baselines (CE+MLS/Energy) outperforming OE by 5–10 AUROC points (Wang et al., 29 Aug 2024).
- Data augmentation strategies (CutMix, Mixup, PixMix) can further improve G-OSR performance (Yang et al., 2022).
- Graph Benchmarks:
- GOSR-specific methods (GraphDE, GSM) outperform both closed-set GNNs and anomaly/GOODD baselines for AUROC, AUPR, and FPR95—with GraphDE showing AUROC up to 0.81 on graph-level data (Dong et al., 1 Mar 2025).
- As unseen-class ratio increases, all methods degrade, but GOSR methods are more robust (Dong et al., 1 Mar 2025).
- Generative Benchmarks: Margin-GAN achieves the best trade-off between closed-set accuracy and open-set AUROC, outperforming both FM-GAN and ARP-GAN in various SSL-OSR configurations (Engelbrecht et al., 2023).
- Semantic Prototype Guidance: VSG-CNN improves rejection accuracy and harmonic mean over visual-only prototypes (CPL), especially in fine-grained domains, and provides interpretable attribute predictions for rejected samples (Geng et al., 2019).
6. Theoretical and Methodological Insights
A key theoretical result is that "bad-looking" samples in SSL and OSR are formally analogous—both occupying the classifier’s complementary space and regularizing boundary regions (Engelbrecht et al., 2023). This underpins the unification of semi-supervised and open-set learning regimes.
Post-hoc scoring functions based on the magnitude or distributional properties of intermediate or output activations (e.g., KNN distances, Energy) are consistently competitive or superior, especially under limited or mismatched outlier exposure (Yang et al., 2022, Wang et al., 29 Aug 2024).
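As a minimal illustration of activation-based scoring, the sketch below computes a feature-space kNN unknown score: the distance from a test feature to its k-th nearest seen-class training feature. The L2 normalization, the value of k, and the brute-force distance computation are illustrative choices, not a specific benchmarked implementation.

```python
import numpy as np

def knn_unknown_score(train_feats, test_feats, k=10):
    """Feature-space kNN unknown score (minimal sketch of the idea above).

    train_feats: (N, D) penultimate-layer features of seen-class training samples.
    test_feats:  (M, D) features of test samples.
    Returns the distance from each test feature to its k-th nearest training
    feature; larger distances suggest the sample lies off the seen-class manifold.
    """
    def l2norm(x):
        x = np.asarray(x, dtype=np.float64)
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    tr, te = l2norm(train_feats), l2norm(test_feats)      # L2-normalize, a common choice
    dists = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=2)  # pairwise Euclidean
    return np.sort(dists, axis=1)[:, k - 1]               # k-th nearest-neighbor distance
```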
Semantic augmentation (prototypes, attributes) tightens decision boundaries and provides human-interpretable cues on "unknowns"—potentially aiding knowledge transfer or human–AI collaboration in open-world settings (Geng et al., 2019).
7. Open Problems and Future Directions
Despite progress, the G-OSR landscape presents several outstanding challenges and avenues for further research:
- Auxiliary Data Selection: The effectiveness of outlier exposure is highly sensitive to the choice and representativeness of auxiliary OOD data, particularly on large-scale or complex domains (Wang et al., 29 Aug 2024, Yang et al., 2022).
- Compositional and Dynamic Open-Set: Extension to evolving or compositional open sets, e.g., dynamic graphs, multi-label or hierarchical categories, remains nascent (Dong et al., 1 Mar 2025).
- Curated OOD Suites: Automatic synthesis or curation of OOD/outlier datasets for maximal generalizability is unsolved (Yang et al., 2022).
- Unified Evaluation: Many benchmarks report only AUROC and FPR95; comprehensive reporting (including OSCR, OAA, attribute-based rejection) is needed for fair comparison, especially in high-stakes domains.
- Continual and Incremental Methods: Mechanisms to add newly detected classes to $\mathcal{Y}_{\text{seen}}$ and support ongoing open-world learning are an open area (Dong et al., 1 Mar 2025).
- Foundation Model Integration: The impact of CLIP/ViT and other foundation models on G-OSR capabilities is an active research area (Yang et al., 2022).
The G-OSR paradigm provides a rigorous empirical and theoretical basis for evaluating model reliability in unconstrained, open-world scenarios and remains central to advancing safe, robust machine learning systems (Engelbrecht et al., 2023, Wang et al., 29 Aug 2024, Yang et al., 2022, Dong et al., 1 Mar 2025, Geng et al., 2019).