
G-OSR Benchmark

Updated 25 November 2025
  • The G-OSR Benchmark is a comprehensive framework that defines evaluation protocols, datasets, and methods for detecting unseen classes and managing predictive uncertainty.
  • It integrates techniques such as post-hoc scoring, training-time regularization, and generative methods to optimize both known-class accuracy and unknown-class rejection.
  • Empirical evaluations on image and graph benchmarks, reported via metrics such as AUROC and FPR95, demonstrate strong performance and highlight the framework's robustness in real-world applications.

The G-OSR Benchmark refers to a family of evaluation protocols, datasets, and methodologies for assessing open-set recognition (OSR) as well as open-world and out-of-distribution (OOD) detection, often under the "generalized" open-set setting. The G-OSR paradigm has evolved to incorporate recognition, detection, and uncertainty estimation under both standard and specialized setups, including images, graphs, and multimodal tasks. This article provides a comprehensive synthesis of representative G-OSR benchmarks, with emphasis on their formal definitions, protocols, methods, and key findings as established in the literature.

1. Formal Problem Definitions

G-OSR tasks generalize classical closed-set classification by introducing "unseen" or "unknown" categories at test time. Let $\mathcal{C}_{\mathrm{seen}}$ denote the set of labeled training classes and $\mathcal{C}_{\mathrm{unseen}}$ the set of classes exclusive to test time, with $\mathcal{C}_{\mathrm{seen}} \cap \mathcal{C}_{\mathrm{unseen}} = \emptyset$. The learning objective is twofold:

  • Known-class decision: Maximize classification accuracy on samples $x$ with label $y \in \mathcal{C}_{\mathrm{seen}}$.
  • Unknown-class detection: Detect and reject samples from $\mathcal{C}_{\mathrm{unseen}}$ with high precision.

For a model outputting a softmax vector over seen classes and an "unknown" score ($s_{\mathrm{unk}}(x) = 1 - \max_k s_k(x)$ or analogous), the decision rule involves a threshold $\tau$:

$$\hat{y} = \begin{cases} \arg\max_{k \in \mathcal{C}_{\mathrm{seen}}} s_k(x), & \text{if } \max_k s_k(x) \geq \tau \\ \text{``unknown''}, & \text{otherwise} \end{cases}$$

This formulation is canonical for both tabular/image (Wang et al., 29 Aug 2024, Yang et al., 2022, Engelbrecht et al., 2023, Geng et al., 2019) and graph-structured data (Dong et al., 1 Mar 2025).
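
In code, this decision rule amounts to thresholding the maximum softmax score. A minimal sketch, assuming `probs` is an (N, K) array of softmax outputs over the seen classes and that the threshold `tau` is tuned on a validation split:

```python
import numpy as np

def open_set_predict(probs: np.ndarray, tau: float) -> np.ndarray:
    """Return class indices in [0, K) for accepted samples, -1 for 'unknown'."""
    max_scores = probs.max(axis=1)    # max_k s_k(x)
    preds = probs.argmax(axis=1)      # argmax_k s_k(x) over seen classes
    preds[max_scores < tau] = -1      # reject below the threshold tau
    return preds

# Example: two confident in-distribution samples and one ambiguous one.
probs = np.array([[0.9, 0.05, 0.05],
                  [0.1, 0.8, 0.1],
                  [0.4, 0.3, 0.3]])
print(open_set_predict(probs, tau=0.5))   # -> [ 0  1 -1]
```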

2. Benchmark Design and Dataset Suites

G-OSR benchmarks are instantiated on a range of domains and data types, spanning image suites (CIFAR-10/100, ImageNet-scale, and fine-grained splits such as SSB and CUB-SSB) as well as graph-structured datasets.

The benchmarks typically enforce:

  • Training exclusively on $\mathcal{C}_{\mathrm{seen}}$.
  • Test-time exposure to both seen and unseen classes, with the model tasked to both assign correct labels and detect/reject unknowns (a minimal split-construction sketch follows below).
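
A minimal sketch of this split-construction protocol, assuming a generic labeled dataset `(X, y)` with integer labels; the helper name and the 80/20 train/test ratio are illustrative, not prescribed by any particular benchmark:

```python
import numpy as np

def make_open_set_split(X, y, seen_classes, rng=None):
    """Train only on seen classes; test on held-out seen plus all unseen samples."""
    if rng is None:
        rng = np.random.default_rng(0)
    seen_mask = np.isin(y, list(seen_classes))

    # Training pool: seen classes only, with a random 80/20 train/test split.
    seen_idx = np.flatnonzero(seen_mask)
    rng.shuffle(seen_idx)
    cut = int(0.8 * len(seen_idx))
    train_idx, seen_test_idx = seen_idx[:cut], seen_idx[cut:]

    # Test pool: held-out seen samples plus *all* unseen-class samples,
    # the latter relabeled as -1 ("unknown") for evaluation.
    unseen_idx = np.flatnonzero(~seen_mask)
    test_idx = np.concatenate([seen_test_idx, unseen_idx])
    y_test = y[test_idx].copy()
    y_test[np.isin(y_test, list(seen_classes), invert=True)] = -1
    return X[train_idx], y[train_idx], X[test_idx], y_test
```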

Semi-supervised and adversarial sampling regimes are explored, including synthetic "bad-looking" sample generation via GANs (Engelbrecht et al., 2023).

3. Evaluation Protocols and Metrics

Assessment is conducted on both classification and rejection performance using established metrics, including AUROC, AUPR, FPR95, OSCR, and the harmonic mean $H$ of known- and unknown-class performance.

Test splits are constructed to provide a balanced and rigorous quantification of both ID and OOD/unknown detection capability. Some benchmarks average over multiple random splits (Yang et al., 2022, Dong et al., 1 Mar 2025).
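
A minimal sketch of the two headline detection metrics, AUROC and FPR95, assuming higher scores mean "more likely unknown" and using scikit-learn's ROC utilities; the function name is illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auroc_fpr95(scores_id: np.ndarray, scores_ood: np.ndarray):
    """AUROC and FPR at 95% TPR for the ID-vs-OOD detection problem."""
    y_true = np.concatenate([np.zeros_like(scores_id), np.ones_like(scores_ood)])
    y_score = np.concatenate([scores_id, scores_ood])
    auroc = roc_auc_score(y_true, y_score)

    # FPR95: false-positive rate at the first operating point where the
    # true-positive rate (recall on unknowns) reaches 95%.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    idx = min(np.searchsorted(tpr, 0.95), len(fpr) - 1)
    return auroc, fpr[idx]
```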

4. Methodological Taxonomy

Methods benchmarked under the G-OSR paradigm fall into distinct categories:

A representative table summarizes the taxonomy:

Method Family | Example Methods | Remarks
Post-hoc Scoring | MSP, KNN, Mahalanobis, Energy, VIM | Inference-only
Training-time Regularizer | OE, ARPL, OpenGAN, VOS, Virtual Outlier | Requires outlier data
Generative | FM-GAN, ARP-GAN, Triple-GAN, Margin-GAN | Adversarial, SSL
Graph-specific | GraphDE, GSM, OpenWGL | Structured data
Semantic Prototypes | VSG-CNN, CPL | Attribute-based, ZSL hybrid
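
To make "inference-only" concrete, the following is a sketch of two post-hoc rules from the table, MSP and Energy, computed directly from the logits of a frozen classifier; the temperature `T` is a free parameter, and exact conventions vary across papers:

```python
import numpy as np
from scipy.special import logsumexp, softmax

def msp_score(logits: np.ndarray) -> np.ndarray:
    """Maximum Softmax Probability: higher = more likely in-distribution."""
    return softmax(logits, axis=1).max(axis=1)

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Negative free energy T * logsumexp(logits / T): higher = more ID."""
    return T * logsumexp(logits / T, axis=1)
```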

5. Representative Results and Empirical Findings

Empirical analysis reveals nuanced behavior depending on data scale, domain, and task split.

  • Image Benchmarks:
    • OE and MLS/Energy scoring excel on small datasets (CIFAR-10/100), with AUROC >99% in favorable settings (Wang et al., 29 Aug 2024, Yang et al., 2022).
    • On ImageNet-scale and fine-grained splits (SSB, CUB-SSB), magnitude-aware scoring rules (MLS, Energy) show greater stability and effectiveness than OE, with cross-entropy baselines (CE+MLS/Energy) outperforming OE by 5–10 AUROC points (Wang et al., 29 Aug 2024).
    • Data augmentation strategies (CutMix, Mixup, PixMix) can further improve G-OSR performance (Yang et al., 2022).
  • Graph Benchmarks:
    • GOSR-specific methods (GraphDE, GSM) outperform both closed-set GNNs and anomaly/GOODD baselines for AUROC, AUPR, and FPR95—with GraphDE showing AUROC up to 0.81 on graph-level data (Dong et al., 1 Mar 2025).
    • As unseen-class ratio increases, all methods degrade, but GOSR methods are more robust (Dong et al., 1 Mar 2025).
  • Generative Benchmarks: Margin-GAN achieves the best trade-off between closed-set accuracy and open-set AUROC, outperforming both FM-GAN and ARP-GAN in various SSL-OSR configurations (Engelbrecht et al., 2023).
  • Semantic Prototype Guidance: VSG-CNN improves rejection accuracy and the harmonic mean $H$ over visual-only prototypes (CPL), especially in fine-grained domains, and provides interpretable attribute predictions for rejected samples (Geng et al., 2019).

6. Theoretical and Methodological Insights

A key theoretical result is that "bad-looking" samples in SSL and OSR are formally analogous—both occupying the classifier’s complementary space and regularizing boundary regions (Engelbrecht et al., 2023). This underpins the unification of semi-supervised and open-set learning regimes.

Post-hoc scoring functions based on the magnitude or distributional properties of intermediate or output activations (e.g., KNN distances, Energy) are consistently competitive or superior, especially under limited or mismatched outlier exposure (Yang et al., 2022, Wang et al., 29 Aug 2024).
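
As an illustration of a distance-based post-hoc score, here is a sketch of a KNN rule in the spirit described above, assuming `train_feats` holds penultimate-layer features of the training set; the choice `k=50` and the L2 normalization are illustrative defaults, not a fixed prescription:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_ood_score(train_feats: np.ndarray, test_feats: np.ndarray, k: int = 50):
    """Distance to the k-th nearest training feature: larger = more likely OOD."""
    def normalize(z):
        # L2-normalize so Euclidean distance tracks cosine similarity.
        return z / np.linalg.norm(z, axis=1, keepdims=True)

    nn = NearestNeighbors(n_neighbors=k).fit(normalize(train_feats))
    dists, _ = nn.kneighbors(normalize(test_feats))
    return dists[:, -1]  # distance to the k-th neighbor
```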

Semantic augmentation (prototypes, attributes) tightens decision boundaries and provides human-interpretable cues on "unknowns"—potentially aiding knowledge transfer or human–AI collaboration in open-world settings (Geng et al., 2019).

7. Open Problems and Future Directions

Despite progress, the G-OSR landscape presents several outstanding challenges and avenues for further research:

  • Auxiliary Data Selection: The effectiveness of outlier exposure is highly sensitive to the choice and representativeness of auxiliary OOD data, particularly on large-scale or complex domains (Wang et al., 29 Aug 2024, Yang et al., 2022).
  • Compositional and Dynamic Open-Set: Extension to evolving or compositional open sets, e.g., dynamic graphs, multi-label or hierarchical categories, remains nascent (Dong et al., 1 Mar 2025).
  • Curated OOD Suites: Automatic synthesis or curation of OOD/outlier datasets for maximal generalizability is unsolved (Yang et al., 2022).
  • Unified Evaluation: Many benchmarks report only AUROC and FPR95; comprehensive reporting (including OSCR, OAA, attribute-based rejection) is needed for fair comparison, especially in high-stakes domains.
  • Continual and Incremental Methods: Mechanisms to add newly detected classes to $\mathcal{C}_{\mathrm{seen}}$ and support ongoing open-world learning are an open area (Dong et al., 1 Mar 2025).
  • Foundation Model Integration: The impact of CLIP/ViT and other foundation models on G-OSR capabilities is an active research area (Yang et al., 2022).

The G-OSR paradigm provides a rigorous empirical and theoretical basis for evaluating model reliability in unconstrained, open-world scenarios and remains central to advancing safe, robust machine learning systems (Engelbrecht et al., 2023, Wang et al., 29 Aug 2024, Yang et al., 2022, Dong et al., 1 Mar 2025, Geng et al., 2019).
