Unknown Recall (UR) in Open-World Detection

Updated 4 March 2026

Unknown Recall (UR) is a metric that evaluates a model's ability to detect items not observed during training, crucial for open-world object detection and retrieval.
Methods that target UR use techniques such as geometric pseudo-labels, superclass grouping, and concept decomposition to improve unknown detection and balance it against precision on known classes.
UR is applied in critical fields like autonomous driving, surveillance, and remote sensing, enabling robust identification of novel or unlabeled objects while managing trade-offs with known-class accuracy.

Unknown Recall (UR), also written “U-Recall” in open-world object detection (OWOD) and retrieval contexts, measures a model's ability to retrieve or detect items belonging to classes that were absent from training or are unlabeled/hidden during evaluation. The metric has become central for quantifying progress on open-world recognition and retrieval, as systems are increasingly required to operate beyond closed, fixed-label scenarios and to robustly identify novel or unannotated items. Beyond its standard role in OWOD benchmarks, the notion of unknown recall has also been explored for retrieval metrics, in scenarios where the total set of relevant items is itself unknown or fluid.

1. Formal Definition and Metric Computation

Unknown Recall is rigorously defined in multiple OWOD studies. For object detection, U-Recall quantifies the fraction of ground-truth objects labeled as "unknown" (i.e., not in the model’s list of known classes) that are correctly returned by the model as “unknown.” The canonical formula is:

$\mathrm{U\!-\!Recall} = \frac{N_{\mathrm{unk}^{+}}}{N_{\mathrm{unk}}}$

where $N_{\mathrm{unk}^{+}}$ is the number of true unknown instances detected as unknown, and $N_{\mathrm{unk}}$ is the total number of unknown objects in the test set (Yavuz et al., 2024, Lv et al., 24 Feb 2026). A predicted box counts as a true-positive unknown if it (a) is labeled “unknown,” (b) overlaps a not-yet-matched ground-truth unknown object with IoU ≥ 0.5, and (c) is not already assigned to a known-class prediction (Yavuz et al., 2024).
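The matching rule above can be sketched for a single image as follows. This is a minimal illustration, not the evaluation code of any cited benchmark: the function names `iou` and `u_recall`, the `(x1, y1, x2, y2)` box format, and the greedy highest-score-first matching are assumptions, and the input is assumed to contain only predictions that are labeled “unknown” and not already assigned to a known class.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def u_recall(pred_unknown: Sequence[Tuple[Box, float]],
             gt_unknown: Sequence[Box],
             iou_thr: float = 0.5) -> float:
    """Fraction of ground-truth unknown boxes matched by an 'unknown' prediction.

    pred_unknown holds (box, score) pairs that the detector labeled 'unknown'.
    """
    if not gt_unknown:
        return 0.0  # convention chosen for this sketch; the ratio is undefined
    matched = [False] * len(gt_unknown)
    # Greedy matching (an assumption): visit predictions by descending score.
    for box, _score in sorted(pred_unknown, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gt_unknown):
            if matched[j]:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched[best_j] = True
    return sum(matched) / len(gt_unknown)


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((100, 100, 110, 110), 0.8)]
print(u_recall(preds, gt))  # one of two unknown objects is found: 0.5
```

For a dataset-level figure, the matched and total counts would be summed over all images before dividing, rather than averaging per-image ratios.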

In retrieval scenarios with dynamic or evolving knowledge bases, recall faces the problem that its denominator, the number of relevant items $N_p$, may be entirely unknown. In that case the traditional recall

$R = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Relevant}|}$

is no longer computable, motivating a suite of proxies and recall-free measures for assessing performance under “unknown recall” conditions (Schwartz et al., 24 Dec 2025).

2. Unknown Recall in Open-World Object Detection

Unknown Recall is a principal metric for benchmarking OWOD systems, where training proceeds incrementally over subsets of classes, and the “unknown” class encapsulates any object outside the labeled training categories. Datasets (such as COCO split into phases) expose the model at each stage to a growing set of known categories, while requiring active discovery of previously unseen (“unknown”) classes. U-Recall is computed per evaluation split and task (Yavuz et al., 2024, Lv et al., 24 Feb 2026).

Crucially, methods that rely on appearance-based cues alone achieve baseline U-Recall, whereas geometric cues or explicit “unknown” scoring mechanisms (such as odd-one-out criteria or concept decomposition) are reported as necessary for state-of-the-art performance (Yavuz et al., 2024, Lv et al., 24 Feb 2026). In the O1O method, structuring known classes into superclasses and applying an odd-one-out scoring rule, based on sums of superclass-recalibrated class probabilities, yields marked gains in U-Recall while restoring or even boosting known-class mean average precision (mAP) (Yavuz et al., 2024).

The IPOW framework with Concept Decomposition Model (CDM) further advances U-Recall by dividing object features into Discriminative, Shared, and Background “concept” subspaces. Discriminative concepts separate known classes, Shared concepts capture transferable attributes that can signal novelty, and Background concepts distinguish object from non-object regions. The Concept-Guided Rectification (CGR) mechanism corrects for known-unknown confusion, resulting in substantial increases to U-Recall and reductions in open-set error (Lv et al., 24 Feb 2026).

3. Applications and Trade-offs

Unknown Recall is central to applications requiring robust open-set recognition, such as autonomous driving, surveillance, and remote sensing, where models must not only detect labeled classes but also surface “unknowns” that may require human review or trigger downstream discovery.

A salient trade-off arises between U-Recall and known-class precision: approaches that aggressively label many proposals as “unknown” can trivially raise U-Recall but cause an unacceptable collapse in known-class accuracy (measured as mAP or similar). To counterbalance this, state-of-the-art approaches enforce a post-NMS cap (e.g., ≤ 100 proposals per image), apply thresholding strategies, and adopt balance-oriented designs such as superclass grouping or CGR (Yavuz et al., 2024, Lv et al., 24 Feb 2026).
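The cap and thresholding constraints mentioned above can be illustrated with a short sketch. The helper name `cap_detections`, the `(label, score)` representation, and the `min_score` parameter are hypothetical and chosen only for illustration; real pipelines apply this step after non-maximum suppression and on full box records.

```python
from typing import List, Tuple

Detection = Tuple[str, float]  # (label, confidence score), an assumed format


def cap_detections(dets: List[Detection],
                   max_dets: int = 100,
                   min_score: float = 0.0) -> List[Detection]:
    """Keep at most max_dets detections per image, highest scores first.

    Detections scoring below min_score are dropped before the cap is applied.
    """
    kept = [d for d in dets if d[1] >= min_score]
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:max_dets]


# A detector that floods the image with low-confidence "unknown" boxes
# cannot push more than max_dets of them into the evaluated set.
dets = [("car", 0.9), ("dog", 0.8)] + [("unknown", 0.1)] * 200
kept = cap_detections(dets, max_dets=100)
print(len(kept))  # 100
```

Because known and unknown predictions compete for the same per-image budget, a model can no longer raise U-Recall simply by emitting an unbounded number of “unknown” boxes.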

Empirical results indicate that geometric pseudo-labels and explicit handling of semantic structure are critical for achieving high U-Recall without sacrificing reliability on the known set (Yavuz et al., 2024).

Method/Mechanism	U-Recall change	Effect on known-class mAP / confusion
Appearance cues, no pseudo-labels	Baseline	Baseline
Geometric pseudo-labels (GOOD RPN)	+19 percentage points	mAP drops, confusion increases
Superclass grouping	Approximately maintained	mAP restored
Concept decomposition + CGR (IPOW)	+7 to +11 percentage points	Known-unknown errors reduced

4. Unknown Recall in Retrieval

In information retrieval (IR), “Unknown Recall” refers to the situation in which standard recall cannot be computed because the set of relevant items in the corpus is unknown. Approaches to circumvent this include:

Fixed-benchmark evaluation: Use static test collections with full relevance judgments. This yields true recall and harmonic $F$-measures but does not scale.
Estimate-$N_p$ protocols: Replace the unknown denominator with a crude estimate, e.g., labeling the top 2K results for relevance to approximate recall ($F_e$), at the risk of bias.
Recall-free measures: Deploy metrics like nDCG or the top-$K$ selection measure $T_\alpha = (1-\alpha)n_p-\alpha\,n_n/K$, which bypass the unknown denominator and robustly track downstream LLM response quality (Schwartz et al., 24 Dec 2025).
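As a rough illustration of the recall-free measures listed above, the sketch below computes nDCG and one possible reading of $T_\alpha$ from relevance judgments on a retrieved list only. Several points are assumptions rather than statements of the cited method: $n_p$ and $n_n$ are taken to be the counts of relevant and non-relevant items among the top $K$, the whole expression is divided by $K$, the ideal ordering for nDCG is built from the judged items of the retrieved list (since the full relevant set is unknown), and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg(rels: Sequence[float]) -> float:
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))


def ndcg_at_k(rels: Sequence[float], k: int) -> float:
    """nDCG@k, normalised by the best ordering of the judged retrieved items.

    The ideal list uses only items in rels (an assumption of this sketch).
    """
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(list(rels[:k])) / ideal if ideal > 0 else 0.0


def t_alpha(rels: Sequence[float], k: int, alpha: float = 0.5) -> float:
    """Assumed reading of T_alpha: ((1 - alpha) * n_p - alpha * n_n) / K."""
    top = rels[:k]
    n_p = sum(1 for r in top if r > 0)  # relevant items in the top K
    n_n = len(top) - n_p                # non-relevant items in the top K
    return ((1 - alpha) * n_p - alpha * n_n) / k


judged = [1, 0, 1, 0]  # relevance labels of the top-4 retrieved items
print(ndcg_at_k(judged, 4), t_alpha(judged, 4, alpha=0.5))
```

Neither function needs $N_p$: both depend only on judgments for items that were actually retrieved, which is what makes them usable when the relevant set is unknown.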

Experimental evidence shows that the top-$K$ selection measure matches or outperforms the classical $F$-measure in correlation with LLM-rated answer quality, except in extremely high $K$ regimes where position in the ranking becomes critical (Schwartz et al., 24 Dec 2025).

5. Practical Implications and Limitations

Optimizing and interpreting Unknown Recall requires detailed attention to dataset splits, class increment schedules, and evaluation protocols. The possibility of trivial gains (e.g., always predicting “unknown”) is controlled via post-processing (NMS, cap constraints), balanced losses, and evidence aggregation.

Furthermore, interpretability frameworks such as IPOW with CDM and CGR provide both higher U-Recall and a transparent rationale for predictions: activations in the Discriminative and Shared concept spaces supply fine-grained explanations of why an object is or is not classified as “unknown,” and help to diagnose and suppress persistent known–unknown confusion (Lv et al., 24 Feb 2026).

A current limitation in IR evaluation is that measures relying on top-K estimates or recall-free approaches may become unreliable when the number of relevant documents or K drastically increases. Experimentation has so far been limited to small-to-moderate relevant set sizes $N_p$, with future work needed for true large-scale deployments (Schwartz et al., 24 Dec 2025).

6. Relation to Other Open-Set Metrics

Unknown Recall is closely tracked alongside other open-set performance metrics:

Wilderness Impact (WI): Measures the confusion between known and unknown predictions, penalizing false known predictions on unknowns (Lv et al., 24 Feb 2026).
Absolute Open-Set Error (A-OSE): Tallies the absolute number of misclassified unknowns as knowns (Lv et al., 24 Feb 2026).
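The two companion metrics can be sketched as below. The source gives only verbal descriptions, so the formula used for WI here, $P_K / P_{K \cup U} - 1$ (precision on a known-only test set divided by precision when unknowns are also present, minus one), is the commonly used definition and is stated as an assumption. The function names and the per-object label list are hypothetical.

```python
from typing import Optional, Sequence


def wilderness_impact(precision_closed: float, precision_open: float) -> float:
    """Assumed WI definition: P_K / P_{K u U} - 1.

    precision_closed: known-class precision on a test set without unknowns.
    precision_open:   known-class precision when unknown objects are present.
    """
    if precision_open <= 0:
        raise ValueError("precision_open must be positive")
    return precision_closed / precision_open - 1.0


def absolute_open_set_error(matched_labels: Sequence[Optional[str]]) -> int:
    """Count ground-truth unknown objects that were detected as a known class.

    matched_labels has one entry per ground-truth unknown object: the label of
    the prediction matched to it, or None if nothing matched.
    """
    return sum(1 for label in matched_labels
               if label is not None and label != "unknown")


# Four unknown objects: two mistaken for known classes, one found, one missed.
print(absolute_open_set_error(["car", "unknown", None, "dog"]))  # 2
print(wilderness_impact(0.80, 0.75))
```

Under this reading, WI is zero when unknown objects do not degrade known-class precision, and both metrics are lower-is-better, in contrast to U-Recall.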

Comparative studies report that approaches such as O1O and IPOW provide significant absolute and relative gains in U-Recall (up to +11.6 percentage points, ≈ +37% relative) over prior methods, without undermining known-class accuracy. Ablation studies corroborate that Shared concept learning is essential for high U-Recall, while additional architectural and rectification strategies control the known–unknown trade-off (Yavuz et al., 2024, Lv et al., 24 Feb 2026).

7. Summary and Research Frontiers

Unknown Recall has become a central metric for open-world detection and for retrieval under uncertainty. Advances in feature representation (superclass grouping, geometric cues, concept decomposition), score calibration, and interpretability have produced substantial analytic and practical gains.

Prospective research directions include the extension of recall-free retrieval metrics to domains with much larger numbers of relevant items, integration of ranking-sensitive modifications, and more sophisticated balance of unknown detection and known-class discrimination. The interpretability insights enabled by frameworks such as IPOW open paths toward model debugging, auditing, and robust deployment in high-stakes settings (Yavuz et al., 2024, Schwartz et al., 24 Dec 2025, Lv et al., 24 Feb 2026).