Class-Agnostic Mask Proposal Generation

Updated 1 June 2026

Class-Agnostic Mask Proposal Generation is a technique that generates candidate object masks without relying on predefined class labels.
It enables robust few-shot, open-vocabulary, and weakly supervised segmentation by decoupling mask creation from semantic classification.
Methodologies range from transformer-based mask decoders to superpixel refinement, achieving high instance recall and quality metrics.

Class-agnostic mask proposal generation is a foundational technique for object segmentation and scene parsing, where the objective is to generate candidate masks that delineate possible object regions without relying on category or class information. Unlike class-specific proposal systems, class-agnostic approaches disentangle mask generation from semantic classification, aiming to capture regions likely to correspond to real-world instances, including those of unseen or rare categories. These methods are crucial in applications including few-shot segmentation, weakly supervised learning, open-vocabulary segmentation, medical imaging, and 3D scene understanding, and have become a core component in state-of-the-art vision pipelines.

1. Core Principles and Motivation

The primary rationale for class-agnostic mask proposal generation is to enable versatility and generalization. By training or designing segmenters that ignore class semantics, these systems can:

Enable robust few-shot and open-vocabulary segmentation, where novel object classes may be encountered at test time and must be matched or classified downstream (Jiao et al., 2022).
Provide strong coverage of true object boundaries even in the absence of full supervision by relying on an “objectness” cue rather than category membership, facilitating downstream tasks such as refinement, merging, or open-ended instance retrieval (Xie et al., 2021, Wang et al., 2021).
Avoid the tight coupling between geometry and recognition losses that frequently limits generalization in end-to-end class-specific heads (Shahabodini et al., 26 May 2025, Xie et al., 19 Nov 2025).

Class-agnostic mask proposal generators are typically evaluated with metrics that measure instance coverage (recall at IoU thresholds), segmentation quality (mIoU, Dice), and proposal efficiency, independently of class labels.
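The coverage and quality metrics named above can be sketched in a few lines. This is a minimal illustration, assuming boolean NumPy masks of equal shape; the function names (`mask_iou`, `dice`, `recall_at_iou`) and the toy masks are hypothetical and not taken from any cited benchmark.

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def dice(a, b):
    """Dice coefficient between two boolean masks."""
    total = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / total if total > 0 else 0.0

def recall_at_iou(proposals, gt_masks, thresh=0.5):
    """Fraction of ground-truth masks covered by at least one proposal
    with IoU >= thresh. Class labels are never consulted."""
    if not gt_masks:
        return 0.0
    hits = sum(
        any(mask_iou(p, g) >= thresh for p in proposals) for g in gt_masks
    )
    return hits / len(gt_masks)

# Toy 4x4 example: two ground-truth objects, two proposals.
gt1 = np.zeros((4, 4), bool); gt1[:2, :2] = True   # 4 pixels
gt2 = np.zeros((4, 4), bool); gt2[2:, 2:] = True   # 4 pixels
p1 = np.zeros((4, 4), bool); p1[:2, :2] = True     # exact match for gt1
p2 = np.zeros((4, 4), bool); p2[3, 3] = True       # 1 of gt2's 4 pixels

recall_50 = recall_at_iou([p1, p2], [gt1, gt2], thresh=0.5)
```

Here only `gt1` is covered at IoU ≥ 0.5, so the recall is one half; `p2` overlaps `gt2` with IoU 0.25 and therefore does not count.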

2. Architectural Approaches

Multiple architectural paradigms have emerged for class-agnostic mask proposal generation across 2D and 3D modalities:

Transformer-based Generators

Modern proposal generators frequently adopt a mask-query-based transformer head, as in Mask2Former/OneFormer-style designs. These:

Extract multi-scale CNN or Transformer features, generate a fixed set of learnable mask queries, and iteratively refine them through transformer decoder layers (Jiao et al., 2022, Shahabodini et al., 26 May 2025).
Output N dense masks (e.g., N=100–250), with each mask produced solely by a mask head (linear projection of the query embedding onto the high-res feature map with sigmoid activation).
Omit any per-proposal classification head to maintain strict category agnosticism.

For example, MM-Former’s Potential Objects Segmenter (POS) uses a ResNet-50 backbone with frozen weights and a three-layer transformer decoder with N mask queries. It outputs 100 class-agnostic soft masks per image and is trained only with a dice loss, using optimal mask-to-GT assignment via the Hungarian algorithm (Jiao et al., 2022). Leading segmentation adapters such as ViT-P simply employ a frozen, pre-trained mask proposal generator (e.g. OneFormer), discard class outputs, and post-process the resulting masks (Shahabodini et al., 26 May 2025).
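The mask head described in this subsection (projecting each query embedding onto a high-resolution feature map and applying a sigmoid) can be sketched as a per-pixel dot product. This is a simplified illustration in NumPy, not the implementation of any cited system; the sizes and the random inputs are assumptions chosen only to make the shapes concrete.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mask_head(queries, pixel_features):
    """queries: (N, C) refined query embeddings.
    pixel_features: (C, H, W) high-resolution per-pixel embeddings.
    Returns (N, H, W) soft masks; no class logits are produced."""
    logits = np.einsum("nc,chw->nhw", queries, pixel_features)
    return sigmoid(logits)

rng = np.random.default_rng(0)
N, C, H, W = 100, 16, 8, 8          # illustrative sizes only
queries = rng.normal(size=(N, C))
pixel_features = rng.normal(size=(C, H, W))

soft_masks = mask_head(queries, pixel_features)   # one soft mask per query
binary_masks = soft_masks > 0.5                   # simple post-hoc binarization
```

Because the output has one channel per query and nothing else, category agnosticism is structural: there is no per-proposal class score to discard.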

Superpixel and Graph-based Methods

The use of superpixels as elementary regions allows for the refinement of coarse proposal masks:

Classic approaches like DeepFH extend the Felzenszwalb–Huttenlocher algorithm by incorporating deep semantic features in the edge weights for oversegmentation (Wilms et al., 2021).
A shallow network is trained to produce discriminative features per superpixel, improving the boundary alignment with true object contours over solely color-based criteria (Wilms et al., 2021).
Superpixel-based refinement modules then pool backbone and mask-prior features per superpixel, train a binary classifier over each, and reassemble masks by thresholding predictions and aggregating superpixels (Wilms et al., 2021, Wilms et al., 2021).

This pipeline is effective in recovering thin structures lost in the downsampling of CNN backbones and increases AR (average recall) and boundary metrics versus vanilla CNN proposals.
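The pool-classify-reassemble idea can be illustrated with a deliberately reduced sketch. It assumes that only the coarse mask prior is pooled per superpixel and that a fixed threshold stands in for the learned per-superpixel classifier; the function name `refine_with_superpixels` and the toy data are hypothetical.

```python
import numpy as np

def refine_with_superpixels(prior, superpixels, thresh=0.5):
    """prior: (H, W) coarse soft mask in [0, 1].
    superpixels: (H, W) integer label map with labels 0..K-1.
    Returns a boolean mask assembled from whole superpixels."""
    labels = superpixels.ravel()
    k = labels.max() + 1
    # Average-pool the prior inside each superpixel.
    sums = np.bincount(labels, weights=prior.ravel(), minlength=k)
    counts = np.bincount(labels, minlength=k)
    scores = sums / np.maximum(counts, 1)
    # Assumption: a fixed threshold replaces the trained binary classifier.
    keep = scores >= thresh
    # Reassemble the mask by broadcasting each decision to its pixels.
    return keep[superpixels]

superpixels = np.array([[0, 0, 1, 1],
                        [0, 0, 1, 1]])
prior = np.array([[0.9, 0.8, 0.6, 0.1],
                  [0.7, 0.4, 0.2, 0.1]])
refined = refine_with_superpixels(prior, superpixels)
```

In this toy case superpixel 0 averages 0.7 and is kept, while superpixel 1 averages 0.25 and is dropped, so the refined mask follows superpixel boundaries rather than the noisy per-pixel prior.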

Binary Mask Heads in R-CNN Variants

Legacy but still effective approaches adapt Mask R-CNN architectures with the following modifications:

Merge all semantic classes into a single binary “foreground” during training, producing class-agnostic proposals (Luiten et al., 2018).
The region proposal and mask heads are trained with standard objectness, bbox, and binary mask losses over all object instances.
Downstream, these class-agnostic masks can be refined or merged based on geometric or temporal consistency.

This approach generalizes well to tasks like video object segmentation and unsupervised object discovery (Luiten et al., 2018).
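Collapsing all categories into one foreground label is mostly a data-preparation step. The sketch below assumes a simple annotation format (a list of dicts with `mask` and `category_id` keys), which is a hypothetical stand-in for whatever format a given dataset uses.

```python
import numpy as np

def to_class_agnostic(instances):
    """instances: list of dicts with 'mask' (H, W bool) and 'category_id'.
    Returns training targets where every instance is labelled foreground (1)
    and the original category is dropped."""
    return [{"mask": inst["mask"], "label": 1} for inst in instances]

def foreground_map(instances, shape):
    """Union of all instance masks: a binary foreground/background map."""
    fg = np.zeros(shape, dtype=bool)
    for inst in instances:
        fg |= inst["mask"]
    return fg

a = np.zeros((3, 3), bool); a[0, 0] = True
b = np.zeros((3, 3), bool); b[2, 1:] = True
annotations = [{"mask": a, "category_id": 17},
               {"mask": b, "category_id": 3}]

targets = to_class_agnostic(annotations)
fg = foreground_map(annotations, (3, 3))
```

Instance identity is preserved (one target per object), so the mask head still learns to separate instances; only the category signal is removed.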

Class-Agnostic Prior Modules and Knowledge Mining

Class-agnostic “objectness” priors can be automatically generated by mining correlations between a set of base-class prototypes and the query features:

The Class-agnostic Knowledge Mining Module in JC²A computes weighted-average embeddings from base-class instances, then correlates them over query features to generate a heat-map highlighting object regions, used as an extra input to the segmentation decoder (Huang et al., 2022).
This design can be fused with class-aware prototypes, acting as a soft objectness prior improved via shared training loss.
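A rough sketch of prototype-based objectness mining follows. It assumes cosine similarity as the correlation measure and a single weighted-average prototype; both are illustrative choices rather than details confirmed for JC²A, and the function name `objectness_prior` is hypothetical.

```python
import numpy as np

def objectness_prior(base_prototypes, weights, query_features, eps=1e-8):
    """base_prototypes: (K, C), one embedding per base class.
    weights: (K,) non-negative averaging weights.
    query_features: (C, H, W).
    Returns an (H, W) cosine-similarity heat-map."""
    w = weights / (weights.sum() + eps)
    proto = w @ base_prototypes                       # (C,) weighted average
    proto = proto / (np.linalg.norm(proto) + eps)
    norms = np.linalg.norm(query_features, axis=0, keepdims=True)
    feats = query_features / (norms + eps)
    return np.einsum("c,chw->hw", proto, feats)

prototypes = np.array([[2.0, 0.0],
                       [4.0, 0.0]])
weights = np.array([1.0, 1.0])
# Two pixels: the first aligned with the prototype, the second orthogonal.
query = np.array([[[5.0, 0.0]],
                  [[0.0, 3.0]]])                      # shape (C=2, H=1, W=2)

heat = objectness_prior(prototypes, weights, query)
```

The aligned pixel scores close to 1 and the orthogonal pixel close to 0; such a map could then be concatenated to the decoder input as a soft prior.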

Weakly-supervised and Saliency-driven Mask Proposals

Methods for weak/box-supervision or limited-annotation settings leverage class-agnostic object localization cues from auxiliary sources:

BoxCaseg trains a segmentation model jointly on box-annotated and salient single-object images, relying on multi-instance learning and tightness priors, and applies geometric merging/dropping post-hoc for mask selection (Wang et al., 2021).
Bi-level EM-style optimization can alternately train a segmentation model and a pseudo-mask generator on an auxiliary dataset, then deploy the learned generator to produce proposals on a target weakly-supervised set (Xie et al., 2021).
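To make the multi-instance learning and tightness idea concrete, here is a generic MIL formulation for a tight bounding box. It is a common textbook-style construction and not the specific loss of any cited method; the function name `mil_box_loss` and the half-open box convention are assumptions.

```python
import numpy as np

def mil_box_loss(prob, box, eps=1e-7):
    """prob: (H, W) predicted foreground probabilities.
    box: (y0, x0, y1, x1), half-open tight bounding box.
    Positive bags: each row and each column inside the box should contain
    at least one foreground pixel (max-pooling gives the bag score).
    Negatives: every pixel outside the box should be background."""
    y0, x0, y1, x1 = box
    inside = prob[y0:y1, x0:x1]
    bag_scores = np.concatenate([inside.max(axis=1), inside.max(axis=0)])
    pos_loss = -np.log(np.clip(bag_scores, eps, 1.0)).mean()
    outside = np.ones_like(prob, dtype=bool)
    outside[y0:y1, x0:x1] = False
    neg = prob[outside]
    neg_loss = -np.log(np.clip(1.0 - neg, eps, 1.0)).mean() if neg.size else 0.0
    return pos_loss + neg_loss

box = (1, 1, 3, 3)
good = np.full((4, 4), 0.01); good[1:3, 1:3] = 0.99   # fills the box
empty = np.full((4, 4), 0.01)                          # predicts nothing

loss_good = mil_box_loss(good, box)
loss_empty = mil_box_loss(empty, box)
```

The empty prediction is penalized heavily because no row or column bag inside the box contains a confident foreground pixel, even though no mask annotation was used.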

3. Loss Functions and Training Protocols

Class-agnostic generators are typically trained with mask/geometry-focused losses:

Dice loss or IoU for mask-region matching: ensuring proposed masks align with annotated regions without class dependence (Jiao et al., 2022, Shahabodini et al., 26 May 2025, Xie et al., 19 Nov 2025).
Pixelwise binary cross-entropy for foreground/background segmentation (Xie et al., 19 Nov 2025, Wilms et al., 2021).
Hungarian matching (permutation-invariant assignment) for one-to-one matching between proposals and ground-truths, with no class term in the cost (Jiao et al., 2022, Shahabodini et al., 26 May 2025).
For weakly supervised and pseudo-label settings, MIL (multiple instance learning), tightness, and auxiliary consistency constraints are utilized to refine mask quality (Wang et al., 2021, Xie et al., 2021).

Typical architectures fully omit any class or category prediction branch; if included (as in ablations), the category head impairs class-agnostic recall and segmentation mIoU (Jiao et al., 2022).
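The combination of a Dice cost and one-to-one matching with no class term can be sketched as follows. To stay dependency-free, the example finds the optimal assignment by brute force over permutations, which is a stand-in for the Hungarian algorithm and only viable for a handful of masks; in practice a solver such as SciPy's `linear_sum_assignment` would normally be used. The smoothing constant and function names are assumptions.

```python
import numpy as np
from itertools import permutations

def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss between a soft mask and a binary target, both (H, W)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def match_and_loss(preds, targets):
    """preds: (N, H, W) soft masks; targets: (M, H, W) binary, with M <= N.
    Returns (assignment, loss), where assignment[j] is the index of the
    proposal matched to target j. The cost contains no class term."""
    n, m = len(preds), len(targets)
    cost = np.array([[dice_loss(p, t) for t in targets] for p in preds])
    # Brute-force search over one-to-one assignments (tiny inputs only).
    best = min(permutations(range(n), m),
               key=lambda perm: sum(cost[perm[j], j] for j in range(m)))
    loss = sum(cost[best[j], j] for j in range(m)) / m
    return list(best), loss

t0 = np.zeros((4, 4)); t0[:2, :2] = 1.0
t1 = np.zeros((4, 4)); t1[2:, 2:] = 1.0
targets = np.stack([t0, t1])
# Proposal 0 resembles t1, proposal 1 is empty, proposal 2 resembles t0.
preds = np.stack([0.9 * t1, np.zeros((4, 4)), 0.8 * t0])

assignment, loss = match_and_loss(preds, targets)
```

Unmatched proposals (here the empty one) receive no mask loss in this sketch; how they are treated in a full training recipe varies between systems.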

4. 3D and Spatiotemporal Class-Agnostic Proposals

With the rise of 2D mask foundation models, class-agnostic mask proposals are now widely used to bootstrap proposal generation in 3D scenes and video:

Open3DIS aggregates 2D open-vocabulary masks from multi-frame observations, lifts them into geometrically aligned 3D superpoint clusters, and merges these via hierarchical agglomeration, yielding proposals that efficiently localize even ambiguous/small objects (Nguyen et al., 2023).
Any3DIS and related frameworks employ a 3D-aware mask tracking module, selecting pivot views for stable initialization and dynamically tracking consistent 2D proposals across frames. A dynamic programming module then refines final 3D proposals, eliminating redundancy and fragmentation inherent to view-independent lifting approaches (Nguyen et al., 2024).
Granularity-consistency-enforced frameworks propagate 2D masks across sequences, suppressing over-segmentation and maintaining uniform candidate detail, with proposals subsequently used as pseudo-labels to train 3D instance networks through a staged curriculum (Wang et al., 2 Nov 2025).
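The step of lifting a 2D mask onto 3D superpoints can be illustrated with a single-frame voting sketch. It assumes that point-to-pixel projections and visibility flags are already available and that a superpoint is selected when a minimum fraction of its visible points falls inside the mask; this is a simplification for illustration, not the procedure of Open3DIS or Any3DIS, and all names are hypothetical.

```python
import numpy as np

def lift_mask_to_superpoints(mask2d, pixel_uv, visible, superpoint_ids,
                             min_cover=0.5):
    """mask2d: (H, W) bool 2D proposal in one frame.
    pixel_uv: (P, 2) integer (row, col) projection of each 3D point.
    visible: (P,) bool, True where the point is visible in this frame.
    superpoint_ids: (P,) integer superpoint label per point, 0..S-1.
    Returns (S,) bool: superpoints whose visible points mostly lie in the mask."""
    s = superpoint_ids.max() + 1
    in_mask = np.zeros(len(pixel_uv), dtype=bool)
    in_mask[visible] = mask2d[pixel_uv[visible, 0], pixel_uv[visible, 1]]
    covered = np.bincount(superpoint_ids, weights=in_mask.astype(float), minlength=s)
    seen = np.bincount(superpoint_ids, weights=visible.astype(float), minlength=s)
    frac = np.divide(covered, seen, out=np.zeros(s), where=seen > 0)
    return frac >= min_cover

mask2d = np.array([[True, True, False],
                   [False, False, False]])
pixel_uv = np.array([[0, 0], [0, 1], [1, 0], [0, 2], [1, 1], [0, 0]])
visible = np.array([True, True, True, True, True, False])
superpoint_ids = np.array([0, 0, 0, 1, 1, 2])

selected = lift_mask_to_superpoints(mask2d, pixel_uv, visible, superpoint_ids)
```

Superpoint 0 has two of three visible points inside the mask and is selected; superpoint 1 has none, and superpoint 2 is not visible in this frame. A multi-frame system would accumulate such votes across views before merging.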

The following table summarizes representative published systems and their architectural paradigm for class-agnostic mask proposals:

Reference	System	Modality	Core Proposal Mechanism
(Jiao et al., 2022)	MM-Former	2D	Transformer mask decoder, N queries
(Wilms et al., 2021)	DeepFH	2D	Deep-feature superpixels, classifier
(Luiten et al., 2018)	PReMVOS	2D/Video	Mask R-CNN binary mask head
(Nguyen et al., 2023)	Open3DIS	3D	2D-guided 3D superpoint clustering
(Nguyen et al., 2024)	Any3DIS	3D	SAM-2 2D tracking + DP refinement
(Huang et al., 2022)	JC²A CKMM	2D	Averaged base-class feature mining

5. Empirical Performance and Ablation Studies

Empirical results consistently show that class-agnostic proposal generators achieve near upper-bound instance recall and high mask quality, supporting their downstream utility:

For MM-Former POS, oracle mIoU on Pascal-5ⁱ is 82.5% with N=100 proposals. Omission of the classification head increases proposal quality (oracle mIoU: 82.5% vs. 78.9% with class head) (Jiao et al., 2022).
DeepFH superpixel refinement achieves AR@100 = 0.213 on COCO+LVIS, with sharp boundary adherence evidenced by a boundary recall of 0.700, the best in its cohort (Wilms et al., 2021).
Weakly supervised BoxCaseg mask proposals, after merge/drop, deliver mIoU*=71.8% and IoU@50=83.4% on PASCAL, with final proxy Mask R-CNN AP_50=67.6%, matching full supervision (Wang et al., 2021).
In 3D, Any3DIS achieves AP=22.2 and recall RC=63.8 (IoU≥0.5) on ScanNet++, substantially outperforming prior art (Nguyen et al., 2024).

Ablation studies further show that increasing the number of mask proposals increases both upper-bound mIoU and segmentation quality, while adding class-aware components or reducing proposal count impairs both recall and practical performance (Jiao et al., 2022, Wilms et al., 2021).
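The "upper-bound" or oracle mIoU used in such ablations can be computed by picking, for every ground-truth mask, the best-overlapping proposal. The sketch below uses random proposals purely to show that this bound cannot decrease as more proposals are added; it says nothing about the downstream segmentation quality reported in the cited papers. The function name `oracle_miou` is hypothetical.

```python
import numpy as np

def oracle_miou(proposals, gt_masks):
    """Mean over ground-truth masks of the best IoU achieved by any proposal.
    proposals: (N, H, W) bool; gt_masks: (M, H, W) bool."""
    p = proposals.reshape(len(proposals), -1).astype(float)
    g = gt_masks.reshape(len(gt_masks), -1).astype(float)
    inter = g @ p.T                                    # (M, N) intersections
    union = g.sum(1, keepdims=True) + p.sum(1) - inter
    iou = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    return iou.max(axis=1).mean()

rng = np.random.default_rng(0)
proposals = rng.random((20, 8, 8)) > 0.5              # random stand-in proposals
gts = np.zeros((2, 8, 8), dtype=bool)
gts[0, :4, :4] = True
gts[1, 4:, 4:] = True

# Oracle mIoU using only the first n proposals.
curve = [oracle_miou(proposals[:n], gts) for n in (1, 5, 20)]
```

Since the maximum is taken over a growing set, `curve` is non-decreasing by construction, which is why larger N raises the attainable upper bound.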

6. Application Domains and Extensions

Class-agnostic mask proposals are deployed across diverse domains:

Few-shot and open-vocabulary segmentation: enabling matching to novel object classes with minimal adaptation (Jiao et al., 2022, Huang et al., 2022).
Medical image segmentation: decoupled mask/class heads in MaskMed increase Dice scores on AMOS and BTCV by up to +7.0%, confirming task-agnostic generalization (Xie et al., 19 Nov 2025).
3D scene parsing and open-vocabulary 3D instance segmentation: 2D mask tracking and geometric proposal fusion outperform purely geometric or 2D-only strategies on ScanNet and S3DIS (Nguyen et al., 2023, Nguyen et al., 2024).
Object removal and inpainting: class-agnostic masks tuned to maximize downstream inpainting quality rather than standard semantic segmentation metrics (Oh et al., 2023).
Weakly-supervised instance segmentation: using box- or image-level annotations augmented by class-agnostic mask generators to approach full-supervision accuracy (Xie et al., 2021, Wang et al., 2021).

7. Limitations and Future Directions

Key open challenges and directions include:

Redundant or over-segmented proposals in unconstrained scenarios (Any3DIS reduces this by ≈30%, but further improvement is needed) (Nguyen et al., 2024).
Quality of underlying backbone/segmenter; errors in 2D mask generation propagate irreducibly in 3D settings (Wang et al., 2 Nov 2025).
Heuristics for merging/selection (merge by size, drop by IoU) can be dataset- and domain-sensitive; learning-based integration may increase robustness, but is rarely class-agnostic (Wang et al., 2021).
Integration with weak language priors or generic objectness cues may support joint category-agnostic/open-vocabulary retrieval without degrading proposal purity (Nguyen et al., 2024, Huang et al., 2022).
Efficient scaling to long sequences in video/3D tracking, and replacing greedy, deterministic view selection with more global optimization (e.g., integer programming, beam search) (Nguyen et al., 2024).
Unified benchmarks and metrics for proposal recall, segmentation quality, and downstream retrieval in both 2D and 3D, supporting reproducible comparison across architectures and domains (Jiao et al., 2022, Wilms et al., 2021, Nguyen et al., 2024).
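As an illustration of the hand-tuned selection heuristics mentioned in the list above, the following sketch shows a greedy "drop by IoU" rule. The threshold value, the use of a per-mask score for ordering, and the function name `drop_duplicates` are assumptions for illustration and do not reproduce the post-processing of any cited system.

```python
import numpy as np

def drop_duplicates(masks, scores, iou_thresh=0.8):
    """Greedy suppression: visit masks by descending score and drop any mask
    whose IoU with an already kept mask exceeds iou_thresh.
    masks: (N, H, W) bool; scores: (N,) float. Returns kept indices."""
    kept = []
    for i in np.argsort(-scores):
        duplicate = False
        for j in kept:
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            if union > 0 and inter / union > iou_thresh:
                duplicate = True
                break
        if not duplicate:
            kept.append(int(i))
    return kept

m0 = np.zeros((4, 4), bool); m0[:3, :3] = True        # 9 pixels
m1 = m0.copy(); m1[0, 3] = True                       # 10 pixels, IoU 0.9 with m0
m2 = np.zeros((4, 4), bool); m2[3, 3] = True          # disjoint from both
masks = np.stack([m0, m1, m2])
scores = np.array([0.9, 0.8, 0.7])

kept = drop_duplicates(masks, scores)
```

The fixed threshold is exactly the kind of parameter that tends to need retuning per dataset, which is the sensitivity noted in the list.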

Class-agnostic mask proposal generation underpins many modern segmentation and recognition pipelines where flexibility, generality, and robustness to category shift are required. Current trends increasingly emphasize modularity, geometry-only supervision, and compositional integration with class-aware modules, and advances in this domain are central to scalable vision systems across science and engineering.