Instance Proposal Pipelines

Updated 29 March 2026

Instance proposal pipelines are modular workflows that generate candidate regions or point sets in images and point clouds using both top-down and bottom-up strategies.
These pipelines integrate backbone feature extraction, proposal generation, and clustering modules to enhance segmentation accuracy and efficiency.
They are evaluated with metrics such as average precision (AP) and recall at fixed IoU thresholds, and have been extended to video, 3D point clouds, and biomedical imagery.

An instance proposal pipeline is a modular architectural and algorithmic workflow that generates candidate regions, shapes, or point sets within an image or point cloud which are likely to correspond to object instances. This concept underpins modern instance segmentation for 2D and 3D visual data, enabling downstream tasks to localize, classify, and delineate distinct entities. Approaches span both top-down paradigms, where objectness is hypothesized and refined, and bottom-up paradigms that cluster elements or points into instances directly. Recent research integrates these philosophies with learned embeddings, clustering, generative modeling, and novel rectification or association mechanisms to improve accuracy, efficiency, and generalization.

1. Architectural Paradigms and Classifications

Instance proposal pipelines can be classified along several axes, including modality, architecture, and clustering strategy:

Top-down (proposal-based) methods generate coarse candidate regions via detector-like mechanisms, often followed by mask prediction or refinement. Methods such as Mask R-CNN and derivatives operate on region proposals using ROIAlign, with modules designed to improve proposal localization or mask completeness (Zhangli et al., 2022).
Bottom-up (segmentation-then-grouping) methods first segment pixels or points by semantic class, then cluster these using learned embeddings or similarity metrics, as in EmbedMask and SGPN (Ying et al., 2019, Wang et al., 2017).
Hybrid methods exploit both explicit proposals and embedding-based or generative strategies, enabling high-resolution masks and robust instance association (Ying et al., 2019, Yi et al., 2018).
3D and video extensions apply analogous principles to point clouds (e.g., GSPN, 3D-MPA) or spatiotemporal data (e.g., InsPro, EfficientVIS), introducing mechanisms for temporal correspondence and shape priors (He et al., 2023, Wu et al., 2022, Yi et al., 2018, Engelmann et al., 2020, Mei et al., 2023).

This diversity reflects progress toward increased proposal precision, flexibility, and task generalization.

2. Core Algorithmic Components

Nearly all instance proposal pipelines share several key algorithmic modules:

Backbone Feature Extractor: Convolutional or transformer-based networks (e.g., ResNet-FPN, U-Net, PointNet++) produce multi-scale features suitable for dense prediction (Ying et al., 2019, Upschulte et al., 2021, Yi et al., 2018).
Proposal/Query Generation: Methods either hypothesize candidate regions (boxes, contours, seed points, or embeddings) or propagate learned queries across space or time. For example:
- Box proposals via anchor-based RPNs or anchor-free heads (Zhangli et al., 2022, Liang et al., 2015).
- Seed point clustering (e.g., Bubble Shifting in PANet (Mei et al., 2023)) or voting (3D-MPA (Engelmann et al., 2020)).
- Embedding proposals such as pixel/proposal pairs (EmbedMask (Ying et al., 2019)) or similarity graphs (SGPN (Wang et al., 2017)).
- Generative proposals using CVAE decoders in GSPN (Yi et al., 2018).
Clustering/Assignment: Embedding distance clustering, connected-components analysis, or dynamic assignment strategies assign elements to proposals. For instance:
- Gaussian embedding coupling in EmbedMask (Ying et al., 2019).
- Thresholded similarity matrices in SGPN (Wang et al., 2017).
- Spatial graph clustering for 3D proposals (Mei et al., 2023, Engelmann et al., 2020).
- Temporal query-propagation and slot matching in video pipelines (He et al., 2023, Wu et al., 2022).
Mask/Contour/Box Refinement: Modules refine initial proposals for higher-quality mask outputs. Examples include:
- Progressive ROIAlign and attentive FFN for proposal rectification (Zhangli et al., 2022).
- Contour regression via Fourier descriptors in CPN (Upschulte et al., 2021).
- Denoising autoencoders for dominant instance extraction (R2-IOS (Liang et al., 2015)).
Non-maximum Suppression, Consensus Grouping, or Proposal Aggregation: Classical NMS, DBSCAN in feature space, or learned affinity aggregation is used to filter or merge proposals into final instances (Wang et al., 2017, Engelmann et al., 2020, Mei et al., 2023).
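The classical NMS step mentioned in the last item can be sketched as follows. This is a minimal, standard-library illustration of greedy IoU-based suppression on axis-aligned boxes, not the implementation of any cited method; the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions made for the example.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical proposals: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Because suppression is a hard decision, a lower-scoring but valid proposal that overlaps a kept one is discarded outright, which is the behavior that grouping-based aggregation tries to avoid.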

3. Representative Pipelines and Their Mechanisms

The following table summarizes representative approaches from the literature and illustrates the diversity of instance proposal pipelines:

Method / Paper	Proposal Representation	Clustering, Assignment & Post-processing
EmbedMask (Ying et al., 2019)	Box + embedding pair	Gaussian-embedding coupling, pixel-to-proposal assignment, no ROI pooling
CPN (Upschulte et al., 2021)	Pixel-anchored contour	Fourier descriptor regression, bounding box NMS, contour-based mask rasterization
SGPN (Wang et al., 2017)	Point similarity rows	Thresholded per-row groups, IoU-NMS, semantic mode assignment
PANet (Mei et al., 2023)	Voxel seed points	Bubble Shifting (graph mean), CCL, affinity-based instance merging
GSPN (Yi et al., 2018)	Generative shapes	CVAE shape synthesis, objectness scoring, 3D NMS, ROIAlign
EfficientVIS (Wu et al., 2022)	Video query-proposals	Temporal–spatial attention, RoI-tube dynamic conv, learned clip-to-clip linking
R2-IOS (Liang et al., 2015)	Recursive box proposals	Proposal refinement, segmentation-aware features, denoising autoencoder for mask selection

Each system optimizes for a different combination of proposal fidelity, efficiency, and robustness, shaping both architectural and loss function choices.
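As an illustration of the "thresholded per-row groups, IoU-NMS" entry listed for SGPN in the table, the sketch below turns each row of a pairwise embedding-distance matrix into a group proposal and then removes near-duplicate groups by set IoU. It is a simplified reading of that description under stated assumptions: the distance matrix, per-row confidence values, thresholds, and function names are all hypothetical, and smaller distance is assumed to mean greater similarity.

```python
def set_iou(a, b):
    """IoU between two sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def row_proposals(dist, threshold):
    """Row i proposes the group of all points j with dist[i][j] < threshold."""
    return [frozenset(j for j, d in enumerate(row) if d < threshold) for row in dist]


def merge_proposals(proposals, confidence, iou_threshold=0.5):
    """Greedy NMS over group proposals, using set IoU instead of box IoU."""
    order = sorted(range(len(proposals)), key=lambda i: confidence[i], reverse=True)
    kept = []
    for i in order:
        if proposals[i] and all(set_iou(proposals[i], k) <= iou_threshold for k in kept):
            kept.append(proposals[i])
    return kept


# Toy 1-D embeddings: points 0-2 form one instance, points 3-4 another.
embeddings = [0.0, 0.1, 0.2, 5.0, 5.1]
dist = [[abs(p - q) for q in embeddings] for p in embeddings]
confidence = [0.9, 0.8, 0.7, 0.95, 0.6]  # assumed per-row proposal confidence

instances = merge_proposals(row_proposals(dist, threshold=0.5), confidence)
```

Each point generates its own proposal, so most proposals are duplicates; the merge step is what reduces five row proposals to two instances in this toy case.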

4. Losses and Training Objectives

Correct supervision of proposals and clustering is crucial. Popular loss components include:

Detection/classification losses: e.g., focal loss, cross-entropy for class and objectness scores (Ying et al., 2019, Liang et al., 2015).
Box/contour regression: robust L1 (Smooth-L1), IoU-based, or chamfer distance for proposals or shape synthesis (Yi et al., 2018, Ying et al., 2019, Zhangli et al., 2022).
Embedding similarity/affinity: double-hinge for enforcing intra/inter-instance separation (Wang et al., 2017), or Lovász-hinge for mask label probability (Ying et al., 2019).
Auxiliary constraints: e.g., smoothness losses for embedding margins (Ying et al., 2019), regularization of Fourier descriptors (Upschulte et al., 2021), KL divergence in generative networks (Yi et al., 2018).
Grouping/assignment-specific losses: e.g., Hungarian assignment cost in video or 3D pipelines, affinity aggregation for merging fragments (Wu et al., 2022, Mei et al., 2023).
Multi-task combination: Typical pipelines sum per-module losses with empirically-tuned weights or annealing schedules (Ying et al., 2019, Liang et al., 2015).

This composite loss landscape guides training towards both precise proposal localization and reliable instance separation.
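The double-hinge idea for embedding similarity can be sketched per pair of points. This is an assumed form in the spirit of the description above, not a verified reproduction of the SGPN loss: pairs from the same instance are pulled together, while pairs from different instances are pushed apart up to a margin, with a smaller margin `k1` and extra weight `alpha` when the two points share a semantic class. The margins, the weight, and the use of Euclidean distance are assumptions of this example.

```python
import math


def double_hinge_pair_loss(f_i, f_j, same_instance, same_class,
                           k1=1.0, k2=2.0, alpha=2.0):
    """Pairwise embedding loss with two hinge margins (assumed form)."""
    d = math.dist(f_i, f_j)  # Euclidean distance between the two embeddings
    if same_instance:
        return d  # pull embeddings of the same instance together
    if same_class:
        # Same class, different instance: weighted hinge with margin k1.
        return alpha * max(0.0, k1 - d)
    # Different class: hinge with the larger margin k2.
    return max(0.0, k2 - d)


# Hypothetical 2-D embeddings.
pull = double_hinge_pair_loss((0.0, 0.0), (0.5, 0.0), True, True)
push_same_class = double_hinge_pair_loss((0.0, 0.0), (0.5, 0.0), False, True)
push_diff_class = double_hinge_pair_loss((0.0, 0.0), (3.0, 0.0), False, False)
```

In a full pipeline, a term like this would be averaged over sampled pairs and added to the detection and regression losses with a tuned weight.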

5. Evaluation, Diagnostics, and Pitfalls

Instance proposal pipelines are primarily evaluated via recall at fixed IoU thresholds and average precision (AP, computed on masks or boxes); for segmentation specifically, mean F1 or panoptic quality (PQ) is also reported. Key protocol considerations and pitfalls include:

Evaluation metric bias: As shown by Chavali et al., metrics based on partially-annotated datasets are susceptible to “gameable” behavior by detectors trained for benchmark categories. Recall and AR may not reflect category independence unless a nearly-fully annotated or cross-dataset protocol is used (Chavali et al., 2015).
Bias capacity diagnostic: By varying the training object vocabulary, one can measure how dependent proposal performance is on seen categories, discriminating true category-agnostic proposal generators from detector-masquerading pipelines (Chavali et al., 2015).
Cross-domain, cross-dataset generalization: Robust pipelines maintain recall for both core and non-core categories, showing limited drop when evaluated on broader annotations (Chavali et al., 2015, Upschulte et al., 2021).
Grouping vs. NMS tradeoff: Methods like 3D-MPA avoid discarding valid proposals by grouping proposals in a learned feature space, circumventing the hard suppression artifacts of NMS (Engelmann et al., 2020).

Proper evaluation and ablation are essential for fair assessment and to avoid overfitting to annotation artifacts.
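Recall at an IoU threshold, the basic proposal metric discussed above, can be computed as in the following sketch. Masks are represented as sets of pixel indices purely for brevity; the function names and the toy data are assumptions of this example.

```python
def mask_iou(a, b):
    """IoU between two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recall_at_iou(gt_masks, proposals, iou_threshold=0.5):
    """Fraction of annotated instances covered by at least one proposal."""
    if not gt_masks:
        return 0.0
    hits = sum(
        1 for g in gt_masks
        if any(mask_iou(g, p) >= iou_threshold for p in proposals)
    )
    return hits / len(gt_masks)


# Toy example: the first instance is well covered, the second only partially.
gt_masks = [{0, 1, 2, 3}, {10, 11}]
proposals = [{0, 1, 2}, {11, 12, 13}]

recall_strict = recall_at_iou(gt_masks, proposals, iou_threshold=0.5)
recall_loose = recall_at_iou(gt_masks, proposals, iou_threshold=0.2)
```

Only annotated instances enter the denominator, so a proposal on an unannotated object neither helps nor hurts. This is why recall measured on partially annotated datasets can favor pipelines tuned to the annotated categories.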

6. Extensions to New Modalities and Domains

Instance proposal pipelines have been extensively adapted to new settings:

Video instance segmentation: Query-proposal propagation and implicit instance association across frames leverage temporal consistency without requiring explicit tracking heads or post-hoc matching (He et al., 2023, Wu et al., 2022).
3D point cloud and LiDAR segmentation: Methods such as SGPN, GSPN, 3D-MPA, and PANet generalize proposal generation to unstructured domains through learned similarity, generative modeling, spatial graph clustering, or explicit seed voting (Yi et al., 2018, Engelmann et al., 2020, Wang et al., 2017, Mei et al., 2023).
Domain-adapted architectures: CPN introduces per-pixel contour regression via Fourier descriptors for handling highly irregular biomedical structures, outperforming classical U-Nets on several cell datasets (Upschulte et al., 2021).
Proposal rectification: Modules such as RPR specifically address over-crop and incomplete-mask pathologies in biological or fine-grained imagery (Zhangli et al., 2022).

These modalities require tailored proposal representations but fundamentally share the instance proposal pipeline structure.
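To make the contour-based representation concrete, the sketch below reconstructs a closed contour from Fourier coefficients. The text does not give CPN's exact parameterization, so this assumes a generic truncated Fourier series for x(t) and y(t) with t in [0, 1); the coefficient names (a0, c0, a, b, c, d) and the sampling scheme are hypothetical.

```python
import math


def fourier_contour(a0, c0, a, b, c, d, num_points=64):
    """Sample a closed contour from truncated Fourier series coefficients.

    x(t) = a0 + sum_n a[n] * sin(2*pi*n*t) + b[n] * cos(2*pi*n*t)
    y(t) = c0 + sum_n c[n] * sin(2*pi*n*t) + d[n] * cos(2*pi*n*t)
    """
    points = []
    for k in range(num_points):
        t = k / num_points
        x, y = a0, c0
        for n, (an, bn, cn, dn) in enumerate(zip(a, b, c, d), start=1):
            w = 2.0 * math.pi * n * t
            x += an * math.sin(w) + bn * math.cos(w)
            y += cn * math.sin(w) + dn * math.cos(w)
        points.append((x, y))
    return points


# With a single harmonic, these coefficients trace a circle of radius 2
# centred on (5, 5); higher orders add progressively finer shape detail.
circle = fourier_contour(5.0, 5.0, a=[0.0], b=[2.0], c=[2.0], d=[0.0], num_points=8)
```

A fixed-length coefficient vector per pixel is what makes this representation regressable by a dense prediction head, and the sampled polygon can then be rasterized into a mask.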

7. Trends, Open Directions, and Best Practices

A clear trend is the merger of top-down accuracy and bottom-up flexibility: architectures increasingly leverage shared embedding spaces, generative reconstructions, or proposal-query association mechanisms to produce high-fidelity, category-agnostic proposals at real-time speeds. Best practices established in the literature include:

Utilizing nearly-fully annotated or densely labeled benchmarks for evaluation (Chavali et al., 2015).
Performing cross-category and cross-dataset generalization studies to expose overfitting.
Adopting multi-headed architectures enabling plug-and-play adaptation to various backbone networks and domains (Upschulte et al., 2021, Mei et al., 2023).
Using explicit clustering or grouping, rather than suppression, to aggregate hypotheses and prevent proposal loss (Engelmann et al., 2020, Yi et al., 2018).
Incorporating domain-specific priors (e.g., contour closure, spatial affinity, temporal continuity).

Continued progress in instance proposal pipelines is expected to arise from advances in universal embeddings, geometric priors, and efficient cross-modality transfer.