COCO-ReM: Refined COCO Mask Annotations

Updated 17 June 2026

The paper introduces a three-stage, semi-automatic re-annotation pipeline that refines mask boundaries and corrects label inconsistencies.
The paper details extensive improvements, including precise mask edges, corrected occlusion handling, and additional instances, with AP gains of up to 7 points at high IoU thresholds.
The paper demonstrates that models trained on COCO-ReM converge faster and achieve superior accuracy, resulting in shifts in model rankings among high-performing object detectors.

COCO-ReM (COCO Refined Masks) is a comprehensive re-annotation of the COCO-2017 instance segmentation dataset, designed to address critical annotation errors and inaccuracies that have emerged as limiting factors in benchmarking high-performing object detectors. By refining mask boundaries, ensuring more exhaustive object coverage, and correcting label inconsistencies, COCO-ReM delivers a cleaner set of annotations with the goal of restoring the reliability of COCO benchmarking while maintaining compatibility with prior research conventions (Singh et al., 2024).

1. Motivation and Scope

COCO-2017, despite its foundational role in object detection research, contains pervasive errors affecting both model training and evaluation. Key issues include:

Imprecise mask boundaries: The polygon-based annotation interface uses straight-line segments to approximate object contours, resulting in coarse mask edges.
Absence of holes in masks: Annotation procedures precluded internal voids, so masks for objects like cup handles or scissor blades are often filled solidly, not reflecting true structure.
Inconsistent occlusion handling: There is no standardization; some masks are amodal (covering occluders), others modal (cutting around occluders).
Non-exhaustive instance coverage: Some images have incomplete instance annotations, and “grouped” masks may merge objects of a given class.
Near-duplicate masks across categories: Approximately 410 pairs in val-2017 (2.3% of instances) share IoU $>0.8$ but carry different category labels.
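The near-duplicate check in the last item can be sketched as follows. This is a minimal illustration, not the authors' tooling: the dictionary keys (`image_id`, `category`, `mask`) and the helper names are hypothetical, and masks are assumed to be boolean NumPy arrays of equal shape.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean masks of the same shape."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def near_duplicate_pairs(instances, iou_threshold=0.8):
    """Find same-image pairs that overlap heavily but carry different labels.

    `instances` is a list of dicts with hypothetical keys
    'image_id', 'category', and 'mask' (boolean array).
    """
    pairs = []
    for i in range(len(instances)):
        for j in range(i + 1, len(instances)):
            a, b = instances[i], instances[j]
            if a["image_id"] != b["image_id"] or a["category"] == b["category"]:
                continue
            iou = mask_iou(a["mask"], b["mask"])
            if iou > iou_threshold:
                pairs.append((i, j, iou))
    return pairs

def rect(r0, r1, c0, c1, shape=(10, 10)):
    """Toy helper: a filled rectangle as a boolean mask."""
    m = np.zeros(shape, dtype=bool)
    m[r0:r1, c0:c1] = True
    return m

toy = [
    {"image_id": 1, "category": "cup", "mask": rect(0, 10, 0, 10)},
    {"image_id": 1, "category": "bowl", "mask": rect(0, 10, 0, 9)},
    {"image_id": 1, "category": "cup", "mask": rect(0, 5, 0, 5)},
]
dupes = near_duplicate_pairs(toy)  # only the cup/bowl pair exceeds IoU 0.8
```

The quadratic pair loop is fine for a toy example; on a full split one would first group instances by image.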

These deficiencies degrade detector performance and evaluation, particularly as model Mask AP surpasses 70%. Addressing them while preserving COCO’s 80-class structure and open-source flexibility was prioritized to ensure research continuity (Singh et al., 2024).

2. Annotation Correction Methodology

COCO-ReM employs a three-stage, semi-automatic pipeline to revise both train and validation splits:

Stage 1: Mask Boundary Refinement

Utilizes SAM (Segment Anything Model) prompting with bounding boxes and error-region points for each mask (10 repeats per mask, majority-voted pixel labels).
Manual inspection and interactive correction for all ∼36K val masks, with ∼900 requiring further intervention for precision, targeting categories with frequent occlusions or holes.
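The majority-vote step can be illustrated without calling SAM itself. The sketch below assumes the repeated runs have already produced a stack of binary masks; the tie rule (ties go to background) is an assumption, since the text does not specify one.

```python
import numpy as np

def majority_vote(masks) -> np.ndarray:
    """Fuse binary masks of shape (n_runs, H, W) by per-pixel majority.

    Assumption: a pixel is foreground only with a strict majority of votes.
    """
    masks = np.asarray(masks, dtype=bool)
    votes = masks.sum(axis=0)
    return votes * 2 > masks.shape[0]

# Toy stand-in for 10 repeated runs: each run flips one pixel of a clean mask.
clean = np.zeros((8, 8), dtype=bool)
clean[2:6, 2:6] = True
runs = np.repeat(clean[None], 10, axis=0)
for k in range(10):
    runs[k, k % 8, 0] ^= True  # per-run noise in column 0

fused = majority_vote(runs)  # isolated per-run errors are voted out
```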

Stage 2: Exhaustive Instance Annotation

LVIS imports: Where LVIS provides more non-crowd instances for an (image, category) pair, the more complete LVIS annotations replace COCO’s.
Model-assisted discovery: An ensemble of three LVIS-trained ViTDet models proposes high-confidence extra masks, with manual review yielding 1,056 new val instances, each with refined boundaries.
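The LVIS import rule amounts to a per-(image, category) comparison of instance counts. The following sketch makes several assumptions that the text does not state: LVIS category ids are already mapped to COCO's 80 classes, COCO crowd annotations are kept unchanged, and the `source` field exists only to make the toy output readable.

```python
from collections import defaultdict

def group_non_crowd(annotations):
    """Group non-crowd annotations by (image_id, category_id)."""
    groups = defaultdict(list)
    for ann in annotations:
        if not ann.get("iscrowd", 0):
            groups[(ann["image_id"], ann["category_id"])].append(ann)
    return groups

def import_from_lvis(coco_anns, lvis_anns):
    """Prefer LVIS for a pair only when it has more non-crowd instances."""
    coco_groups = group_non_crowd(coco_anns)
    lvis_groups = group_non_crowd(lvis_anns)
    merged = [a for a in coco_anns if a.get("iscrowd", 0)]  # assumption: keep crowds
    for key in sorted(set(coco_groups) | set(lvis_groups)):
        coco_set = coco_groups.get(key, [])
        lvis_set = lvis_groups.get(key, [])
        merged.extend(lvis_set if len(lvis_set) > len(coco_set) else coco_set)
    return merged

coco = [
    {"image_id": 1, "category_id": 1, "source": "coco"},
    {"image_id": 1, "category_id": 2, "source": "coco"},
    {"image_id": 1, "category_id": 2, "source": "coco"},
]
lvis = [
    {"image_id": 1, "category_id": 1, "source": "lvis"},
    {"image_id": 1, "category_id": 1, "source": "lvis"},
    {"image_id": 1, "category_id": 1, "source": "lvis"},
    {"image_id": 1, "category_id": 2, "source": "lvis"},
]
merged = import_from_lvis(coco, lvis)
sources = [a["source"] for a in merged]
```

In the toy data, category 1 comes from LVIS (three instances against one) and category 2 stays with COCO (two against one).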

Stage 3: Label Correction (val only)

Manual disambiguation of near-duplicate, high-IoU category masks.
Remaining grouped-instance masks (~100) retagged as "crowd".

The validation split is fully author-verified; automated-only steps are applied to the training split (Singh et al., 2024).

3. Dataset Characteristics and Format

COCO-ReM increases both accuracy and exhaustiveness relative to COCO-2017. Key statistics:

Split	COCO-2017 instances	COCO-ReM instances
val	36,781	40,689
train	860,001	1,093,027

Median IoU between COCO-ReM val masks and source (COCO/LVIS) ≈ 0.97: the majority of shapes persist, with boundary definition significantly sharpened.
Mask “holes”: ≈2,000 val masks now include true holes, compared to zero in COCO.
Annotation Quality: Strong boundary displacement reductions; AP gains of 5–7 points at IoU thresholds ≥0.75.
Distribution and Format: Available at https://cocorem.xyz. Annotations are in standard COCO JSON with polygons replaced by bitmasks. All images remain under the original Flickr-CC license (Singh et al., 2024).
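Storing masks as bitmasks rather than polygons is what allows holes to be represented. As a rough illustration, and assuming the bitmasks use a COCO-style run-length encoding (the text above does not specify the encoding), an uncompressed RLE can be decoded with plain NumPy. The function name and the toy ring mask are illustrative only; COCO tooling such as pycocotools ships its own decoder for the compressed form.

```python
import numpy as np

def decode_uncompressed_rle(rle) -> np.ndarray:
    """Decode {'size': [H, W], 'counts': [...]} into an (H, W) uint8 mask.

    Runs alternate between 0s and 1s, start with 0s, and are laid out
    in column-major order, following the COCO convention.
    """
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, value = 0, 0
    for run in rle["counts"]:
        flat[pos:pos + run] = value
        pos += run
        value = 1 - value
    return flat.reshape((h, w), order="F")

# A 3x3 ring on a 5x5 canvas: the centre pixel is a hole.
ring = {"size": [5, 5], "counts": [6, 3, 2, 1, 1, 1, 2, 3, 6]}
mask = decode_uncompressed_rle(ring)
```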

4. Evaluation Metrics and Sensitivity

COCO-ReM retains the original COCO evaluation protocol:

IoU calculation: $\operatorname{IoU}(G,P) = \frac{|G \cap P|}{|G \cup P|}$
TP/FP/FN at threshold $t$
Precision–Recall: $P(R;t) = \frac{TP(t)}{TP(t)+FP(t)}$ , with recall $R = \frac{TP(t)}{TP(t)+FN(t)}$
Average Precision: $AP_t = \int_0^1 P(R;t)dR$
COCO mAP: $AP = \frac{1}{10}\sum_{k=1}^{10} AP_{0.45 + 0.05k}$ , with special points $AP_{50}$ , $AP_{75}$
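The definitions above can be traced on a toy example. The sketch below assumes each prediction has already been matched one-to-one to a ground-truth mask, so a prediction is a true positive when its matched IoU reaches the threshold. It integrates the precision envelope directly over recall; the official COCO evaluation instead samples precision at 101 recall points, so the values are illustrative rather than identical.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt) -> float:
    """Area under the precision-recall curve at a single IoU threshold."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_tp, dtype=bool)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(~tp)
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt
    # Make precision non-increasing in recall (the usual envelope).
    for i in range(len(precision) - 2, -1, -1):
        precision[i] = max(precision[i], precision[i + 1])
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

# Toy data: four predictions, four ground-truth masks, pre-matched IoUs.
scores = [0.9, 0.8, 0.7, 0.6]
matched_iou = np.array([0.97, 0.82, 0.62, 0.30])
num_gt = 4

def ap_at(t):
    return average_precision(scores, matched_iou >= t, num_gt)

ap50 = ap_at(0.50)
ap75 = ap_at(0.75)
thresholds = [0.45 + 0.05 * k for k in range(1, 11)]  # 0.50, 0.55, ..., 0.95
coco_map = float(np.mean([ap_at(t) for t in thresholds]))
```

Here three predictions count at IoU 0.50 but only one survives at 0.95, which is why the averaged AP sits well below AP50.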

No new metrics are introduced, but COCO-ReM’s higher mask fidelity sharpens sensitivity to boundary accuracy, especially at higher IoU thresholds (Singh et al., 2024).

5. Experimental Results and Comparative Analysis

Extensive benchmarking of 50 object detectors across COCO-ReM and COCO-2017 reveals:

AP Improvements Across Architectures: Every model class—region-based (Mask R-CNN, Cascade R-CNN, ViTDet) and query-based (Mask2Former, OneFormer)—improves in AP. State-of-the-art OneFormer InternImage-H shows a +7.7 AP gain.
Model Ranking Shifts: Query-based models overtake region-based models on COCO-ReM, consistent with human preference for sharper masks.
Threshold Sensitivity: At IoU=0.5, AP increases modestly (+0.2–0.5); at IoU≥0.75, gains reach +3–8, confirming improvements are primarily due to more precise boundaries.
Category-specific Effects: 69/80 categories see AP gains, largest for fine detail (e.g., fork, carrot, orange). 11 categories show decreased AP, mainly where prior “no-hole” annotation bias penalized correct predictions.
Annotation Stages (Ablation):
- Stage 1 (boundary refinement) alone accounts for ~7.7 AP previously lost to annotation noise.
- Stage 2 (exhaustive annotation) adds little net AP; AP drops slightly because the ground-truth set grows.
Training Efficiency:
- Models trained on COCO-ReM converge faster and achieve higher AP at equivalent or smaller model size.
- Example: Mask R-CNN ViTDet-B: 100 epochs on COCO-2017 yields AP ≈ 49.1; COCO-ReM achieves AP ≈ 52.4. COCO-ReM-trained models match the performance of much larger (3×) models trained on COCO-2017 (Singh et al., 2024).
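The threshold sensitivity reported above can be made concrete with a toy case: a 20×20 square prediction shifted sideways by a few pixels against its ground truth. The shapes and shifts are arbitrary choices for illustration and are not taken from the paper.

```python
import numpy as np

def mask_iou(a, b) -> float:
    """IoU of two boolean masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter / union) if union else 0.0

def shifted_square_iou(shift, side=20, canvas=40) -> float:
    """IoU between a square and a copy shifted right by `shift` pixels."""
    gt = np.zeros((canvas, canvas), dtype=bool)
    gt[10:10 + side, 10:10 + side] = True
    pred = np.zeros_like(gt)
    pred[10:10 + side, 10 + shift:10 + shift + side] = True
    return mask_iou(gt, pred)

thresholds = np.linspace(0.50, 0.95, 10)  # the ten COCO IoU thresholds
# For each shift, count how many thresholds the prediction still passes.
passed = {s: int((shifted_square_iou(s) >= thresholds).sum()) for s in (0, 1, 2, 4)}
```

A 2-pixel shift still passes at IoU 0.50 and 0.75 but fails from 0.85 upward, so boundary quality mostly shows up in the high-threshold terms of the average.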

6. Best Practices, Limitations, and Future Directions

Recommended Usage:

Employ COCO-ReM masks for both training and validation.
Retain original image splits and hyperparameters; no architecture or evaluation code change needed.

Limitations:

SAM may hallucinate small spurious components. Niche, fine-detail categories may benefit from additional review.
COCO-ReM does not re-annotate the private test set; thus, official test-dev still uses original COCO-2017 masks.

Refinement Directions:

Extension to COCO stuff and panoptic splits.
Human re-annotation focusing on the categories with the largest boundary outliers (~500 instances with IoU<0.8).
Broadening of external annotation imports (e.g., OpenImages) to increase exhaustiveness.
Community-driven corrections via the public GitHub repository (https://github.com/kdexd/coco-rem) (Singh et al., 2024).

Significance:

COCO-ReM provides mask annotations significantly closer to perceptual ground truth, rectifies critical ranking pathologies in high-performing segmenters, and substantially improves both detector training efficiency and final benchmark AP—all without disrupting prevalent codebases or evaluation methodologies. The adoption of COCO-ReM is advised for future research in object detection and instance segmentation (Singh et al., 2024).

References

Singh et al. (2024). Benchmarking Object Detectors with COCO: A New Path Forward.
