
COCONut: Modernizing COCO Segmentation (2404.08639v1)

Published 12 Apr 2024 in cs.CV

Abstract: In recent decades, the vision community has witnessed remarkable progress in visual recognition, partially owing to advancements in dataset benchmarks. Notably, the established COCO benchmark has propelled the development of modern detection and segmentation systems. However, the COCO segmentation benchmark has seen comparatively slow improvement over the last decade. Originally equipped with coarse polygon annotations for thing instances, it gradually incorporated coarse superpixel annotations for stuff regions, which were subsequently heuristically amalgamated to yield panoptic segmentation annotations. These annotations, executed by different groups of raters, have resulted not only in coarse segmentation masks but also in inconsistencies between segmentation types. In this study, we undertake a comprehensive reevaluation of the COCO segmentation annotations. By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks, we introduce COCONut, the COCO Next Universal segmenTation dataset. COCONut harmonizes segmentation annotations across semantic, instance, and panoptic segmentation with meticulously crafted high-quality masks, and establishes a robust benchmark for all segmentation tasks. To our knowledge, COCONut stands as the inaugural large-scale universal segmentation dataset, verified by human raters. We anticipate that the release of COCONut will significantly contribute to the community's ability to assess the progress of novel neural networks.

Authors (5)
  1. Xueqing Deng
  2. Qihang Yu
  3. Peng Wang
  4. Xiaohui Shen
  5. Liang-Chieh Chen
Citations (4)

Summary

COCONut: Enhancing Image Segmentation with High-Quality, Large-Scale Datasets

Introduction to COCONut Dataset

The Common Objects in Context (COCO) dataset has been a mainstay in computer vision research, particularly in the realms of object detection and image segmentation. Despite its widespread application, the growth of machine learning capabilities has outpaced the advancements in dataset quality, particularly concerning segmentation tasks. In response to this challenge, the authors introduce COCONut (COCO Next Universal segmenTation dataset), aiming to modernize COCO's segmentation capabilities by augmenting annotation quality and dataset size. This new dataset comprises approximately 383K images with over 5.18M panoptic masks, making it a comprehensive resource for semantic, instance, and panoptic segmentation tasks. COCONut is characterized by its high-quality, human-verified segmentation masks, offering a robust benchmark that promises to facilitate significant progress in image understanding tasks.
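To illustrate what "harmonized" annotations mean in practice, the sketch below derives instance and semantic views from a single COCO-style panoptic annotation, so all three task types come from one consistent source. The field names follow the public COCO panoptic format; the `is_thing` mapping is an assumption standing in for the dataset's category metadata.

```python
def derive_views(segments_info, is_thing):
    """Split one panoptic annotation into instance and semantic views.

    segments_info: list of {"id", "category_id"} dicts (COCO panoptic style).
    is_thing: maps category_id -> True for countable 'thing' classes.

    The instance view keeps one entry per 'thing' segment; the semantic
    view collapses all segments sharing a category into a single label,
    so both views agree with the panoptic masks by construction.
    """
    instances = [s for s in segments_info if is_thing[s["category_id"]]]
    semantic = sorted({s["category_id"] for s in segments_info})
    return instances, semantic
```

Because both views are projections of the same panoptic masks, the inconsistencies that arose in COCO from separately annotated thing polygons and stuff superpixels cannot occur.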

Reevaluation and Enhancement of Existing Annotations

COCONut's creation involved a thorough reevaluation of COCO's existing annotations, identifying several issues such as over-annotations, missing labels, and inaccurate segmentations — particularly in 'stuff' classes. The authors addressed these deficiencies by redesigning the annotation pipeline, integrating modern neural networks to generate initial annotations, which were then refined through a meticulous manual editing process. This approach significantly improved the consistency and quality of the resulting segmentation masks.

Data Splits and Sources

The COCONut dataset draws images from the original COCO dataset and Objects365, yielding a larger and more varied collection for training and validation. It is released in splits of increasing size: COCONut-S (small) with 118K images, COCONut-B (base) with 242K images, and COCONut-L (large) with 358K images. This tiered organization lets researchers select a split that matches their computational resources and experimental needs.

The Innovative Annotation Pipeline

Central to the success of COCONut is the novel annotation process, combining machine learning-generated proposals with human verification and refinement. This assisted-manual annotation pipeline significantly increases the efficiency of producing high-quality masks. It involves an initial automatic proposal generation, followed by a detailed human inspection and editing phase, leading to the final verification by experts. This pipeline ensures that COCONut's masks exhibit superior quality compared to their COCO counterparts, addressing intricate details and maintaining consistency across different segmentation tasks.
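The control flow of such an assisted-manual loop can be sketched as below. This is a schematic illustration of the three stages described above (machine proposal, rater inspection/editing, expert verification), not the authors' implementation; the callables and `MaskProposal` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MaskProposal:
    image_id: int
    mask: object          # e.g. a binary array or RLE string
    category: str
    accepted: bool = False
    edited: bool = False

def assisted_annotation(image_ids, propose, inspect, verify):
    """Assisted-manual loop: machine proposals are inspected and edited
    by human raters, then gated by expert verification. Masks that fail
    verification cycle back for another editing pass."""
    final: List[MaskProposal] = []
    for image_id in image_ids:
        for proposal in propose(image_id):   # neural-network proposals
            proposal = inspect(proposal)     # rater accepts or edits the mask
            while not verify(proposal):      # expert verification gate
                proposal = inspect(proposal) # failed masks go back for refinement
            final.append(proposal)
    return final
```

The key property of this design is that every released mask has passed a human gate, while the machine proposals keep per-mask annotation cost low.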

Dataset's Implications and Future Directions

The COCONut dataset's introduction has profound implications for both theoretical and practical aspects of computer vision research:

  1. Benchmarking and Model Evaluation: COCONut provides a much-needed platform for evaluating advanced neural network models, especially those requiring high-quality, diverse annotations for accurate segmentation tasks.
  2. Progress in Model Development: With its comprehensive coverage and high-quality annotations, COCONut is poised to drive advancements in segmentation model accuracy and efficiency.
  3. Research on Annotation Efficiency: The dataset's creation process offers valuable insights into balancing automated and manual annotation methods, contributing to ongoing discussions about dataset scalability and quality.

With its release, COCONut challenges the research community to leverage this rich resource for developing more sophisticated image understanding algorithms. Future work may involve exploring more efficient annotation techniques, expanding the dataset with new classes and images, and creating models that can exploit this dataset's depth and quality to set new benchmarks in image segmentation tasks.
