LVIS: A Dataset for Large Vocabulary Instance Segmentation (1908.03195v2)

Published 8 Aug 2019 in cs.CV

Abstract: Progress on object detection is enabled by datasets that focus the research community's attention on open challenges. This process led us from simple images to complex scenes and from bounding boxes to segmentation masks. In this work, we introduce LVIS (pronounced 'el-vis'): a new dataset for Large Vocabulary Instance Segmentation. We plan to collect ~2 million high-quality instance segmentation masks for over 1000 entry-level object categories in 164k images. Due to the Zipfian distribution of categories in natural images, LVIS naturally has a long tail of categories with few training samples. Given that state-of-the-art deep learning methods for object detection perform poorly in the low-sample regime, we believe that our dataset poses an important and exciting new scientific challenge. LVIS is available at http://www.lvisdataset.org.

Citations (1,204)

Summary

  • The paper presents LVIS, a dataset planned to contain approximately 2 million high-quality segmentation masks across more than 1,000 entry-level categories, creating a natural benchmark for low-shot learning.
  • The paper employs a multi-stage annotation pipeline within a federated dataset design, ensuring high-quality masks while preserving a realistic long-tailed distribution of object categories.
  • The paper validates LVIS annotation quality with strong mask IoU and boundary-quality results, surpassing the annotation quality of datasets such as COCO and ADE20K.

LVIS: A Dataset for Large Vocabulary Instance Segmentation

The paper presents the LVIS (Large Vocabulary Instance Segmentation) dataset, a new large-scale benchmark designed specifically for instance segmentation over a vast vocabulary of object categories. Datasets such as COCO and ImageNet have spearheaded advances in object detection and segmentation, but they focus on a limited number of categories, each with ample training data. LVIS, by contrast, covers more than 1,000 categories and emphasizes the need for models to learn efficiently from few examples, expanding the scope and ambition of instance segmentation research.

Dataset Overview and Collection Methodology

LVIS aims to nucleate research by providing a comprehensive and challenging dataset. The dataset is planned to contain approximately 2 million high-quality instance segmentation masks in 164,000 images, covering more than 1,000 object categories. To keep exhaustive annotation tractable at this scale, LVIS adopts a federated dataset design: the full dataset is the union of many smaller per-category datasets. For each category, annotations are exhaustive on a positive set of images and the category is verified absent on a negative set; on all remaining images its label state is left unknown. This design lets categories appear with varying frequencies, reflecting the long-tailed distribution found in natural images.
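
A practical consequence of the federated design is that a detector is evaluated on a category only for images where that category's presence or absence is actually known. The following is a minimal sketch of that filtering step, using toy data and a dictionary layout of our own invention rather than the official LVIS API:

```python
# Each category carries a positive set (exhaustively annotated) and a
# negative set (verified absent); its labels on all other images are unknown.
pos = {"couch": {1, 2}, "zebra": {3}}
neg = {"couch": {3}, "zebra": {1}}

detections = [
    {"image_id": 1, "category": "couch", "score": 0.9},
    {"image_id": 2, "category": "zebra", "score": 0.8},  # zebra unknown on image 2
    {"image_id": 3, "category": "couch", "score": 0.7},  # scored: verified negative
]

def scored_detections(dets, category):
    """Keep only detections of `category` on images where its label state is known."""
    known = pos[category] | neg[category]
    return [d for d in dets if d["category"] == category and d["image_id"] in known]

# The zebra detection on image 2 is simply ignored at evaluation time: it can
# be neither confirmed correct nor penalized as a false positive.
print(scored_detections(detections, "zebra"))  # -> []
```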

The dataset collection follows a multi-stage annotation pipeline:

  1. Object Spotting: Annotators mark one instance of each category they spot in an image.
  2. Exhaustive Instance Marking: All instances of a given category in the images are marked.
  3. Instance Segmentation: Detailed segmentation masks are created for the marked instances.
  4. Segment Verification: Quality of the segmentation masks is verified by multiple annotators.
  5. Full Recall Verification: Ensures all instances of a category are marked in positive set images.
  6. Negative Sets: Annotators verify that a given category does not appear in selected images, yielding per-category negative image sets.

The annotation process is designed to yield masks whose quality approaches that of expert annotators, surpassing the standard set by datasets such as COCO and ADE20K.

Unification with Existing Datasets

LVIS maintains continuity with COCO by adopting the same task and the same AP (average precision) metric for quantitative evaluation. However, LVIS introduces numerous categories beyond the 80 delineated by COCO, emphasizing entry-level categories that increase both breadth and depth. The category vocabulary is grounded in WordNet synsets to avoid synonym fragmentation, which would otherwise complicate annotation consistency and dataset usability.
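
To make the synset grounding concrete, the snippet below uses NLTK's WordNet interface (an illustrative assumption: the paper builds its vocabulary from WordNet directly, not necessarily through NLTK, and the corpus must be downloaded once):

```python
import nltk

nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

# "couch" and "sofa" are different strings but share a WordNet synset, so a
# synset-grounded vocabulary collapses them into one category instead of two.
couch = set(wn.synsets("couch", pos=wn.NOUN))
sofa = set(wn.synsets("sofa", pos=wn.NOUN))
print(sorted(s.name() for s in couch & sofa))  # ['sofa.n.01']
```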

Strong Numerical Results and Empirical Validation

The paper provides compelling numerical results affirming the quality and consistency of LVIS annotations:

  • The average mask intersection over union (IoU) shows high consistency (0.85) between repeated annotations, underscoring the repeatability of the annotation pipeline.
  • Compared with expert-annotated masks, LVIS annotations show higher IoU and better boundary quality than corresponding annotations in COCO and ADE20K.
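
The consistency figure above is a mask-level IoU. For reference, here is a generic NumPy implementation of mask IoU (a standard computation, not code from the paper):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks of identical shape."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # two empty masks are trivially identical
    return float(np.logical_and(a, b).sum()) / float(union)

# Two 4x4 masks with 4 foreground pixels each, overlapping in 2:
m1 = np.zeros((4, 4), dtype=bool); m1[0, :] = True
m2 = np.zeros((4, 4), dtype=bool); m2[0, 2:] = True; m2[1, :2] = True
print(mask_iou(m1, m2))  # 2 / 6 ≈ 0.333
```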

Furthermore, analysis of instance statistics illustrates LVIS's complexity:

  • The distribution of object categories per image is more diverse than in COCO.
  • LVIS includes numerous small objects and non-central placements, reflecting real-world image complexity.
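
As an illustration of how such per-image category statistics are derived from COCO/LVIS-style annotation records (toy records below; the paper's numbers come from the full annotation files):

```python
from collections import defaultdict

# One record per annotated instance, as in COCO-format JSON.
annotations = [
    {"image_id": 1, "category_id": 5},
    {"image_id": 1, "category_id": 7},
    {"image_id": 1, "category_id": 5},  # duplicate category on the same image
    {"image_id": 2, "category_id": 3},
]

cats_per_image = defaultdict(set)
for ann in annotations:
    cats_per_image[ann["image_id"]].add(ann["category_id"])

# Distinct categories per image; histograms of these counts give the
# per-image diversity statistic discussed above.
print({img: len(c) for img, c in cats_per_image.items()})  # {1: 2, 2: 1}
```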

Future Implications and Directions

LVIS establishes a new paradigm for instance segmentation research. The long-tailed distribution and significant presence of low-shot learning scenarios present novel challenges, fostering innovations in algorithms capable of generalizing from limited samples. By simulating realistic scenarios more closely aligned with natural image distributions, LVIS sets the stage for advancements in both theoretical and practical aspects of machine learning and computer vision.

The federated dataset approach, in particular, reduces annotation workload and enhances manageability without compromising category quality or coverage. Future developments might explore dynamic dataset extensions and adaptive annotation strategies tailored to evolving research needs.

Conclusion

LVIS is poised to shift the landscape of instance segmentation research. By providing a high-quality, large-vocabulary dataset, it enables a renewed focus on the low-shot learning regime and mirrors real-world challenges more accurately than previous datasets. This effort contributes a crucial resource that will likely drive forward the state of the art in computer vision, inviting algorithms that are more robust, versatile, and capable of handling the intricate diversity of visual objects encountered in the wild.