- The paper evaluates SAM for few-shot counting, revealing strengths in sparse scenes and limitations in densely populated images.
- It employs SAM’s dense feature maps and cosine similarity between mask-averaged feature vectors to identify and count target objects.
- Empirical findings on FSC-147 and MS-COCO show a roughly 2-10 point increase in MAE in crowded scenes, indicating the need for semantic fine-tuning.
An Empirical Evaluation of SAM in Few-Shot Object Counting
The paper under review critically evaluates the Segment Anything Model (SAM) by Meta AI in the domain of few-shot object counting. SAM has demonstrated substantial efficacy in class-agnostic image segmentation, prompting interest in its potential utility across varied visual tasks. The core question is whether SAM can count objects of previously unseen categories given only a handful of annotated exemplars, i.e., under few-shot conditions.
Methodological Overview
SAM is known for its proficiency in segmenting unlabelled images due to an extensive pre-training process utilizing over 1 billion segmentation masks derived from 11 million images. The methodology adopted in this paper leverages SAM’s image encodings to address the task of few-shot counting. The key steps include:
- Feature Extraction: SAM generates dense feature maps for the input images using its Vision Transformer (ViT-H) architecture.
- Segmentation and Masking: Bounding boxes serve as reference prompts to create segment masks over exemplar objects. These masks interact with dense features to form averaged feature vectors representing target objects.
- Object Distinction: Cosine similarity between the averaged feature vectors of predicted masks and the reference (exemplar) masks determines whether each predicted mask depicts the target category; matching masks are then counted.
The approach is notable for relying exclusively on SAM’s intrinsic features, without resorting to additional zero-shot detection or classification tools, which keeps computational overhead low.
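The pipeline described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: `threshold` is an assumed hyperparameter, and synthetic NumPy arrays stand in for SAM's actual encoder features and masks.

```python
import numpy as np

def count_by_similarity(dense_features, exemplar_masks, candidate_masks,
                        threshold=0.7):
    """Count candidate masks whose mask-averaged feature vector is
    cosine-similar to the exemplars' averaged feature vector.

    dense_features:  (H, W, C) feature map from the image encoder
    exemplar_masks:  list of (H, W) boolean masks over exemplar objects
    candidate_masks: list of (H, W) boolean masks proposed by the model
    """
    def masked_mean(mask):
        # Average the dense features over the masked region, then
        # unit-normalize so a dot product equals cosine similarity.
        v = dense_features[mask].mean(axis=0)
        return v / (np.linalg.norm(v) + 1e-8)

    # Reference vector: mean of the exemplars' averaged feature vectors.
    ref = np.mean([masked_mean(m) for m in exemplar_masks], axis=0)
    ref /= (np.linalg.norm(ref) + 1e-8)

    # Keep (and count) candidates that match the reference closely enough.
    count = 0
    for m in candidate_masks:
        if m.any() and float(masked_mean(m) @ ref) >= threshold:
            count += 1
    return count
```

In the real system the candidate masks come from SAM's automatic mask generation, and the exemplar masks from prompting SAM with the reference bounding boxes; the counting step itself reduces to this thresholded similarity test.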
Experimental Evaluation
Empirical assessments were conducted on two datasets: FSC-147 and MS-COCO. The former is explicitly designed for few-shot counting, comprising images with annotated object instances. The latter, COCO’s validation set, covers diverse object categories typical of detection and instance segmentation tasks.
- Quantitative Assessment: SAM’s outputs were compared against a spectrum of established few-shot counting methods, including FamNet, CFOCNet, and LaoNet. Results indicate that while SAM remains competitive in scenes with low object density, it performs suboptimally in densely populated scenes, incurring a roughly 2-10 point higher Mean Absolute Error (MAE).
- Qualitative Observations: Visual analyses show that SAM’s segmentation accuracy degrades on small or overlapping objects, often merging distinct instances into a single segmented mask.
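For context on the reported metric: MAE over a test set is simply the mean absolute difference between predicted and ground-truth counts. A minimal illustration with made-up counts:

```python
def mean_absolute_error(predicted_counts, true_counts):
    # MAE: average of |predicted - true| over all test images.
    assert len(predicted_counts) == len(true_counts)
    return sum(abs(p - t) for p, t in zip(predicted_counts, true_counts)) \
        / len(true_counts)

# Hypothetical per-image counts on three images:
print(mean_absolute_error([12, 48, 105], [10, 50, 100]))  # → 3.0
```

A 2-10 point gap in this metric means the model's per-image count is, on average, that many objects further from the ground truth than the baselines'.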
Discussion and Implications
The findings highlight core challenges in adapting SAM for precise object counting, particularly in densely cluttered scenes or those with small objects. While SAM’s segmentation capabilities are commendable, its difficulty separating closely packed instances points to a gap attributable to its class-agnostic, non-semantic mask training regime. The paper suggests avenues for improvement through fine-tuning or integration with semantically richer datasets.
Theoretically, this research suggests reconsidering SAM’s architecture and training methodology to incorporate semantic discernment. Practically, it indicates that SAM is better suited as a component of a few-shot counting system than as a standalone solution, opening avenues for integration with complementary detection models.
Conclusion
This paper provides a comprehensive assessment of SAM’s current limitations and potential in few-shot object counting tasks. As the field continues to evolve, such empirical studies are pivotal in guiding both the refinement of existing models and the design of new methods that bridge capability gaps in dynamic visual analysis. Future work may explore hybrid models that augment SAM’s segmentation prowess with object-specific recognition techniques.