SNIPER: Efficient Multi-Scale Training (1805.09300v3)

Published 23 May 2018 in cs.CV

Abstract: We present SNIPER, an algorithm for performing efficient multi-scale training in instance level visual recognition tasks. Instead of processing every pixel in an image pyramid, SNIPER processes context regions around ground-truth instances (referred to as chips) at the appropriate scale. For background sampling, these context-regions are generated using proposals extracted from a region proposal network trained with a short learning schedule. Hence, the number of chips generated per image during training adaptively changes based on the scene complexity. SNIPER only processes 30% more pixels compared to the commonly used single scale training at 800x1333 pixels on the COCO dataset. But, it also observes samples from extreme resolutions of the image pyramid, like 1400x2000 pixels. As SNIPER operates on resampled low resolution chips (512x512 pixels), it can have a batch size as large as 20 on a single GPU even with a ResNet-101 backbone. Therefore it can benefit from batch-normalization during training without the need for synchronizing batch-normalization statistics across GPUs. SNIPER brings training of instance level recognition tasks like object detection closer to the protocol for image classification and suggests that the commonly accepted guideline that it is important to train on high resolution images for instance level visual recognition tasks might not be correct. Our implementation based on Faster-RCNN with a ResNet-101 backbone obtains an mAP of 47.6% on the COCO dataset for bounding box detection and can process 5 images per second during inference with a single GPU. Code is available at https://github.com/MahyarNajibi/SNIPER/.

Summary

The paper introduces a chip-based training strategy that processes only essential image regions for instance-level recognition.
It samples chips at multiple scales and uses a region proposal network to pick background regions, so the number of chips per image, and with it the computational load, adapts to scene complexity.
Experiments on COCO show 47.6% mAP for bounding-box detection and an inference speed of five images per second on a single GPU.

Overview of SNIPER: Efficient Multi-Scale Training

The SNIPER algorithm, as introduced by Bharat Singh, Mahyar Najibi, and Larry S. Davis, proposes a novel approach to multi-scale training for instance-level visual recognition tasks, focusing on efficiency without sacrificing performance. The primary innovation lies in processing only essential context regions, referred to as "chips," from image pyramids. By concentrating on chips surrounding ground-truth instances and utilizing a region proposal network (RPN) for background sampling, SNIPER adapts its computational load based on scene complexity.

Key Concepts and Methodology

SNIPER addresses the inefficiency of traditional multi-scale training, in which every pixel is processed at every scale of the image pyramid. SNIPER instead processes only selected regions, which significantly reduces computational cost. On the COCO dataset, it processes only 30% more pixels than single-scale training at 800x1333 pixels, yet it still observes samples from extreme resolutions of the image pyramid, such as 1400x2000 pixels.
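The idea of covering ground-truth instances with fixed-size chips at one pyramid scale can be sketched as follows. This is a minimal illustration, not the authors' implementation: the greedy covering strategy, the 32-pixel grid stride, the `min_side`/`max_side` size filter standing in for "appropriate scale", and all function and class names are assumptions made for the example. Only the 512x512 chip size comes from the text.

```python
from dataclasses import dataclass

CHIP = 512  # chip side in pixels, as stated in the text


@dataclass(frozen=True)
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


def scale_box(b, s):
    """Map a box from original-image to scaled-image coordinates."""
    return Box(b.x1 * s, b.y1 * s, b.x2 * s, b.y2 * s)


def inside(b, cx, cy):
    """True if box b lies fully inside the CHIP x CHIP window at (cx, cy)."""
    return b.x1 >= cx and b.y1 >= cy and b.x2 <= cx + CHIP and b.y2 <= cy + CHIP


def grid(extent, stride):
    """Candidate chip offsets along one axis, always including the last one."""
    max_off = max(int(extent) - CHIP, 0)
    offsets = list(range(0, max_off + 1, stride))
    if offsets[-1] != max_off:
        offsets.append(max_off)
    return offsets


def select_chips(boxes, img_w, img_h, scale, stride=32, min_side=0, max_side=CHIP):
    """Greedily pick chips that cover the ground-truth boxes at one scale.

    stride, min_side and max_side are hypothetical parameters: a box is
    handled at this scale only if its longer scaled side is in
    [min_side, max_side]. Returns chip origins in scaled-image coordinates.
    """
    scaled = [scale_box(b, scale) for b in boxes]
    remaining = {
        i for i, b in enumerate(scaled)
        if min_side <= max(b.x2 - b.x1, b.y2 - b.y1) <= max_side
    }
    candidates = [(cx, cy)
                  for cy in grid(img_h * scale, stride)
                  for cx in grid(img_w * scale, stride)]
    chips = []
    while remaining:
        best, covered = None, set()
        for cx, cy in candidates:
            hit = {i for i in remaining if inside(scaled[i], cx, cy)}
            if len(hit) > len(covered):
                best, covered = (cx, cy), hit
        if best is None:
            break  # no candidate chip covers any remaining box
        chips.append(best)
        remaining -= covered
    return chips


# Two distant objects need two chips; two nearby objects share one.
far_apart = [Box(10, 10, 100, 100), Box(800, 400, 950, 580)]
close_by = [Box(10, 10, 100, 100), Box(200, 200, 300, 300)]
chips_far = select_chips(far_apart, img_w=1000, img_h=600, scale=1.0)
chips_close = select_chips(close_by, img_w=1000, img_h=600, scale=1.0)
```

Because the number of chips returned depends on how the objects are laid out, a crowded image yields more chips than a sparse one, which is the sense in which the workload adapts to scene complexity. Background (negative) chips driven by region proposals are not modeled in this sketch.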

The algorithm operates on resampled low-resolution chips of 512x512 pixels, which allows a batch size as large as 20 on a single GPU even with a ResNet-101 backbone. With batches of this size, training can benefit from batch normalization without synchronizing batch-normalization statistics across GPUs. This brings the training of instance-level recognition tasks closer to the protocol used for image classification.

Numerical Results and Claims

The Faster-RCNN-based implementation with a ResNet-101 backbone obtains a mean average precision (mAP) of 47.6% on the COCO dataset for bounding-box detection and processes five images per second during inference on a single GPU. During training it processes approximately five chips per image, which keeps the computational load close to that of single-scale training. The paper also suggests that, contrary to the commonly accepted guideline, training on high-resolution images may not be essential for instance-level recognition.
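As a back-of-the-envelope check on the pixel budget, the figures quoted on this page (about five 512x512 chips per image, single-scale training at 800x1333, and a largest pyramid scale of 1400x2000) can be multiplied out directly. The variable names below are illustrative. The exact products give roughly 23% more pixels than a full 800x1333 image, which is in the same range as the reported 30%; how the per-image pixel counts were averaged to obtain that figure is not specified here.

```python
CHIP_SIDE = 512             # chip resolution from the text
CHIPS_PER_IMAGE = 5         # approximate average from the text
SINGLE_SCALE = (800, 1333)  # common single-scale training size
LARGEST_SCALE = (1400, 2000)  # extreme pyramid resolution from the text

sniper_pixels = CHIPS_PER_IMAGE * CHIP_SIDE * CHIP_SIDE
single_scale_pixels = SINGLE_SCALE[0] * SINGLE_SCALE[1]
largest_scale_pixels = LARGEST_SCALE[0] * LARGEST_SCALE[1]

# Chip pixels relative to one full single-scale image.
ratio_vs_single = sniper_pixels / single_scale_pixels
# Chip pixels relative to the largest pyramid level alone.
ratio_vs_largest = sniper_pixels / largest_scale_pixels

print(f"chip pixels per image:  {sniper_pixels:,}")
print(f"single-scale pixels:    {single_scale_pixels:,}")
print(f"largest-scale pixels:   {largest_scale_pixels:,}")
print(f"chips / single scale:   {ratio_vs_single:.2f}")
print(f"chips / largest scale:  {ratio_vs_largest:.2f}")
```

Under these numbers, the chips for one image amount to less than half the pixels of the 1400x2000 pyramid level alone, before counting the other levels that full multi-scale training would also process.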

Implications and Future Directions

In practical terms, SNIPER is a step toward more resource-efficient training: it trains on low-resolution chips while maintaining high detection accuracy. In theoretical terms, it calls into question the assumed need for high-resolution training images and full-image context in object detection.

Future work could explore similar savings in multi-scale inference, particularly by reducing the amount of background that is processed. It also raises the question of how small a chip can be while still retaining enough context for accurate recognition.

In conclusion, by focusing on adaptive sampling and scale-specific context regions, SNIPER achieves significant improvements in training efficiency and challenges the assumption that high-performing instance-level recognition requires training on high-resolution images. The research lays a foundation for further work on efficient training schemes in computer vision.
