BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation (1503.01640v2)

Published 5 Mar 2015 in cs.CV

Abstract: Recent leading approaches to semantic segmentation rely on deep convolutional networks trained with human-annotated, pixel-level segmentation masks. Such pixel-accurate supervision demands expensive labeling effort and limits the performance of deep networks that usually benefit from more training data. In this paper, we propose a method that achieves competitive accuracy but only requires easily obtained bounding box annotations. The basic idea is to iterate between automatically generating region proposals and training convolutional networks. These two steps gradually recover segmentation masks for improving the networks, and vice versa. Our method, called BoxSup, produces competitive results supervised by boxes only, on par with strong baselines fully supervised by masks under the same setting. By leveraging a large amount of bounding boxes, BoxSup further unleashes the power of deep convolutional networks and yields state-of-the-art results on PASCAL VOC 2012 and PASCAL-CONTEXT.

Semantic segmentation is crucial for various applications within computer vision, including autonomous driving, medical imaging, and scene understanding. Traditional approaches for semantic segmentation rely heavily on deep convolutional neural networks (CNNs) trained with pixel-level segmentation masks. Acquiring these masks requires significant manual effort and cost. This paper by Jifeng Dai, Kaiming He, and Jian Sun proposes a novel approach named BoxSup that aims to mitigate this challenge by using bounding box annotations instead of pixel-level masks for supervising convolutional networks.

Methodology

BoxSup leverages bounding box annotations to iteratively generate and refine segmentation masks, using these refined masks to supervise the training of convolutional networks. The method employs a two-step process: (1) generating candidate segmentation masks using unsupervised region proposal methods, and (2) training the CNN using these approximate masks. This iterative process continues to improve the quality of the masks and the performance of the network.

The key steps in BoxSup can be summarized as follows:

  1. Initial Segmentation Mask Generation: Using unsupervised region proposal methods like Multiscale Combinatorial Grouping (MCG) to generate candidate segmentation masks.
  2. Iterative Refinement: Segment labels are iteratively refined by selecting the most appropriate candidates based on both IoU with the ground truth bounding boxes and network feedback.
  3. Network Training: The CNN is trained using these candidate masks, even though they are initially coarse. The updated network further refines the segmentation masks, creating a self-improving loop.
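The candidate-selection step above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `iou`, `select_candidate`, and the weight `lam` are hypothetical names, and the paper's actual objective combines an overlapping cost with a regression cost and samples among top-ranked candidates rather than always taking the single best one.

```python
import numpy as np

def iou(mask, box_mask):
    """Intersection-over-union between a binary candidate mask and a
    filled bounding-box mask (both HxW boolean arrays)."""
    inter = np.logical_and(mask, box_mask).sum()
    union = np.logical_or(mask, box_mask).sum()
    return inter / union if union > 0 else 0.0

def select_candidate(candidates, box_mask, net_prob, lam=0.5):
    """Pick the region proposal that best balances agreement with the
    annotated box and agreement with the network's current prediction.

    candidates : list of HxW boolean masks (e.g. from MCG)
    box_mask   : HxW boolean mask filled from the ground-truth box
    net_prob   : HxW array of the network's foreground probability
    lam        : hypothetical weight trading off the two terms
    """
    def score(mask):
        overlap = iou(mask, box_mask)                         # fit to the box
        agree = net_prob[mask].mean() if mask.any() else 0.0  # network feedback
        return lam * overlap + (1 - lam) * agree
    return max(candidates, key=score)
```

As the network improves, `net_prob` becomes more reliable, so the selected candidates tighten around true object boundaries, which in turn yields better training targets for the next round.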

Experimental Results

The BoxSup method was evaluated on the PASCAL VOC 2012 and PASCAL-CONTEXT datasets. The results demonstrate that BoxSup achieves performance on par with, and in some cases exceeding, fully mask-supervised methods:

  • Box-supervised training achieved 62.0% mean IoU (mIoU) on the PASCAL VOC 2012 validation set, against 63.8% mIoU achieved by fully mask-supervised training.
  • Semi-supervised training (using a combination of masks and bounding box annotations) achieved 63.5% mIoU, indicating that BoxSup can significantly reduce the annotation cost with minimal performance loss.
  • On the PASCAL VOC 2012 test set, a box-supervised model trained using 10,582 bounding boxes achieved 64.6% mIoU, outperforming the WSSL method which scored 60.4% mIoU under the same conditions.

Further, BoxSup leverages the COCO dataset's bounding box annotations to augment the network's semantic segmentation capability. When incorporating the COCO bounding box annotations, the BoxSup method achieved 71.0% mIoU, exceeding the 70.4% mIoU registered by methods using pixel-level masks.
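For reference, the mIoU figures quoted above are intersection-over-union scores averaged over classes. A minimal sketch of the metric (function name is illustrative; real PASCAL VOC evaluation accumulates per-class counts over the whole dataset and handles an ignore label):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes for a pair of integer
    label maps `pred` and `gt` of the same shape."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:   # class absent from both maps; skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```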

Implications and Future Directions

The BoxSup approach has significant implications:

  • Practicality: BoxSup reduces the dependency on pixel-level masks, making model training more practical and scalable by leveraging more readily available bounding box annotations.
  • Efficiency: The iterative refinement of network training through proxy masks mitigates the need for extensive manual annotation without substantial drops in segmentation accuracy.
  • Model Generalization: By efficiently utilizing large-scale bounding box annotations, BoxSup improves the generalization of semantic segmentation models.
  • Future Integration: Future work can integrate BoxSup with other state-of-the-art CNN architectures and unsupervised region proposal methods to further enhance model performance.

In conclusion, BoxSup provides a compelling alternative to traditional pixel-level masking for supervising semantic segmentation networks, making large-scale, high-accuracy segmentation both feasible and cost-effective. The methodological innovations and empirical achievements establish BoxSup as a valuable approach for advancing semantic segmentation using bounding box annotations.

Authors (3)
  1. Jifeng Dai (131 papers)
  2. Kaiming He (71 papers)
  3. Jian Sun (415 papers)
Citations (1,015)