Dominant Set Clustering and Pooling for Multi-View 3D Object Recognition

Published 4 Jun 2019 in cs.CV | (1906.01592v1)

Abstract: View based strategies for 3D object recognition have proven to be very successful. The state-of-the-art methods now achieve over 90% correct category level recognition performance on appearance images. We improve upon these methods by introducing a view clustering and pooling layer based on dominant sets. The key idea is to pool information from views which are similar and thus belong to the same cluster. The pooled feature vectors are then fed as inputs to the same layer, in a recurrent fashion. This recurrent clustering and pooling module, when inserted in an off-the-shelf pretrained CNN, boosts performance for multi-view 3D object recognition, achieving a new state of the art test set recognition accuracy of 93.8% on the ModelNet 40 database. We also explore a fast approximate learning strategy for our cluster-pooling CNN, which, while sacrificing end-to-end learning, greatly improves its training efficiency with only a slight reduction of recognition accuracy to 93.3%. Our implementation is available at https://github.com/fate3439/dscnn.


Citations (174)


Summary


The paper "Dominant Set Clustering and Pooling for Multi-View 3D Object Recognition" by Chu Wang, Marcello Pelillo, and Kaleem Siddiqi takes a view-based approach to 3D object recognition. The authors pool multi-view information using dominant set clustering, which improves recognition accuracy by aggregating features only across views of the same object that are similar to one another.

Summary of Contributions

Dominant Set Clustering and Pooling Module: The core contribution of the paper is a recurrent clustering and pooling module that identifies groups of similar views of a 3D object (dominant sets) and aggregates features within each group. The module is inserted into an off-the-shelf pretrained convolutional neural network (CNN) such as VGG-M, and improves multi-view recognition by pooling according to view similarity rather than pooling all views indiscriminately.
Performance Improvement: The proposed method achieves a test set accuracy of 93.8% on the ModelNet 40 dataset, a new state of the art for multi-view 3D object recognition at the time of publication. This improves on prior methods that applied full-stride pooling across all views without first clustering them.
Efficient Training Strategy: In addition to their primary method, the authors explore a fast approximate learning strategy that gives up end-to-end learning in exchange for much greater training efficiency. The cost is only a slight reduction in recognition accuracy, to 93.3%.
Fusion of Multiple Feature Modalities: Beyond conventional appearance (RGB) images, the authors propose using additional feature types such as depth and surface normals, which together improve the system's discriminative ability.

Technical Breakdown

The dominant set clustering operates on a view similarity graph constructed from CNN-generated feature vectors, where edge weights represent pairwise view similarities. The clustering process identifies subsets of views (dominant sets) with high internal coherence and significant external dissimilarity, allowing the pooling layer to perform feature aggregation selectively within each dominant set.
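The clustering step described above can be illustrated with a minimal sketch. It extracts dominant sets with replicator dynamics, the standard iteration from the dominant set literature, and peels them off one at a time. The similarity measure (cosine similarity clipped at zero), the toy 2-D feature vectors, and all function names and parameters (`cosine_affinity`, `dominant_set`, `cluster_views`, `cutoff`, and so on) are assumptions made for illustration; the summary does not say which similarity function or solver the authors use.

```python
import math

def cosine_affinity(features):
    """Symmetric affinity matrix: cosine similarity clipped at 0, zero diagonal.
    (Assumed similarity measure, chosen only for illustration.)"""
    n = len(features)
    norms = [math.sqrt(sum(v * v for v in f)) for f in features]
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(features[i], features[j]))
            denom = norms[i] * norms[j]
            sim = max(0.0, dot / denom) if denom > 0 else 0.0
            A[i][j] = A[j][i] = sim
    return A

def dominant_set(A, iters=1000, tol=1e-9, cutoff=1e-4):
    """Run replicator dynamics from the barycenter of the simplex and
    return the indices whose weight stays above `cutoff`."""
    n = len(A)
    x = [1.0 / n] * n
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        total = sum(x[i] * Ax[i] for i in range(n))
        if total <= 0.0:
            return [0]  # no positive affinity left: emit a singleton
        new_x = [x[i] * Ax[i] / total for i in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if change < tol:
            break
    return [i for i, w in enumerate(x) if w > cutoff]

def cluster_views(features):
    """Peel off dominant sets one at a time until every view is assigned."""
    remaining = list(range(len(features)))
    clusters = []
    while remaining:
        A = cosine_affinity([features[i] for i in remaining])
        members = [remaining[k] for k in dominant_set(A)]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters

# Toy per-view feature vectors: three views of one "side", two of another.
views = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
clusters = cluster_views(views)
```

Each returned cluster is a list of view indices; a pooling layer would then aggregate features within each list separately.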

The module is recurrent: the feature vectors pooled within each cluster are fed back into the same layer as inputs, so the model iteratively refines its features. Clustering and pooling repeat until convergence, and the final feature vector is then output for classification.
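The recurrence can be sketched as a short loop. This is a simplified illustration, not the authors' implementation: the clustering function is a greedy cosine-threshold stand-in rather than dominant set clustering, element-wise max pooling is an assumed choice of pooling operator, and the stopping rule (stop when a round merges nothing, then pool whatever remains into one vector) is one plausible reading of "until convergence". All names (`greedy_clusters`, `recurrent_cluster_pool`, `threshold`, `max_rounds`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom > 0 else 0.0

def greedy_clusters(features, threshold=0.9):
    """Stand-in clustering (NOT dominant sets): a view joins the first
    cluster whose seed view it resembles, else it starts a new cluster."""
    clusters = []
    for i, f in enumerate(features):
        for c in clusters:
            if cosine(features[c[0]], f) >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def max_pool(vectors):
    """Element-wise max over a list of equal-length vectors."""
    return [max(col) for col in zip(*vectors)]

def recurrent_cluster_pool(features, cluster_fn=greedy_clusters, max_rounds=10):
    """Cluster, pool within clusters, and feed the pooled vectors back in."""
    current = [list(f) for f in features]
    for _ in range(max_rounds):
        clusters = cluster_fn(current)
        if len(clusters) == len(current):
            break  # nothing merged this round: treat as converged
        current = [max_pool([current[i] for i in c]) for c in clusters]
    return max_pool(current)  # single descriptor passed on to the classifier

views = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
descriptor = recurrent_cluster_pool(views)
```

Because `cluster_fn` is a parameter, a dominant set routine with the same signature (a list of feature vectors in, a list of index lists out) could be swapped in without changing the loop.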

Implications and Future Directions

The work is most relevant to applications that must recognize 3D objects from multiple views, notably robotics, autonomous systems, and augmented reality, where efficient and accurate recognition across viewpoints is critical.

Theoretically, the introduction of dominant set clustering into the CNN framework opens avenues for exploring other clustering paradigms and recurrent structures for feature aggregation in different computer vision contexts. Additionally, the approach can be extended to other domains requiring efficient multi-view data analysis, such as medical imaging or surveillance.

In terms of future work, it may be beneficial to explore the adaptation of this method to different network architectures or extend the clustering and pooling approach beyond the realm of 3D object recognition to more general image classification tasks. Moreover, investigating the combination of the current approach with temporal data or dynamic scenes could yield interesting insights and applications in real-world environments.