
Multi-view Convolutional Neural Networks for 3D Shape Recognition (1505.00880v3)

Published 5 May 2015 in cs.CV and cs.GR

Abstract: A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.

Citations (3,071)

Summary

  • The paper demonstrates that representing 3D shapes using multiple 2D views yields higher recognition accuracy than traditional 3D descriptors.
  • It proposes a novel multi-view CNN architecture with a view-pooling layer to aggregate features efficiently from different angles.
  • Empirical results on ModelNet40 show significant improvements, with the model achieving 89.9% classification accuracy and robust retrieval performance.

Multi-view Convolutional Neural Networks for 3D Shape Recognition

The paper "Multi-view Convolutional Neural Networks for 3D Shape Recognition" by Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller presents an in-depth exploration of recognizing 3D shapes with Convolutional Neural Networks (CNNs) operating on rendered 2D views. The work compares view-based descriptors against traditional 3D descriptors, demonstrating superior performance for the former.

Main Contributions

The primary contributions of this paper can be broadly categorized as follows:

  1. View-based Shape Descriptors:
    • The paper challenges the traditional reliance on direct 3D descriptors such as voxel grids and polygon meshes. Instead, it introduces a method of representing 3D shapes using a collection of their rendered 2D views.
    • The authors demonstrate that even a single 2D view can achieve higher recognition accuracy compared to state-of-the-art 3D shape descriptors. They postulate that this can be attributed to the finer granularity of 2D representations versus low-resolution voxel grids.
  2. Multi-view CNN Architecture:
    • A novel CNN architecture is proposed that aggregates multiple 2D views into a compact shape descriptor. This architecture, termed multi-view CNN (MVCNN), demonstrates marked improvements in recognition tasks.
    • The paper shows the utility of this architecture not only for standard 3D shape recognition but also for recognizing human-drawn sketches.
  3. Implementation and Evaluation:
    • Evaluation was conducted on the ModelNet40 dataset, where MVCNN significantly outperforms existing methods. For instance, MVCNN achieved 89.9% classification accuracy, a notable improvement over the 77.3% accuracy of the state-of-the-art 3D ShapeNets.
    • The paper also highlights the effectiveness of the proposed method in retrieval tasks, achieving a mean average precision (mAP) of 70.1%.

Key Methodological Insights

  1. Data Representation:
    • The experimental setup renders each 3D shape from multiple fixed viewpoints. Two camera setups were used: one assuming a known upright orientation (12 views) and one making no orientation assumption (80 views).
    • The resulting 2D images are fed into a CNN which generates individual descriptors for each view.
  2. Multi-view CNN Design:
    • A pivotal aspect of the architecture is the view-pooling layer, which combines features from different views. The view-pooling leverages max-pooling across the views to form a single, compact descriptor.
    • This aggregated descriptor allows for more efficient shape retrieval and recognition.
  3. Transfer Learning and Fine-Tuning:
    • By fine-tuning CNNs pre-trained on large 2D image databases (e.g., ImageNet) using 3D shape datasets, the model harnesses the robustness of pre-trained CNN features while adapting them to the specific task of 3D shape recognition.
  4. Saliency Mapping:
    • The paper introduces methods for generating saliency maps to visualize informative regions across the multiple views, thus offering insights into the decision-making process of MVCNN.
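The view-pooling operation described above reduces to an element-wise maximum over per-view feature vectors. A minimal NumPy sketch (array shapes and function names here are illustrative, not taken from the paper's code):

```python
import numpy as np

def view_pool(view_features: np.ndarray) -> np.ndarray:
    """Element-wise max-pooling across views.

    view_features: array of shape (num_views, feat_dim), one CNN
    feature vector per rendered view of the same 3D shape.
    Returns a single (feat_dim,) shape descriptor.
    """
    return view_features.max(axis=0)

# Toy example: 3 views, 4-dimensional features.
feats = np.array([
    [0.1, 0.9, 0.0, 0.3],
    [0.5, 0.2, 0.7, 0.1],
    [0.4, 0.4, 0.2, 0.8],
])
descriptor = view_pool(feats)  # -> [0.5, 0.9, 0.7, 0.8]
```

In the paper's architecture this layer sits after the final convolutional stage; the pooled descriptor then passes through the remaining fully connected layers to produce the classification.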
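The transfer-learning step can be illustrated in simplified form by training a new softmax head on top of frozen pretrained features. This is only a linear-probe stand-in: the paper fine-tunes the whole network end-to-end, and the random-projection "trunk" below is an assumption for illustration, not the actual pretrained CNN.

```python
import numpy as np

rng = np.random.default_rng(42)

# Frozen "pretrained" trunk: a fixed random projection standing in for an
# ImageNet-pretrained CNN feature extractor (purely illustrative).
W_trunk = rng.normal(size=(8, 16)) / np.sqrt(8)

def features(x):
    """Map raw inputs (n, 8) to frozen features (n, 16)."""
    return np.tanh(x @ W_trunk)

def train_head(X, y, classes=3, lr=0.1, steps=300):
    """Train a new softmax head on frozen features by gradient descent.

    A linear probe only; actual fine-tuning would also update the trunk,
    typically with a lower learning rate on earlier layers.
    """
    F = features(X)
    W = np.zeros((F.shape[1], classes))
    Y = np.eye(classes)[y]  # one-hot labels
    losses = []
    for _ in range(steps):
        logits = F @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        losses.append(-np.mean(np.log(p[np.arange(len(y)), y])))
        W -= lr * F.T @ (p - Y) / len(X)  # softmax cross-entropy gradient
    return W, losses

X = rng.normal(size=(60, 8))
y = rng.integers(0, 3, size=60)  # toy labels
W_head, losses = train_head(X, y)  # training loss falls from log(3)
```

Starting from a zero-initialized head, the initial loss is exactly log(3) (uniform predictions over three classes), and gradient descent drives it down as the head adapts to the new task.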
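Saliency maps of this kind are typically the gradient magnitude of the class score with respect to input pixels. The framework-free sketch below approximates that gradient by central finite differences on an arbitrary scoring function (the function and shapes are illustrative assumptions; in practice one would use backpropagation, which is far faster):

```python
import numpy as np

def saliency_map(score_fn, image: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Approximate |d score / d pixel| by central finite differences.

    score_fn: maps an image array to a scalar class score.
    Backprop-based saliency computes the same gradient exactly;
    finite differences merely keep this sketch framework-free.
    """
    grad = np.zeros_like(image, dtype=float)
    for i in range(image.size):
        bump = np.zeros_like(image, dtype=float)
        bump.flat[i] = eps
        grad.flat[i] = (score_fn(image + bump) - score_fn(image - bump)) / (2 * eps)
    return np.abs(grad)

# Toy linear "classifier": score = sum(w * x), so saliency equals |w|.
w = np.array([[1.0, -2.0], [0.5, 0.0]])
x = np.zeros((2, 2))
sal = saliency_map(lambda img: float((w * img).sum()), x)  # -> |w|
```

For MVCNN, computing such a map per rendered view highlights which regions of each view drive the aggregated prediction.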

Empirical Results

The empirical results underscore the effectiveness of multi-view CNNs. On the ModelNet40 benchmark:

  • The MVCNN model exhibited an 89.9% classification accuracy using 12 views.
  • With the more comprehensive 80-view camera setup, accuracy did not improve significantly, indicating that 12 views may suffice for most practical purposes.

Additional evaluations on the SketchClean dataset further illustrate the model's versatility by demonstrating strong performance in sketch recognition tasks. For instance, MVCNN outperformed conventional CNN models and came close to human-level accuracy.

Implications and Future Directions

The findings from this paper carry significant implications for both theoretical research and practical applications in AI:

  • Theoretical Implications:
    • The research bridges the gap between 2D and 3D shape recognition, underscoring the potential for utilizing 2D image-based methods to approach traditionally 3D problems.
    • The proposed multi-view approach can pave the way for new architecture designs that balance computational efficiency with high-dimensional spatial feature extraction in 3D tasks.
  • Practical Applications:
    • The methodology can be directly applied to various domains such as robotics, where real-time 3D object recognition from partial views is critical.
    • Enhanced retrieval systems in digital asset management and 3D content creation platforms could benefit significantly from the proposed approach.

Future research could explore adapting MVCNNs for dynamic environments and real-world applications where object orientation is not fixed, such as in video-based recognition tasks. Additionally, methods to automatically select the most informative views dynamically, rather than relying on fixed viewpoints, could optimize both efficiency and accuracy.

Overall, the work presented in "Multi-view Convolutional Neural Networks for 3D Shape Recognition" offers pivotal insights and sets a new benchmark for 3D shape recognition using learned view-based descriptors.