- The paper introduces the Network Grafting Algorithm to attach new sensor-specific front ends to pretrained networks via self-supervised feature matching.
- It demonstrates a 49.11% relative improvement in average precision (AP50) on a thermal camera dataset, and near-parity with the intensity-frame baseline for event cameras.
- The approach significantly reduces the reliance on large labeled datasets, enabling efficient integration of novel sensors in practical computer vision tasks.
Learning to Exploit Multiple Vision Modalities by Using Grafted Networks
The paper "Learning to Exploit Multiple Vision Modalities by Using Grafted Networks" by Yuhuang Hu et al. presents a novel approach for incorporating information from non-traditional vision sensors, such as thermal, hyperspectral, polarization, and event cameras, into existing deep learning models originally designed for conventional intensity frames. This is achieved through the Network Grafting Algorithm (NGA), which enables deep networks to leverage the unique data provided by these novel sensors without requiring large labeled datasets.
Methodology and Approach
The core contribution of this work is the Network Grafting Algorithm, which trains a new front-end network to process unconventional visual inputs so that it can be grafted onto an existing pretrained deep network. Because only the front end is swapped, the grafted network performs its task with no increase in inference cost and without requiring extensive labeled data.
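The grafting idea can be sketched with stand-in callables for the networks. All names and the toy computations below are illustrative, not the authors' code; the point is only that the new front end must produce features the frozen downstream network can consume unchanged:

```python
def pretrained_front_end(intensity_frame):
    """Original front end, trained on conventional intensity frames."""
    return [v * 2.0 for v in intensity_frame]  # stand-in for early conv layers

def thermal_front_end(thermal_frame):
    """New front end, trained (self-supervised) to mimic the features above
    from thermal input. Its outputs match the pretrained features in shape."""
    return [v * 2.0 for v in thermal_frame]

def pretrained_backend(features):
    """Frozen middle and back of the pretrained network (e.g., detector head)."""
    return sum(features)

def grafted_network(thermal_frame):
    # Grafting: swap the front end, keep everything downstream untouched,
    # so inference cost is the same as the original network's.
    return pretrained_backend(thermal_front_end(thermal_frame))
```

Since the backend is reused as-is, only the (small) front end's parameters are trained.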
The NGA operates by replacing the front end of a network pretrained on intensity frames with a new front-end network that handles the unconventional input. Training is self-supervised: the new front end is optimized to make its features match those of the pretrained front end, using a combination of Feature Reconstruction Loss, Feature Evaluation Loss, and Feature Style Loss. Because the matched features are the ones the rest of the pretrained network already expects, the frozen downstream layers can apply their learned representations to the new modality. This approach marries the utility of pretrained deep models with novel sensor data, providing a practical transfer-learning route that bypasses the scarcity of labeled data for new sensor types.
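The training objective described above can be sketched as a weighted sum of the three losses. This is a minimal NumPy sketch under common interpretations of these terms (mean-squared feature matching for reconstruction and evaluation, Gram-matrix matching for style, as in neural style transfer); the loss weights and normalization are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two feature arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def gram_matrix(feat):
    """Channel-correlation Gram matrix of a (C, H, W) feature map."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def nga_loss(grafted_feat, target_feat, grafted_deep, target_deep,
             w_rec=1.0, w_eval=1.0, w_style=1.0):
    """Illustrative NGA objective: match the new front end's features
    (grafted_*) to the pretrained network's features (target_*)."""
    l_rec = mse(grafted_feat, target_feat)      # feature reconstruction loss
    l_eval = mse(grafted_deep, target_deep)     # feature evaluation loss,
                                                # computed at a deeper layer
    l_style = mse(gram_matrix(grafted_feat),
                  gram_matrix(target_feat))     # feature style loss
    return w_rec * l_rec + w_eval * l_eval + w_style * l_style
```

When the grafted features exactly reproduce the pretrained ones, all three terms vanish; any mismatch in values, deeper responses, or channel statistics increases the loss.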
Experimental Results
The authors demonstrate the efficacy of NGA on object detection tasks using thermal and event camera datasets. On the thermal camera dataset, the grafted network achieved a 49.11% relative improvement in average precision (AP50) over the pretrained network applied to the corresponding intensity frames. For event cameras, the grafted network reached precision close to that of the intensity-driven network. The method is also efficient: training takes only a few hours on a single GPU, and the new front end accounts for only about 5-8% of the total parameters.
Implications and Future Directions
The introduction of NGA opens several avenues for advancing computer vision capabilities. Practically, it allows new sensor modalities to harness the power of existing deep learning models without extensive retraining, making it feasible to integrate novel sensors into real-world applications with constrained computational resources or limited dataset availability. Theoretically, the approach embodies a significant refinement in transfer learning practices, accommodating cross-modal adaptation with reduced labeled data dependency.
In the future, as more complex sensor technologies emerge, algorithms similar to NGA could be developed to address increasing variability and complexity in sensor data. Future research might explore the extension of this method to handle even more heterogeneous data types and further optimize the architecture of the grafted front end for specific tasks. Additionally, combining insights from this approach with advances in unsupervised domain adaptation and semi-supervised learning could further enhance the robustness and applicability of multi-modal vision systems.
Overall, the work encapsulates a significant step towards more versatile, adaptable, and efficient machine learning systems in computer vision, paving the way for broader usage of advanced sensors in real-world scenarios.