- The paper introduces a neural architecture search space over feature-fusion architectures for multimodal classification.
- It employs sequential model-based optimization with temperature-scheduled sampling to efficiently explore complex fusion architectures while reducing computational cost.
- Empirical results on datasets such as NTU RGB+D show state-of-the-art performance, outperforming fixed late-fusion baselines.
Insights into Multimodal Fusion Architecture Search
The paper "MFAS: Multimodal Fusion Architecture Search" presents a compelling approach to addressing the multimodal classification problem through the lens of neural architecture search. It proposes a novel search space that encompasses a variety of potential fusion architectures, utilizing sequential model-based optimization (SMBO) to navigate this space. This approach is applied across multiple datasets, demonstrating a notable improvement in performance metrics, particularly in action recognition tasks.
Search Space and Methodology
The authors introduce a search space designed for multimodal fusion, framed as a combinatorial problem. Architectures are represented as variable-length sequences in which each fusion step specifies which hidden layer of each unimodal network to combine and which nonlinearity to apply, allowing features from different depths of the modality-specific networks to interact. Exploration of this space is driven by a surrogate function, with temperature-scheduled sampling guiding the search progressively from simple to more complex architectures.
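To make the encoding concrete, the following minimal sketch samples architectures of this form; the layer counts, activation list, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import random

# Each fusion step is a triple: (layer index in modality-1 network,
# layer index in modality-2 network, activation choice). These constants
# are illustrative assumptions.
N_LAYERS_M1 = 4      # hidden layers available in the first unimodal network
N_LAYERS_M2 = 4      # hidden layers available in the second unimodal network
ACTIVATIONS = ["relu", "sigmoid", "leaky_relu"]

def sample_architecture(n_fusion_steps):
    """Draw one variable-length fusion architecture uniformly at random."""
    return [
        (
            random.randrange(N_LAYERS_M1),
            random.randrange(N_LAYERS_M2),
            random.randrange(len(ACTIVATIONS)),
        )
        for _ in range(n_fusion_steps)
    ]

# Progressively longer architectures: start with single-step fusions and grow.
for depth in range(1, 4):
    print(depth, sample_architecture(depth))
```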
Sequential model-based optimization provides a robust framework for this exploration, predicting the performance of candidate architectures so that far fewer need to be trained than in exhaustive search. The approach is made practical by efficient implementation choices such as weight sharing among sampled fusion architectures, which reduces memory usage and speeds up the search significantly.
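The sketch below illustrates this overall loop under the same illustrative assumptions as the previous snippet: a toy surrogate stands in for the paper's learned performance predictor, `train_and_score` stands in for actually training a fusion network, and a cooling temperature shifts sampling from broad exploration toward exploitation of the surrogate's top predictions.

```python
import math
import random

ACTIVATIONS = ["relu", "sigmoid", "leaky_relu"]  # illustrative choices

def sample_architecture(depth, n_layers_m1=4, n_layers_m2=4):
    """Same triple encoding as the previous sketch."""
    return [
        (random.randrange(n_layers_m1),
         random.randrange(n_layers_m2),
         random.randrange(len(ACTIVATIONS)))
        for _ in range(depth)
    ]

def train_and_score(arch):
    # Stand-in for "briefly train the fusion layers and return validation
    # accuracy"; in the paper this step is kept cheap by reusing pretrained
    # unimodal networks and sharing fusion weights across sampled architectures.
    return random.random()

def surrogate_predict(arch, history):
    # Toy surrogate: average measured accuracy of previously evaluated
    # architectures that share at least one fusion triple with `arch`.
    related = [acc for a, acc in history if set(a) & set(arch)]
    fallback = [acc for _, acc in history] or [0.5]
    pool = related or fallback
    return sum(pool) / len(pool)

def sample_with_temperature(candidates, history, temperature, k):
    # Boltzmann sampling over surrogate predictions: high temperature explores
    # broadly, low temperature concentrates on predicted top performers.
    weights = [math.exp(surrogate_predict(a, history) / temperature)
               for a in candidates]
    return random.choices(candidates, weights=weights, k=k)

history = []  # (architecture, measured accuracy) pairs observed so far
for step, temperature in enumerate([10.0, 3.0, 1.0, 0.3]):  # cooling schedule
    depth = step + 1  # progress from simple to more complex architectures
    candidates = [sample_architecture(depth) for _ in range(50)]
    for arch in sample_with_temperature(candidates, history, temperature, k=5):
        history.append((arch, train_and_score(arch)))

best_arch, best_acc = max(history, key=lambda pair: pair[1])
print("best architecture:", best_arch, "score:", round(best_acc, 3))
```

In the actual method, the surrogate is learned from the (architecture, accuracy) pairs accumulated during the search rather than hand-coded as above; the scheduling idea, sampling candidates in proportion to predicted quality while annealing the temperature, is what this toy loop is meant to convey.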
Key Contributions
The paper's contributions rest on extensive experimentation that validates both the importance of optimal feature fusion and the efficacy of the proposed method. Notable contributions include:
- Empirical Validation: Experiments on toy datasets show that the choice of which layers to fuse matters: optimally combining features drawn from different depths outperforms fixed late fusion.
- Search Space Definition: A comprehensive search space is established that acts as a superset of existing fusion methodologies, enabling a wide range of fusion strategies to be explored (a concrete illustration follows this list).
- Adapted Search Approach: AutoML-style architecture search is adapted to the multimodal fusion setting, so that discovered architectures match the constraints and complexity of the target problem.
- State-of-the-Art Performance: Automatically discovered architectures exhibit superior performance across multiple datasets, including the NTU RGB+D action recognition dataset—a challenging benchmark in multimodal evaluation.
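As a concrete illustration of the superset claim above, and still assuming the illustrative triple encoding from the earlier sketches, classical late fusion corresponds to a single point in the search space: one fusion step over the final layer of each unimodal network.

```python
# Illustrative constants: number of usable hidden layers per unimodal network
# and an index into the activation list from the earlier sketch.
N_LAYERS_M1, N_LAYERS_M2, RELU = 4, 4, 0

# Classical late fusion: one fusion step over the last layer of each network.
late_fusion = [(N_LAYERS_M1 - 1, N_LAYERS_M2 - 1, RELU)]

# The same encoding also expresses richer strategies that mix intermediate
# layers across several fusion steps, which fixed late fusion cannot.
mixed_depth_fusion = [(1, 3, RELU), (2, 2, RELU), (3, 3, RELU)]

print(late_fusion, mixed_depth_fusion)
```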
Implications and Future Directions
The outcomes of this research underscore the value of applying neural architecture search to multimodal classification. Optimizing how intermediate representations are fused across modalities yields clear performance gains, suggesting that multimodal tasks benefit from architectures composed to integrate complex feature hierarchies rather than from fixed fusion schemes.
Theoretically, the paper adds to the discussion of neural network interpretability by treating the feature hierarchies of each modality as explicit, composable building blocks, in line with ongoing developments in representation learning.
Practical applications are abundant, with potential impact on areas like autonomous systems, speech recognition, and video analytics, where multimodality is increasingly prevalent.
Future developments may bring more sophisticated exploration methods, including surrogates that use deeper models for performance prediction and search procedures that adjust dynamically based on real-time evaluations. Extending the framework to adaptive settings in which the available multimodal inputs evolve over time would further broaden the applicability of the proposed system to scalable, deployed AI solutions.
This paper marks a significant step toward refining multimodal data integration, laying a foundation for future research that harnesses the complexity of multimodal fusion through automated, intelligent architecture search strategies.