A Detailed Examination of "AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures"
The paper "AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures" aims to address the ongoing challenge of effective video representation through the introduction of a multi-stream neural architecture search algorithm—AssembleNet. This work focuses on optimizing spatio-temporal interactions and connectivity within video CNN architectures, leveraging evolutionary algorithms and connection learning. The research offers significant contributions to video understanding, outperforming previous models on datasets such as Charades and Moments-in-Time (MiT).
The underlying challenge in video representation is capturing spatial and temporal characteristics jointly and effectively. Standard two-stream CNN architectures, such as the one introduced by Simonyan & Zisserman, process appearance and motion in separate streams that are fused only at a single, hand-designed point; this leaves the far larger space of possible temporal resolutions and intermediate fusion locations unexplored. AssembleNet proposes a multi-stream approach that automatically evolves more intricately connected architectures designed to better capture spatio-temporal interactions in video.
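To make the contrast concrete, here is a minimal PyTorch sketch of the classic two-stream pattern: an RGB appearance stream and an optical-flow motion stream that interact only through a final late fusion. All layer shapes, channel counts, and names here are illustrative assumptions, not values from Simonyan & Zisserman or from AssembleNet.

```python
import torch
import torch.nn as nn

class TwoStreamBaseline(nn.Module):
    """Illustrative two-stream baseline: appearance and motion are
    processed independently and fused only at the classifier."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Appearance stream over RGB frames (3 input channels).
        self.rgb_stream = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Motion stream over optical flow (2 channels: x/y displacement).
        self.flow_stream = nn.Sequential(
            nn.Conv3d(2, 64, kernel_size=(3, 7, 7), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion: the streams never interact before this concatenation,
        # which is exactly the rigidity AssembleNet's search relaxes.
        a = self.rgb_stream(rgb).flatten(1)
        m = self.flow_stream(flow).flatten(1)
        return self.classifier(torch.cat([a, m], dim=1))
```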
Methodological Overview
The core contribution of AssembleNet lies in recasting video CNNs as directed acyclic graphs of multi-resolution, multi-stream blocks. Each block consists of space-time convolutional layers operating at its own spatial and temporal resolution. The paper introduces a connection-learning-guided evolutionary algorithm that searches over both the connectivity and the temporal resolutions of these blocks. Nodes in the graph are sub-networks that take inputs from lower-level nodes and feed higher-level nodes, and candidate architectures are refined through fitness-guided selection, with fitness measured by performance on video classification.
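As a rough illustration of one such node, the PyTorch sketch below aggregates inputs from earlier nodes through learnable, sigmoid-gated edge weights before applying factored space-time convolutions. The kernel shapes, channel handling, and the class name AssembleBlock are assumptions made for illustration; the paper's actual blocks additionally vary spatial and temporal resolutions per stream.

```python
import torch
import torch.nn as nn

class AssembleBlock(nn.Module):
    """Sketch of one DAG node: a weighted sum over incoming edges followed
    by factored (temporal then spatial) convolutions. Assumes all inputs
    have already been resampled to a common shape."""
    def __init__(self, num_inputs: int, channels: int):
        super().__init__()
        # One learnable scalar per incoming edge; training drives useless
        # edges toward zero, and the search later reads these weights.
        self.edge_logits = nn.Parameter(torch.zeros(num_inputs))
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                      padding=(1, 0, 0)),  # temporal convolution
            nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                      padding=(0, 1, 1)),  # spatial convolution
            nn.ReLU(),
        )

    def forward(self, inputs: list[torch.Tensor]) -> torch.Tensor:
        # Sigmoid gating keeps each edge weight in (0, 1) and makes the
        # connectivity differentiable.
        gates = torch.sigmoid(self.edge_logits)
        fused = sum(g * x for g, x in zip(gates, inputs))
        return self.conv(fused)
```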
The AssembleNet approach distinguishes itself by learning weighted connections during the evolutionary search and using those weights to guide mutation, a mechanism the authors show outperforms both random search and standard mutation-based evolution. This directed search uncovers non-obvious yet efficient connection patterns that conventional methods tend to miss.
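A sketch of what connection-learning-guided mutation could look like in plain Python: rather than rewiring edges uniformly at random, the weakest learned edge is dropped first and a random valid edge is added in its place. The function name, the data layout (a node-to-inputs dict plus per-edge weights), and the single-edge change per step are all assumptions made for illustration, not the paper's implementation.

```python
import random

def guided_mutate(connectivity, edge_weights, num_changes=1):
    """Drop the lowest-weight edges and add random replacements.
    `connectivity` maps each node id to a list of its input node ids;
    `edge_weights` maps (src, dst) pairs to the sigmoid-gated weights
    learned during the candidate's short training run. Assumes every
    weighted edge appears in `connectivity` and that node ids give a
    topological order."""
    mutated = {node: list(srcs) for node, srcs in connectivity.items()}
    for _ in range(num_changes):
        # Remove the currently weakest edge (lowest learned weight).
        src, dst = min(edge_weights, key=edge_weights.get)
        mutated[dst].remove(src)
        edge_weights.pop((src, dst))
        # Add a random new edge that keeps the graph acyclic: only
        # lower-numbered nodes may feed a given destination.
        nodes = sorted(mutated)
        new_dst = random.choice(nodes[1:])
        candidates = [n for n in nodes
                      if n < new_dst and n not in mutated[new_dst]]
        if candidates:
            mutated[new_dst].append(random.choice(candidates))
    return mutated
```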
Empirical Results
The AssembleNet models outperform existing models on two challenging benchmark datasets. On Charades, AssembleNet achieves an mAP of 58.6%, a substantial improvement over the previous best result of 45.2%. On MiT, AssembleNet attains a top-1 accuracy of 34.27%, making it the first model reported to surpass the 34% mark.
Further comparisons between connection-learning-guided evolution, random search, and standard evolution show that the guided approach navigates the vast architecture search space more efficiently and arrives at better configurations. Ablation studies comparing evolved networks to manually designed multi-stream architectures underscore the importance of the discovered structure, confirming that AssembleNet's performance stems not merely from added capacity but from its optimized connectivity.
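For intuition about how these strategies differ, the following sketch shows a generic fitness-guided evolutionary loop of the tournament-selection kind such comparisons refer to; swapping guided_mutate for a uniformly random mutation would recover standard evolution. Here train_and_score stands in for briefly training a candidate and returning its validation accuracy plus learned edge weights; it and every other name are illustrative assumptions rather than the paper's code.

```python
import random

def evolve(seed_architectures, train_and_score, rounds=200, tournament_size=5):
    """Generic tournament-selection evolution loop. Assumes the seed
    population is at least `tournament_size` large and that
    `train_and_score(arch)` returns (validation_accuracy, edge_weights)."""
    population = []
    for arch in seed_architectures:
        score, weights = train_and_score(arch)
        population.append((score, arch, weights))
    for _ in range(rounds):
        # Tournament selection: mutate the fittest of a random sample,
        # using its learned edge weights to guide the mutation.
        score, parent, weights = max(random.sample(population, tournament_size),
                                     key=lambda p: p[0])
        child = guided_mutate(parent, dict(weights))  # sketch defined above
        child_score, child_weights = train_and_score(child)
        population.append((child_score, child, child_weights))
        # Keep the population size constant by dropping the weakest member.
        population.remove(min(population, key=lambda p: p[0]))
    return max(population, key=lambda p: p[0])
```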
Theoretical and Practical Implications
The implications of AssembleNet are multifaceted. Practically, the architecture enables superior performance for complex video datasets, which are becoming increasingly crucial in numerous AI-driven applications, including autonomous systems and video analytics. Theoretically, the work extends the field of neural architecture search (NAS) by introducing a multi-stream architecture perspective, laying groundwork for future research to explore novel NAS methodologies that can handle diverse input modalities and advanced connectivity patterns.
Future Directions
The introduction of AssembleNet could foster further advancements in video understanding by inspiring exploration into NAS driven by multi-modal interactions. Future studies might extend AssembleNet's principles beyond video action recognition, potentially applying its architecture search framework to other domains requiring nuanced spatial-temporal insights, such as dynamic scene understanding and predictive modeling in robotics.
In conclusion, AssembleNet represents a noteworthy advancement in the design and optimization of video CNN architectures, contributing both practical solutions and theoretical advancements to the field of computer vision and neural architecture search for video understanding.