- The paper introduces Temporally Distributed Networks (TDNet) that reduce per-frame computation for video semantic segmentation by distributing sub-network tasks across sequential frames.
- TDNet employs an attention propagation module to aggregate features robustly across frames despite motion, and a grouped knowledge distillation loss to strengthen the representational power of its shallow sub-networks.
- Experimental results show TDNet achieves state-of-the-art accuracy with significantly lower latency on benchmarks like Cityscapes, enabling high-speed real-time applications.
Overview of Temporally Distributed Networks for Fast Video Semantic Segmentation
The paper introduces Temporally Distributed Networks (TDNet), an approach designed for efficient and accurate video semantic segmentation. The principal innovation of TDNet is to exploit the temporal continuity inherent in video: computation is distributed over sequential frames, which removes redundant per-frame work and improves efficiency.
Core Contributions
- Temporal Distribution of Sub-networks: Instead of running a full deep convolutional neural network (CNN) on every frame, the authors approximate the strong high-level feature needed for segmentation by composing feature groups that shallow sub-networks extract from sequential frames. Because video content is spatially and temporally coherent, each frame only needs to contribute one feature group, which reduces per-frame computational cost while retaining segmentation accuracy (a minimal sketch of this scheme follows the list below).
- Attention Propagation Module: Because objects move between frames, naively stacking features from different time steps would misalign them spatially. The authors therefore introduce an attention propagation module that aggregates the distributed feature groups while compensating for motion and spatial deformation. Relying on attention also avoids optical flow, which is often computationally expensive and error-prone (see the attention sketch after this list).
- Grouped Knowledge Distillation Loss: Knowledge distillation transfers knowledge from a full deep model (the teacher) to the distributed network (the student). The paper introduces a grouped knowledge distillation loss that operates at both the full-feature and sub-feature levels, strengthening the representations learned by each sub-network (a simplified sketch of such a loss follows the code examples below).
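The temporal distribution itself can be pictured as a round-robin over m shallow sub-networks, with each frame producing one feature group and the most recent m groups kept for fusion. The following is a minimal PyTorch-style sketch, not the authors' implementation: the class name `DistributedFeatureExtractor`, the list-based cache, and the round-robin branch selection are illustrative assumptions.

```python
import torch.nn as nn

class DistributedFeatureExtractor(nn.Module):
    """Runs exactly one shallow sub-network per frame and keeps the
    feature groups of the most recent m frames for later fusion.
    Hypothetical sketch; names are not taken from the paper."""

    def __init__(self, subnets):
        super().__init__()
        self.subnets = nn.ModuleList(subnets)  # m shallow branches
        self.m = len(subnets)
        self.cache = []                        # feature groups of recent frames

    def forward(self, frame, t):
        # Round-robin: sub-network (t mod m) processes frame t, so the
        # per-frame cost is roughly 1/m of running the full deep model.
        feat = self.subnets[t % self.m](frame)
        self.cache.append(feat)
        if len(self.cache) > self.m:
            self.cache.pop(0)
        return list(self.cache)  # the m most recent feature groups
```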
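To fuse the cached groups despite inter-frame motion, features from a past frame can be re-expressed in the current frame's layout with query/key/value attention instead of optical flow. The sketch below shows one such cross-frame attention step; the class name `AttentionPropagation`, the 1x1 projections, the key dimension, and attending over full-resolution maps are assumptions (in practice attention would typically be computed on downsampled features to keep the cost low).

```python
import torch
import torch.nn as nn

class AttentionPropagation(nn.Module):
    """Aggregates a past feature map into the current frame's spatial
    layout via scaled dot-product attention. Illustrative sketch only."""

    def __init__(self, channels, key_dim=64):
        super().__init__()
        self.to_q = nn.Conv2d(channels, key_dim, 1)   # queries from current frame
        self.to_k = nn.Conv2d(channels, key_dim, 1)   # keys from past frame
        self.to_v = nn.Conv2d(channels, channels, 1)  # values from past frame

    def forward(self, feat_cur, feat_past):
        b, c, h, w = feat_cur.shape
        q = self.to_q(feat_cur).flatten(2).transpose(1, 2)       # (B, HW, d)
        k = self.to_k(feat_past).flatten(2)                      # (B, d, HW)
        v = self.to_v(feat_past).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)  # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out  # past features aligned to the current frame
```

The aligned outputs from the cached frames can then be combined with the current frame's own feature group before the segmentation head.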
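A two-level distillation objective of the kind described above can be sketched as follows. This is a simplified stand-in: the MSE distance, the channel-wise slicing of the teacher feature into groups, and the weight `alpha` are assumptions for illustration, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def grouped_distillation_loss(student_groups, teacher_feat, alpha=0.5):
    """Match the fused student feature to the teacher's full feature
    (full-feature level) and each student group to a channel slice of
    the teacher feature (sub-feature level). Simplified sketch."""
    fused = torch.cat(student_groups, dim=1)            # full-feature level
    whole_loss = F.mse_loss(fused, teacher_feat)

    chunk = teacher_feat.shape[1] // len(student_groups)
    group_loss = sum(                                    # sub-feature level
        F.mse_loss(g, teacher_feat[:, i * chunk:(i + 1) * chunk])
        for i, g in enumerate(student_groups)
    )
    return whole_loss + alpha * group_loss
```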
Experiments and Results
The authors evaluate TDNet on three benchmark datasets: Cityscapes, CamVid, and NYUD-v2. TDNet achieves state-of-the-art accuracy while running substantially faster and with lower latency. On Cityscapes, for example, TDNet variants outperform existing methods at significantly lower latency, demonstrating the value of temporal distribution and attention-based aggregation in practical settings.
Implications and Future Directions
TDNet has significant implications for applications requiring high-speed video processing, such as autonomous driving, robotic vision, and augmented reality environments. By distributing computational workload across sequential frames, TDNet can facilitate real-time semantic segmentation tasks with reduced hardware demands.
Future work might explore integrating additional modalities (e.g., depth) or scaling TDNet to large-scale, multi-camera systems. Extending temporally distributed computation to other video tasks, such as activity recognition or event detection, is another promising direction.
In summary, TDNet offers an effective framework for fast video semantic segmentation, balancing accuracy and efficiency through its use of temporal continuity and attention mechanisms. The paper adds a valuable contribution to research on video processing, suggesting ways to cut computational cost while maintaining high segmentation quality.