StreamTinyNet: Video Streaming Analysis with Spatial-Temporal TinyML
"StreamTinyNet: Video Streaming Analysis with Spatial-Temporal TinyML" presents a significant advancement in the domain of Tiny Machine Learning (TinyML) by addressing the challenge of the spatial-temporal analysis of video streams on resource-constrained devices. Authored by Hazem Hesham Yousef Shalby, Massimo Pavan, and Manuel Roveri, the paper introduces StreamTinyNet, an innovative neural network architecture for performing multi-frame video streaming analysis (VSA) on tiny devices like Arduino Nicla Vision.
Overview and Motivation
The core challenge addressed in this work is that traditional TinyML approaches process video data frame by frame. Such methods fail to exploit temporal information, which limits their use in tasks that require understanding how a scene evolves over time. StreamTinyNet bridges this gap with a methodology that integrates spatial and temporal information, enabling use cases such as gesture recognition and event detection that are infeasible with frame-by-frame analysis alone.
Proposed Architecture
StreamTinyNet's architecture consists of two main components:
- Spatial Frame-by-Frame Feature Extraction: a convolutional neural network (CNN) that processes each frame individually, reducing its dimensionality and extracting the pertinent spatial features.
- Temporal Combination of Extracted Features: the temporal sequence of extracted features is analyzed with a three-step pipeline that splits the feature maps, applies convolutions along the temporal axis, and uses fully connected layers for the final classification.
This two-stage design separates spatial from temporal processing, so per-frame features can be computed once and reused rather than recomputed, significantly reducing computational redundancy when analyzing multiple frames.
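The sketch below illustrates this two-stage structure in PyTorch. It is a minimal, illustrative reconstruction based on the description above: the channel counts, kernel sizes, and the use of a single Conv1d for the temporal stage are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the two-stage StreamTinyNet design. Layer sizes and the
# Conv1d-based temporal stage are illustrative assumptions.
import torch
import torch.nn as nn


class StreamTinyNetSketch(nn.Module):
    def __init__(self, num_classes: int, window_size: int = 8, feat_dim: int = 64):
        super().__init__()
        # Stage 1: spatial frame-by-frame feature extraction (weights shared
        # across all frames in the window).
        self.spatial = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (B*T, 32, 1, 1)
            nn.Flatten(),             # -> (B*T, 32)
            nn.Linear(32, feat_dim),  # -> (B*T, feat_dim)
        )
        # Stage 2: temporal combination of the per-frame features via a
        # convolution along the temporal axis, then fully connected
        # classification.
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),             # -> (B, feat_dim * T)
            nn.Linear(feat_dim * window_size, num_classes),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W), a window of T consecutive frames.
        b, t, c, h, w = frames.shape
        feats = self.spatial(frames.reshape(b * t, c, h, w))  # (B*T, F)
        feats = feats.reshape(b, t, -1).transpose(1, 2)       # (B, F, T)
        return self.temporal(feats)                           # (B, num_classes)


if __name__ == "__main__":
    model = StreamTinyNetSketch(num_classes=10, window_size=8)
    logits = model(torch.randn(2, 8, 3, 64, 64))
    print(logits.shape)  # torch.Size([2, 10])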
Experimental Analysis
The effectiveness of StreamTinyNet was evaluated on two primary tasks: gesture recognition using the Jester dataset and event detection using the GolfDB dataset.
Gesture Recognition
For gesture recognition, StreamTinyNet achieved an accuracy of 0.81, significantly outperforming traditional frame-by-frame TinyML baselines such as MobileNetV1, MobileNetV2, and MCUNet, whose accuracies fell roughly between 0.34 and 0.40. Notably, StreamTinyNet achieved this with a substantially lower computational footprint and memory demand, demonstrating both efficiency and efficacy.
Event Detection
For event detection, StreamTinyNet was tested on the GolfDB dataset, achieving a 56% Percentage of Correct Events (PCE), markedly better than single-frame TinyML models. The evaluation also examined robustness to the window size T, showing that larger windows improved performance without adding significant computational overhead.
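The separation of stages helps explain why growing T is cheap: the expensive spatial CNN runs once per incoming frame, and only the lightweight temporal head reruns over cached features. The sketch below illustrates this with a feature ring buffer; it assumes the StreamTinyNetSketch model from the earlier sketch, and the buffer handling is an illustrative assumption, not the paper's implementation.

```python
# Hedged sketch of streaming inference: each frame's features are computed
# once and cached, so a larger window T only enlarges the cheap temporal head.
from collections import deque

import torch


def stream_inference(model, frame_source, window_size: int = 8):
    """Classify a video stream one frame at a time, reusing cached features."""
    buffer = deque(maxlen=window_size)  # holds the last T per-frame features
    for frame in frame_source:          # frame: (C, H, W) tensor
        with torch.no_grad():
            feat = model.spatial(frame.unsqueeze(0))  # (1, F), once per frame
        buffer.append(feat)
        if len(buffer) == window_size:
            feats = torch.stack(list(buffer), dim=2)  # (1, F, T)
            yield model.temporal(feats)               # cheap temporal head only


# Example: ten random 64x64 frames through the sketch model defined earlier.
model = StreamTinyNetSketch(num_classes=10, window_size=8)
frames = (torch.randn(3, 64, 64) for _ in range(10))
for logits in stream_inference(model, frames, window_size=8):
    print(logits.argmax(dim=1))
```

With this layout, doubling the window roughly doubles only the temporal head's work; the per-frame CNN cost, which dominates, stays constant per incoming frame.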
Implications and Future Work
The implications of this work are multifaceted:
- Practical Applications: StreamTinyNet can be deployed for real-time video analysis on low-power, resource-constrained devices, making it ideal for applications in smart surveillance, wearable technology, and autonomous systems.
- Theoretical Advancements: This architecture sets a benchmark for future research in TinyML, highlighting the importance of spatial-temporal analysis and encouraging the development of more complex models equipped to handle video data efficiently.
The authors plan to explore adaptive frame rates to optimize power consumption, mechanisms for detecting sensor drift, incremental on-device training, and early-exit strategies to further improve efficiency.
Conclusion
StreamTinyNet represents a substantial contribution to TinyML, addressing a crucial limitation in video analysis by integrating spatial-temporal processing into a single, efficient architecture. Its ability to perform high-accuracy, low-latency video classification on tiny devices without significant computational or memory overhead marks a notable advance in embedded machine learning. The paper paves the way for more sophisticated TinyML models that better exploit temporal data, broadening the horizon of on-device applications.