Arena: A Patch-of-Interest ViT Inference Acceleration System for Edge-Assisted Video Analytics (2404.09245v2)
Abstract: The advent of edge computing has made real-time intelligent video analytics feasible. Previous works, based on traditional model architectures (e.g., CNNs and RNNs), employ various strategies to filter out non-region-of-interest content in order to minimize bandwidth and computation consumption, but they perform poorly in adverse environments. Recently, transformer-based visual foundation models have shown strong performance in adverse environments thanks to their generalization capability. However, they require a large amount of computation, which limits their use in real-time intelligent video analytics. In this paper, we find that visual foundation models such as the Vision Transformer (ViT) also admit a dedicated acceleration mechanism for video analytics. To this end, we introduce Arena, an end-to-end edge-assisted video inference acceleration system based on ViT. We exploit the fact that ViT can be accelerated through token pruning, offloading and feeding only Patches-of-Interest to the downstream models. Additionally, we design an adaptive keyframe inference switching algorithm that adapts to the current video content to jointly optimize accuracy and bandwidth. Through extensive experiments, our findings reveal that Arena can boost inference speeds by up to 1.58× and 1.82× on average while consuming only 47% and 31% of the bandwidth, respectively, all with high inference accuracy.
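The Patch-of-Interest idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how Arena selects patches, so everything below is an assumption made for illustration: the frame is cut into a fixed grid of square patches (as in a standard ViT), the regions of interest are given as pixel bounding boxes (for example, taken from a previous keyframe result), and a patch is kept if it overlaps any box. The names `patches_of_interest` and `kept_fraction` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def patches_of_interest(width: int, height: int, patch: int,
                        boxes: List[Box]) -> List[int]:
    """Return sorted row-major indices of patches that overlap at least one box.

    Illustrative assumption only: the paper's actual selection rule may differ.
    """
    cols = width // patch
    rows = height // patch
    keep = set()
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the area covered by whole patches.
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, cols * patch), min(y1, rows * patch)
        if x1 <= x0 or y1 <= y0:
            continue  # Box is empty or lies outside the frame.
        # Range of patch columns and rows touched by the box.
        c0, c1 = x0 // patch, (x1 - 1) // patch
        r0, r1 = y0 // patch, (y1 - 1) // patch
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                keep.add(r * cols + c)
    return sorted(keep)


def kept_fraction(width: int, height: int, patch: int,
                  kept: List[int]) -> float:
    """Fraction of all patches that are kept (a rough proxy for token count)."""
    total = (width // patch) * (height // patch)
    return len(kept) / total if total else 0.0


if __name__ == "__main__":
    kept = patches_of_interest(64, 64, 16, [(10, 10, 40, 20)])
    print(kept)                              # [0, 1, 2, 4, 5, 6]
    print(kept_fraction(64, 64, 16, kept))   # 0.375
```

In this toy setting only the kept patches would be offloaded and passed to the model as tokens, so both the transmitted data and the token count shrink with the kept fraction. How closely that tracks real bandwidth or latency savings depends on encoding and model details that the abstract does not specify.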