MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection (2401.09923v2)
Abstract: State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are neither efficient nor sufficient because of two implicit operations: (1) concatenating all features in memory for enhancement, which incurs a heavy computational cost; (2) frame-wise memory updating, which prevents the memory from capturing longer-range temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) lightweight key-set construction, which significantly reduces the computational cost; (2) a fine-grained feature-wise updating strategy, which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNet VID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. Most notably, MAMBA achieves 83.7% mAP at 12.6 FPS and 84.6% mAP at 9.1 FPS with a ResNet-101 backbone. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
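The two memory-bank operations described in the abstract can be illustrated with a small sketch. It is not the authors' implementation (see the linked repository for that). All names (`MemoryBank`, `key_set`, `update`, `enhance`, `attend`) are hypothetical, and several details are assumptions: the key set is built by uniform random sampling, a full bank overwrites a random slot on each feature-wise update, and enhancement is plain scaled dot-product attention with a residual connection, standing in for GEO. The point it shows is that each current-frame feature attends to at most `key_set_size` memory entries rather than to every stored feature, so attention cost no longer grows with memory size. It also shows that updates happen per feature rather than per frame, so features from early frames can persist in the bank.

```python
import math
import random
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, keys: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over `keys` (keys double as values)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(query)
    for w, k in zip(weights, keys):
        for i, x in enumerate(k):
            out[i] += (w / total) * x
    return out


class MemoryBank:
    """Hypothetical feature-level memory bank (illustrative sketch, not MAMBA's code)."""

    def __init__(self, capacity: int, key_set_size: int, seed: int = 0):
        self.capacity = capacity          # max number of stored feature vectors
        self.key_set_size = key_set_size  # memory features used per enhancement step
        self.features: List[Vector] = []
        self.rng = random.Random(seed)

    def key_set(self) -> List[Vector]:
        # Assumption: a lightweight key set is a uniform random subset of the
        # bank, instead of concatenating every stored feature.
        k = min(self.key_set_size, len(self.features))
        return self.rng.sample(self.features, k)

    def update(self, new_features: List[Vector]) -> None:
        # Feature-wise update: each feature is inserted individually. Once the
        # bank is full, it overwrites a randomly chosen slot (assumed policy),
        # so older features are not evicted all at once frame by frame.
        for f in new_features:
            if len(self.features) < self.capacity:
                self.features.append(f)
            else:
                self.features[self.rng.randrange(self.capacity)] = f

    def enhance(self, current: List[Vector]) -> List[Vector]:
        # Each current feature attends to the sampled key set plus the current
        # frame; the attended result is added back as a residual.
        keys = self.key_set() + current
        return [[q + a for q, a in zip(query, attend(query, keys))] for query in current]


# Usage: four frames, two 2-d features per frame.
bank = MemoryBank(capacity=6, key_set_size=3)
for t in range(4):
    frame = [[float(t), 1.0], [1.0, float(t)]]
    enhanced = bank.enhance(frame)  # enhance with memory, then store the frame
    bank.update(frame)
print(len(bank.features))  # 6
```

Because the key set has a fixed size, one enhancement step costs roughly O(Q·(K + Q)·d) for Q current features, K sampled memory features, and feature dimension d, regardless of how many frames have been stored.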