
MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection (2401.09923v2)

Published 18 Jan 2024 in cs.CV

Abstract: State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are not efficient or sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, leading to a heavy computational cost; (2) frame-wise memory updating, preventing the memory from capturing more temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction which can significantly reduce the computational cost; (2) fine-grained feature-wise updating strategy which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNetVID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves mAP of 83.7/84.6% at 12.6/9.1 FPS with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.
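
The two memory-bank operations named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It only assumes, for illustration, that the key set is a small random sample of the bank and that feature-wise updating replaces individual stored features at random once the bank is full. The names `MemoryBank`, `key_set`, `update`, and `attention_enhance` are hypothetical, and the enhancement here is plain dot-product attention with a residual add, not the paper's GEO.

```python
import math
import random


def attention_enhance(queries, keys):
    """Add to each query a softmax-weighted sum of the keys (residual add).

    Plain scaled dot-product attention, used here as a stand-in for the
    enhancement step; the paper's GEO may differ.
    """
    enhanced = []
    for q in queries:
        scale = math.sqrt(len(q))
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        agg = [
            sum(w * k[d] for w, k in zip(weights, keys)) / total
            for d in range(len(q))
        ]
        enhanced.append([qi + ai for qi, ai in zip(q, agg)])
    return enhanced


class MemoryBank:
    """Toy memory bank: fixed capacity, sampled key set, per-feature updates."""

    def __init__(self, capacity, key_set_size, seed=0):
        self.capacity = capacity
        self.key_set_size = key_set_size
        self.features = []
        self.rng = random.Random(seed)

    def key_set(self):
        # ASSUMPTION: the key set is a random subset of the bank, so the
        # attention cost depends on key_set_size, not on the bank size.
        k = min(self.key_set_size, len(self.features))
        return self.rng.sample(self.features, k)

    def update(self, new_features):
        # ASSUMPTION: once full, each new feature overwrites one randomly
        # chosen slot, so features from much earlier frames can survive
        # instead of whole frames being evicted in order.
        for f in new_features:
            if len(self.features) < self.capacity:
                self.features.append(f)
            else:
                self.features[self.rng.randrange(self.capacity)] = f


def process_video(frames, bank):
    """Enhance each frame's features against the bank, then update the bank."""
    outputs = []
    for frame in frames:
        keys = bank.key_set()
        outputs.append(attention_enhance(frame, keys) if keys else frame)
        bank.update(frame)
    return outputs
```

In this sketch, the cost of enhancing a frame scales with `key_set_size` rather than with everything held in memory, which is the kind of saving the abstract attributes to light-weight key-set construction. The per-feature replacement contrasts with a sliding window or queue, where a whole frame's features are dropped together.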

