TAO-Amodal: A Benchmark for Tracking Any Object Amodally (2312.12433v3)

Published 19 Dec 2023 in cs.CV, cs.AI, and cs.LG

Abstract: Amodal perception, the ability to comprehend complete object structures from partial visibility, is a fundamental skill, even for infants. Its significance extends to applications like autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook this critical capability, perhaps due to the prevalence of \textit{modal} annotations in most benchmarks. To address the scarcity of amodal benchmarks, we introduce TAO-Amodal, featuring 833 diverse categories in thousands of video sequences. Our dataset includes \textit{amodal} and modal bounding boxes for visible and partially or fully occluded objects, including those that are partially out of the camera frame. We investigate the current lay of the land in both amodal tracking and detection by benchmarking state-of-the-art modal trackers and amodal segmentation methods. We find that existing methods, even when adapted for amodal tracking, struggle to detect and track objects under heavy occlusion. To mitigate this, we explore simple finetuning schemes that can increase the amodal tracking and detection metrics of occluded objects by 2.1\% and 3.3\%.


Summary

  • The paper presents a novel dataset, TAO-Amodal, which augments TAO with comprehensive amodal annotations covering occluded, out-of-frame, and fully invisible objects.
  • It introduces a lightweight, plug-in Amodal Expander module that transforms conventional trackers to predict the full extent of objects.
  • The method achieves a 3.3% boost in detection and a 1.6% improvement in tracking of occluded objects, along with a roughly 2x gain on pedestrian amodal tracking relative to modal baselines.

Introduction to Amodal Perception

Amodal perception is a fundamental skill that allows us to understand the complete structure of partially visible objects. This ability is critical in real-world applications such as autonomous driving, where a clear understanding of heavily occluded objects is essential. However, modern detection and tracking algorithms often overlook amodal perception, largely because datasets with amodal annotations are scarce.

The TAO-Amodal Dataset

To address the lack of amodal data, the paper introduces TAO-Amodal, a benchmark that augments the existing TAO dataset with amodal bounding box annotations. These annotations cover occluded, out-of-frame, and even fully invisible objects. With 833 diverse categories across thousands of video sequences, TAO-Amodal provides an extensive evaluation framework for assessing the occlusion-reasoning capabilities of current trackers.
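
As a rough illustration of the annotation format, the sketch below shows what a single TAO-Amodal-style entry might look like in a COCO/TAO-flavored JSON layout, pairing a modal (visible) box with an amodal box. The field names and values are assumptions for exposition, not the dataset's exact schema.

```python
# Hypothetical TAO-Amodal-style annotation entry (field names are assumptions,
# not the dataset's exact schema). Boxes are [x, y, width, height] in pixels.
annotation = {
    "video_id": 42,
    "image_id": 1371,                       # frame within the video
    "track_id": 7,                          # identity persists across frames
    "category_id": 805,                     # one of the LVIS-derived categories
    "bbox": [412.0, 188.0, 36.0, 52.0],     # modal box: visible region only
    "amodal_bbox": [398.0, 188.0, 64.0, 95.0],  # full extent, may leave the frame
    "visibility": 0.35,                     # fraction of the object that is visible
    "out_of_frame": False,                  # does the amodal box cross the image border?
}

def is_heavily_occluded(ann, threshold=0.5):
    """Flag objects whose visible fraction falls below a threshold."""
    return ann["visibility"] < threshold

print(is_heavily_occluded(annotation))  # True
```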

Improving Amodal Tracking

To tackle amodal tracking, the paper transforms conventional trackers into amodal trackers using a lightweight plug-in module called the Amodal Expander. The expander is fine-tuned on a few hundred video sequences with occlusion-simulating data augmentation (a simplified sketch of such an augmentation follows below), yielding a 3.3% and 1.6% improvement in the detection and tracking of occluded objects, respectively. The gains are most dramatic for people, where performance roughly doubles compared to state-of-the-art modal baselines.
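
The sketch below illustrates the general idea behind occlusion-simulating augmentation: paste an object crop over a frame so that an annotated object becomes partially hidden while its amodal target stays unchanged. This is a simplified stand-in for the paper's augmentation pipeline, not its exact implementation.

```python
import numpy as np

def paste_occluder(frame, occluder, top_left):
    """Paste an RGBA occluder crop onto a frame to synthesize an occlusion.

    The pasted crop hides part of an annotated object, but that object's
    amodal box target is left untouched, so the tracker learns to predict
    the full extent from partial evidence.
    """
    y, x = top_left
    h, w = occluder.shape[:2]
    region = frame[y:y + h, x:x + w]          # view into the frame
    alpha = occluder[..., 3:4] / 255.0        # 0 = transparent, 1 = opaque
    region[:] = (1 - alpha) * region + alpha * occluder[..., :3]
    return frame

# Usage: drop a gray 60x80 occluder over part of a (synthetic) frame.
frame = np.zeros((480, 640, 3), dtype=np.float32)
occluder = np.concatenate(
    [np.full((60, 80, 3), 200.0, dtype=np.float32),   # RGB
     np.full((60, 80, 1), 255.0, dtype=np.float32)],  # alpha (fully opaque)
    axis=-1,
)
frame = paste_occluder(frame, occluder, top_left=(200, 300))
```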

Amodal Expander Module

The Amodal Expander is a class-agnostic module that expands modal bounding box predictions to the full extent of objects, including occluded parts. It is lightweight and transfers easily across classes. Training proceeds by matching modal box predictions to the modal ground truth and then applying a regression loss between the expanded (amodal) predictions and the amodal ground truth. This module yields significant gains, particularly for detecting people in amodal tracking scenarios.
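
A minimal PyTorch-style sketch of what such a class-agnostic expander head could look like is given below. The architecture, feature dimensions, and box parameterization are assumptions for illustration and do not reproduce the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AmodalExpander(nn.Module):
    """Class-agnostic head that expands a modal box toward the amodal extent.

    Assumed design: pooled box features and the modal box are concatenated and
    mapped to 4 deltas (dx, dy, dw, dh) applied on top of the modal prediction.
    """

    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 4, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 4),
        )

    def forward(self, box_feats, modal_boxes):
        # modal_boxes: (N, 4) as (cx, cy, w, h); box_feats: (N, feat_dim)
        deltas = self.mlp(torch.cat([box_feats, modal_boxes], dim=1))
        cx = modal_boxes[:, 0] + deltas[:, 0] * modal_boxes[:, 2]
        cy = modal_boxes[:, 1] + deltas[:, 1] * modal_boxes[:, 3]
        w = modal_boxes[:, 2] * torch.exp(deltas[:, 2])
        h = modal_boxes[:, 3] * torch.exp(deltas[:, 3])
        return torch.stack([cx, cy, w, h], dim=1)  # amodal boxes, may exceed the frame

# Training sketch: match modal predictions to modal ground truth (matching code
# omitted), then regress the expanded boxes against the amodal ground truth.
expander = AmodalExpander()
box_feats = torch.randn(8, 256)
modal_boxes = torch.rand(8, 4) * 100
amodal_gt = modal_boxes * 1.2                  # placeholder targets
loss = nn.functional.l1_loss(expander(box_feats, modal_boxes), amodal_gt)
loss.backward()
```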

Conclusion

The TAO-Amodal dataset and the Amodal Expander represent crucial advancements in the field of object detection and tracking. They bring attention to amodal perception and provide a framework for developing algorithms that better understand occluded and out-of-frame objects. The research showcases that with the right dataset and methodologies, amodal perception can be significantly improved, which is vital for applications requiring an accurate interpretation of the visual world.
