360VOTS: Visual Object Tracking and Segmentation in Omnidirectional Videos (2404.13953v1)
Abstract: Visual object tracking and segmentation in omnidirectional videos are challenging due to the wide field-of-view and severe spherical distortion of 360° images. To alleviate these problems, we introduce a novel representation, the extended bounding field-of-view (eBFoV), for target localization, and use it as the foundation of a general 360 tracking framework applicable to both omnidirectional visual object tracking and segmentation. Building upon our previous work on omnidirectional visual object tracking (360VOT), we propose a comprehensive dataset and benchmark that incorporates a new component: omnidirectional video object segmentation (360VOS). The 360VOS dataset includes 290 sequences with dense pixel-wise masks and covers a broader range of target categories. To support both the development and evaluation of algorithms in this domain, we divide the dataset into a training subset of 170 sequences and a testing subset of 120 sequences. Furthermore, we tailor evaluation metrics for both omnidirectional tracking and segmentation to ensure rigorous assessment. Through extensive experiments, we benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset. Homepage: https://360vots.hkustvgd.com/
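To make the bounding field-of-view idea concrete, below is a minimal Python sketch of a BFoV-style target representation on an equirectangular frame: the target is parameterized by angular quantities (center longitude/latitude, horizontal/vertical FoV, rotation) rather than by an axis-aligned pixel box, so its extent stays meaningful at any latitude despite spherical distortion. The class, function names, and mapping convention here are illustrative assumptions, not the authors' released code; the paper's eBFoV further extends this parameterization (e.g., to very large targets), for which see the paper itself.

```python
import numpy as np

def pixel_to_sphere(x, y, width, height):
    """Map equirectangular pixel coordinates to (longitude, latitude) in radians.

    Convention (an assumption): longitude spans [-pi, pi), latitude spans
    [-pi/2, pi/2], with the image center at (0, 0) on the sphere.
    """
    lon = (x / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - y / height) * np.pi
    return lon, lat

def sphere_to_pixel(lon, lat, width, height):
    """Inverse mapping: (longitude, latitude) in radians to pixel coordinates."""
    x = (lon / (2.0 * np.pi) + 0.5) * width
    y = (0.5 - lat / np.pi) * height
    return x, y

class BFoV:
    """Target as (center_lon, center_lat, fov_h, fov_v, rotation), all in radians.

    Unlike a pixel-space box, which inflates near the poles of an
    equirectangular image, the angular extents (fov_h, fov_v) describe the
    same solid angle regardless of where the target sits on the sphere.
    """
    def __init__(self, center_lon, center_lat, fov_h, fov_v, rotation=0.0):
        self.center_lon = center_lon
        self.center_lat = center_lat
        self.fov_h = fov_h
        self.fov_v = fov_v
        self.rotation = rotation

# Usage on a 3840x1920 equirectangular frame: a target centered at
# pixel (960, 480) with a 40 deg x 30 deg angular extent.
lon, lat = pixel_to_sphere(960, 480, 3840, 1920)
target = BFoV(lon, lat, np.deg2rad(40), np.deg2rad(30))
print(np.rad2deg([target.center_lon, target.center_lat]))  # [-90.  45.]
```

The design point this sketch isolates is why an angular representation suits 360° video: localization and overlap can be computed on the sphere, where a target crossing the image border or sitting near a pole is no harder than one at the equator.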
Authors: Yinzhe Xu, Huajian Huang, Yingshu Chen, Sai-Kit Yeung