
MED-VT++: Unifying Multimodal Learning with a Multiscale Encoder-Decoder Video Transformer (2304.05930v3)

Published 12 Apr 2023 in cs.CV

Abstract: In this paper, we present an end-to-end trainable, unified multiscale encoder-decoder transformer focused on dense prediction tasks in video. The presented Multiscale Encoder-Decoder Video Transformer (MED-VT) uses multiscale representation throughout and, when available, incorporates an optional input beyond video (e.g., audio) for multimodal processing (MED-VT++). Multiscale representation at both encoder and decoder yields three key benefits: (i) implicit extraction of spatiotemporal features at different levels of abstraction for capturing dynamics without reliance on input optical flow, (ii) temporal consistency at encoding, and (iii) coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we present a transductive learning scheme through many-to-many label propagation to provide temporally consistent video predictions. We showcase MED-VT/MED-VT++ on three unimodal video segmentation tasks (Automatic Video Object Segmentation (AVOS), actor-action segmentation, and Video Semantic Segmentation (VSS)) as well as a multimodal segmentation task (Audio-Visual Segmentation (AVS)). Results show that the proposed architecture outperforms alternative state-of-the-art approaches on multiple benchmarks using only video (and optional audio) as input, without reliance on optical flow. Finally, to document the model's internal learned representations, we present a detailed interpretability study encompassing both quantitative and qualitative analyses.
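
To make the two main ideas in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of (a) cross-scale attention, where fine-scale tokens query coarse-scale tokens, in the spirit of a multiscale encoder-decoder, and (b) a graph-style many-to-many label propagation over all frame tokens of a clip. The module names, tensor shapes, and the specific affinity function are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; NOT the MED-VT/MED-VT++ reference code.
# (a) cross-scale attention between feature levels of a video backbone
# (b) many-to-many label propagation across all tokens in a clip

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleAttention(nn.Module):
    """Fine-scale tokens attend to coarse-scale tokens (hypothetical module)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine_tokens: torch.Tensor, coarse_tokens: torch.Tensor) -> torch.Tensor:
        # fine_tokens: (B, N_fine, C), coarse_tokens: (B, N_coarse, C)
        out, _ = self.attn(fine_tokens, coarse_tokens, coarse_tokens)
        return self.norm(fine_tokens + out)


def propagate_labels(feats: torch.Tensor, logits: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Many-to-many label propagation over clip tokens (illustrative assumption).

    feats:  (T*N, C) token features collected over the whole clip
    logits: (T*N, K) per-token class scores before propagation
    Returns smoothed (T*N, K) scores: each token mixes in the scores of
    feature-similar tokens from any frame in the clip.
    """
    feats = F.normalize(feats, dim=-1)
    affinity = torch.softmax(feats @ feats.t() / temperature, dim=-1)  # row-stochastic
    return affinity @ logits


if __name__ == "__main__":
    B, C = 2, 256
    fine = torch.randn(B, 64 * 64, C)    # e.g., stride-8 features of one frame
    coarse = torch.randn(B, 16 * 16, C)  # e.g., stride-32 features of one frame
    fused = CrossScaleAttention(C)(fine, coarse)
    print(fused.shape)  # torch.Size([2, 4096, 256])

    tokens = torch.randn(4 * 32, C)      # 4 frames x 32 tokens
    scores = torch.randn(4 * 32, 3)      # 3 classes
    print(propagate_labels(tokens, scores).shape)  # torch.Size([128, 3])
```

The propagation step mirrors the transductive intuition described in the abstract: each token's prediction is encouraged to agree with predictions of feature-similar tokens anywhere in the clip, which is what yields temporally consistent outputs.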
