TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes (2403.18238v1)
Abstract: As drone technology advances, unmanned aerial vehicles have become the dominant platform for aerial surveys in modern low-altitude remote sensing. The resulting surge in aerial video data demands accurate prediction of future scenes and of the motion states of targets of interest, particularly in applications such as traffic management and disaster response. Existing video prediction methods focus solely on predicting future scenes (video frames) and neglect to explicitly model the target's motion states, which is crucial for aerial video interpretation. To address this issue, we introduce a novel task, Target-Aware Aerial Video Prediction, which aims to simultaneously predict future scenes and the motion states of the target. We further design a model for this task, named TAFormer, which provides a unified modeling approach for both video and target motion states. Specifically, we introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling scene appearance and motion. We also design an Information Sharing Mechanism (ISM), which elegantly unifies the modeling of video and target motion by exchanging information through two sets of messenger tokens. Moreover, to ease the difficulty of distinguishing targets in blurry predictions, we introduce a Target-Sensitive Gaussian Loss (TSGL) that sharpens the model's sensitivity to both the target's position and its content. Extensive experiments on UAV123VP and VisDroneVP (derived from single-object tracking datasets) demonstrate the exceptional performance of TAFormer on target-aware video prediction and its adaptability to the additional target-awareness requirements of aerial video interpretation.
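The abstract describes STA only at a high level, so the following is a minimal PyTorch sketch of one plausible reading of the decoupling: spatial static attention mixes the per-frame tokens within each frame, while temporal dynamic attention mixes each spatial location's tokens across time. The `(B, T, N, C)` token layout, the serial spatial-then-temporal order, and all names here are assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn

class SpatiotemporalAttention(nn.Module):
    """Toy decoupled attention over a (B, T, N, C) token grid.

    Spatial static attention attends across the N tokens of each frame;
    temporal dynamic attention attends across the T steps of each token.
    """
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, T, N, C)
        b, t, n, c = x.shape
        # Spatial static attention within each frame.
        xs = x.reshape(b * t, n, c)
        xs = xs + self.spatial(xs, xs, xs)[0]
        # Temporal dynamic attention along each token's time axis.
        xt = xs.reshape(b, t, n, c).permute(0, 2, 1, 3).reshape(b * n, t, c)
        xt = xt + self.temporal(xt, xt, xt)[0]
        return xt.reshape(b, n, t, c).permute(0, 2, 1, 3)  # back to (B, T, N, C)
```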
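For the ISM, the abstract only states that two sets of messenger tokens mediate the interaction between the video branch and the motion branch. Below is a hedged sketch of one way such an exchange could work: learned messenger tokens first summarize their own branch via cross-attention, then the opposite branch attends to them. The class name `MessengerExchange`, the single-layer design, and the shared attention modules are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MessengerExchange(nn.Module):
    """Toy information-sharing step between a video branch and a motion branch."""
    def __init__(self, dim=128, heads=4, n_msg=4):
        super().__init__()
        self.msg_v = nn.Parameter(torch.randn(1, n_msg, dim))  # video-side messengers
        self.msg_m = nn.Parameter(torch.randn(1, n_msg, dim))  # motion-side messengers
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.scatter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vid_tokens, mot_tokens):  # (B, Nv, C), (B, Nm, C)
        b = vid_tokens.size(0)
        # 1) Messengers read from (summarize) their own branch.
        mv, _ = self.gather(self.msg_v.expand(b, -1, -1), vid_tokens, vid_tokens)
        mm, _ = self.gather(self.msg_m.expand(b, -1, -1), mot_tokens, mot_tokens)
        # 2) Each branch attends to the *other* branch's messengers.
        vid_out, _ = self.scatter(vid_tokens, mm, mm)
        mot_out, _ = self.scatter(mot_tokens, mv, mv)
        return vid_tokens + vid_out, mot_tokens + mot_out
```

In this reading, `vid_tokens` would be flattened spatiotemporal patch embeddings and `mot_tokens` embedded past bounding-box states; both return enriched with cross-branch context while keeping the two streams otherwise separate.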
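Likewise, the TSGL is only named, not specified. A minimal sketch of one plausible formulation is given below: a reconstruction loss whose per-pixel weights follow a 2-D Gaussian centered on the target's bounding box, so prediction errors near the target are penalized more heavily than background errors. The function names, the `sigma_scale` parameter, and the `1 + Gaussian` weighting are assumptions, not the authors' formula.

```python
import torch

def gaussian_weight_map(h, w, box, sigma_scale=0.5, device="cpu"):
    """(h, w) weight map: 2-D Gaussian centered on the target box.

    box = (cx, cy, bw, bh) in pixels; sigma is tied to the box size.
    """
    ys = torch.arange(h, device=device).float().unsqueeze(1)  # (h, 1)
    xs = torch.arange(w, device=device).float().unsqueeze(0)  # (1, w)
    cx, cy, bw, bh = box
    sx, sy = sigma_scale * bw, sigma_scale * bh
    g = torch.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) +
                    ((ys - cy) ** 2) / (2 * sy ** 2)))
    return 1.0 + g  # background keeps weight 1; target region is emphasized

def target_sensitive_gaussian_loss(pred, gt, boxes, sigma_scale=0.5):
    """Gaussian-weighted L2 between predicted and ground-truth frames.

    pred, gt: (B, T, C, H, W); boxes[i][j] = (cx, cy, bw, bh) per frame.
    """
    b, t, c, h, w = pred.shape
    loss = 0.0
    for i in range(b):
        for j in range(t):
            wmap = gaussian_weight_map(h, w, boxes[i][j],
                                       sigma_scale, pred.device)
            loss = loss + (wmap * (pred[i, j] - gt[i, j]) ** 2).mean()
    return loss / (b * t)
```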