A Solution to CVPR'2023 AQTC Challenge: Video Alignment for Multi-Step Inference (2306.14412v1)
Abstract: Affordance-centric Question-driven Task Completion (AQTC) for Egocentric Assistant introduces a groundbreaking scenario. In this scenario, through learning instructional videos, AI assistants provide users with step-by-step guidance on operating devices. In this paper, we present a solution for enhancing video alignment to improve multi-step inference. Specifically, we first utilize VideoCLIP to generate video-script alignment features. Afterwards, we ground the question-relevant content in instructional videos. Then, we reweight the multimodal context to emphasize prominent features. Finally, we adopt GRU to conduct multi-step inference. Through comprehensive experiments, we demonstrate the effectiveness and superiority of our method, which secured the 2nd place in CVPR'2023 AQTC challenge. Our code is available at https://github.com/zcfinal/LOVEU-CVPR23-AQTC.
- Assistq: Affordance-centric question-driven task completion for egocentric assistant. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVI, pages 485–501. Springer, 2022.
- An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
- Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, 2015.
- Winning the cvpr’2022 aqtc challenge: A two-stage function-centric approach. arXiv preprint arXiv:2206.09597, 2022.
- Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019.
- Multimodal machine learning: A survey and taxonomy. IEEE transactions on pattern analysis and machine intelligence, 41(2):423–443, 2018.
- Videoclip: Contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, 2021.
- Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
- Bridging the gap between training and inference for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4334–4343, 2019.
- End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In CVPR, 2020.
- Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 29–48. Citeseer, 2003.
- Adam: A method for stochastic optimization. In ICLR, 2015.