iRAG: Advancing RAG for Videos with an Incremental Approach (2404.12309v2)
Abstract: Retrieval-augmented generation (RAG) systems combine the strengths of language generation and information retrieval, and they power many real-world applications such as chatbots. Using RAG for video understanding is appealing, but it faces two critical limitations. First, a one-time, upfront conversion of all content in a large corpus of videos into text descriptions entails high processing times. Second, not all of the information in the rich video data is captured in the text descriptions. Since user queries are not known a priori, developing a system for video-to-text conversion and interactive querying of video data is challenging. To address these limitations, we propose iRAG, an incremental RAG system that augments RAG with a novel incremental workflow to enable interactive querying of a large corpus of videos. Unlike traditional RAG, iRAG quickly indexes large repositories of videos, and in the incremental workflow it uses the index to opportunistically extract more details from selected portions of the videos and retrieve context relevant to an interactive user query. This incremental workflow avoids long video-to-text conversion times and overcomes the information loss inherent in converting video to text by performing on-demand, query-specific extraction of details from the video data, ensuring high-quality responses to interactive user queries that are often not known a priori. To the best of our knowledge, iRAG is the first system to augment RAG with an incremental workflow that supports efficient interactive querying of a large corpus of videos. Experimental results on real-world datasets demonstrate 23x to 25x faster video-to-text ingestion, while the latency and quality of responses to interactive user queries remain comparable to those of a traditional RAG system in which all video data is converted to text upfront, before any user querying.
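The incremental workflow described in the abstract (cheap coarse indexing at ingest time, followed by on-demand, query-specific extraction of details from only the retrieved segments) might look roughly like the sketch below. This is a minimal illustrative sketch, not the authors' implementation: `VideoSegment`, `coarse_caption`, `detailed_extraction`, and the word-overlap retriever are hypothetical stand-ins for the captioning models, vector index, and LLM a real deployment would use.

```python
from dataclasses import dataclass

@dataclass
class VideoSegment:
    video_id: str
    start_s: float
    end_s: float
    coarse_text: str = ""    # cheap, upfront description produced at ingest time
    detailed_text: str = ""  # filled in lazily, only when a query selects this segment

def coarse_caption(seg: VideoSegment) -> str:
    # Placeholder for a lightweight upfront pass (e.g. captions on sparsely sampled frames).
    return f"coarse caption for {seg.video_id} {seg.start_s:.0f}-{seg.end_s:.0f}s"

def detailed_extraction(seg: VideoSegment, query: str) -> str:
    # Placeholder for heavier, query-specific models (detectors, dense captioners, etc.)
    # run only on the segments the retriever selected for this query.
    return f"detailed, query-specific description of {seg.video_id} for '{query}'"

def word_overlap(text: str, query: str) -> int:
    # Toy relevance score; a real system would use a vector index over embeddings.
    return len(set(text.lower().split()) & set(query.lower().split()))

def ingest(segments: list[VideoSegment]) -> None:
    """Phase 1: fast ingestion -- index every segment with cheap coarse text."""
    for seg in segments:
        seg.coarse_text = coarse_caption(seg)

def answer(segments: list[VideoSegment], query: str, k: int = 3) -> str:
    """Phase 2: retrieve with the coarse index, then extract details on demand."""
    hits = sorted(segments, key=lambda s: -word_overlap(s.coarse_text, query))[:k]
    for seg in hits:
        if not seg.detailed_text:  # pay the heavy extraction cost at most once per segment
            seg.detailed_text = detailed_extraction(seg, query)
    context = "\n".join(seg.detailed_text for seg in hits)
    return f"(LLM answer to '{query}' conditioned on)\n{context}"
```

The design point this sketch tries to capture is that ingestion pays only for the coarse pass, which is what enables the much faster video-to-text ingestion reported in the abstract, while the heavier query-specific extraction is bounded by the small number of retrieved segments per query.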
Authors: Md Adnan Arefeen, Biplob Debnath, Md Yusuf Sarwar Uddin, Srimat Chakradhar