Write What You Want: Applying Text-to-video Retrieval to Audiovisual Archives (2310.05825v1)

Published 9 Oct 2023 in cs.MM and cs.HC

Abstract: Audiovisual (AV) archives, an essential reservoir of our cultural assets, suffer from a problem of accessibility. The complex nature of the medium makes processing and interaction an open challenge in computer vision, multimodal learning, and human-computer interaction, as well as in culture and heritage. In recent years, with the rise of video retrieval tasks, methods for retrieving video content with natural language (text-to-video retrieval) have gained considerable attention and reached a performance level where real-world application is on the horizon. Appealing as this may sound, such methods focus on retrieving videos using plain, visually focused descriptions of what happens in the video, and on finding videos such as instructions. It is too early to say whether such methods will become the new paradigm for accessing and encoding complex video content as high-dimensional data, but they are innovative attempts and foundations on which to build future exploratory interfaces for AV archives (e.g. allowing users to write stories and retrieve related snippets from the archive, or encoding video content at a high level for visualisation). This work fills the application gap by examining text-to-video retrieval methods from an implementation point of view, and proposes and verifies a classifier-enhanced workflow that yields better results on in-situ queries that may differ from the training dataset. The workflow is then applied to the real-world archive of Télévision Suisse Romande (RTS) to create a demo. Finally, a human-centred evaluation is conducted to understand whether text-to-video retrieval methods improve the overall experience of accessing AV archives.
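The core retrieval step the abstract describes — matching a free-text query against encoded video content in a shared embedding space — can be sketched minimally as follows. This is an illustrative assumption, not the paper's implementation: it presumes clip embeddings have already been produced by a CLIP-like text-video encoder, and all names and the toy vectors are hypothetical.

```python
import numpy as np

def normalize(v):
    # scale vectors to unit length so a dot product equals cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def retrieve(query_emb, video_embs, top_k=3):
    # cosine similarity between the text query and each video clip embedding
    sims = normalize(video_embs) @ normalize(query_emb)
    order = np.argsort(-sims)[:top_k]  # indices of the best-matching clips
    return order, sims[order]

# toy example: 4 clips in a 3-d embedding space (real CLIP-like
# embeddings would be 512-d or larger)
clips = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0]])
query = np.array([0.1, 0.9, 0.0])  # query embedding closest to clip 1
idx, scores = retrieve(query, clips)
print(idx.tolist())  # → [1, 3, 0]
```

At archive scale, the brute-force dot product above would be replaced by an approximate nearest-neighbour index (e.g. product quantization or locality-sensitive hashing), and the paper's classifier-enhanced workflow would additionally filter or route in-situ queries before this matching step.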

Authors (1)
  1. Yuchen Yang
Citations (5)
