Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives (2310.05835v1)

Published 9 Oct 2023 in cs.MM and cs.HC

Abstract: Audiovisual (AV) archives are invaluable for holistically preserving the past. Unlike other kinds of archives, however, they are difficult to explore, not only because of their complex modality and sheer volume but also because of the lack of appropriate interfaces beyond keyword search. The recent rise of text-to-video retrieval in computer science opens the gate to accessing AV content more naturally and semantically, by mapping natural-language descriptive sentences to matching videos. Applications of such models, however, are rarely seen in practice. The contribution of this work is threefold. First, working with RTS (Télévision Suisse Romande), we identified the key blockers to implementing such models in a real archive and built a functioning pipeline for encoding raw archive videos into text-to-video feature vectors. Second, we designed and verified a method to encode and retrieve videos using emotionally rich descriptions not supported by the original model. Third, we proposed an initial prototype for immersive and interactive exploration of AV archives in a latent space built on the aforementioned video encodings.
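The retrieval pipeline the abstract describes maps text queries and videos into a shared embedding space, where matching videos are found by vector similarity. A minimal sketch of that nearest-neighbour step is shown below; the function name, toy vectors, and cosine-similarity choice are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def top_k_videos(text_vec, video_vecs, k=3):
    """Rank videos by cosine similarity to a text query in a shared latent space.

    text_vec:   1-D query embedding of dimension d
    video_vecs: 2-D array of shape (n_videos, d) with one embedding per video
    Returns (indices of the k best matches, their similarity scores), best first.
    """
    # Normalise so the dot product equals cosine similarity.
    text_vec = text_vec / np.linalg.norm(text_vec)
    video_vecs = video_vecs / np.linalg.norm(video_vecs, axis=1, keepdims=True)
    scores = video_vecs @ text_vec
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy example: four random "video" embeddings; the query is a slightly
# perturbed copy of video 2, so video 2 should rank first.
rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 8))
query = videos[2] + 0.05 * rng.normal(size=8)
idx, sims = top_k_videos(query, videos, k=2)
```

In a real system the embeddings would come from a text-to-video model and the archive side would be precomputed once, with only the query encoded at search time.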

Authors (2)
  1. Yuchen Yang (60 papers)
  2. Linyida Zhang (1 paper)
Citations (2)
