
VCR: Video representation for Contextual Retrieval

Published 12 Feb 2024 in cs.IR and cs.MM | arXiv:2402.07466v1

Abstract: Streamlining content discovery within media archives requires integrating advanced data representations and effective visualization techniques for clear communication of video topics to users. The proposed system addresses the challenge of efficiently navigating large video collections by exploiting a fusion of visual, audio, and textual features to accurately index and categorize video content through a text-based method. Additionally, semantic embeddings are employed to provide contextually relevant information and recommendations to users, resulting in an intuitive and engaging exploratory experience over our topics ontology map using OpenAI GPT-4.
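The abstract describes indexing videos by converting visual, audio, and textual signals into text, then embedding that text so queries can be matched semantically. The sketch below illustrates one plausible reading of that pipeline; it is not the authors' implementation. The embedding model ("all-MiniLM-L6-v2"), the concatenation-based fusion step, and the toy data are assumptions introduced only for illustration.

```python
# Minimal sketch of text-based multimodal indexing and semantic retrieval.
# Assumptions (not from the paper): the sentence-embedding model, the
# fusion-by-concatenation step, and the example videos below.
import numpy as np
from sentence_transformers import SentenceTransformer

# Each video is assumed to already have text derived from its modalities
# (e.g. ASR transcript, captions of key frames, detected on-screen text).
videos = {
    "vid_001": {
        "transcript": "The speaker discusses climate policy and carbon taxes.",
        "visual": "podium, audience, conference hall",
        "ocr": "UN Climate Summit 2023",
    },
    "vid_002": {
        "transcript": "A tutorial on training transformer language models.",
        "visual": "slides, code editor, architecture diagrams",
        "ocr": "Attention Is All You Need",
    },
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def fuse(fields: dict) -> str:
    """Concatenate per-modality text into a single document for indexing."""
    return " ".join(fields.values())

ids = list(videos)
docs = [fuse(videos[v]) for v in ids]
# Normalized embeddings let a dot product act as cosine similarity.
index = model.encode(docs, normalize_embeddings=True)

def search(query: str, k: int = 2):
    """Rank videos by semantic similarity between the query and fused text."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]

print(search("videos about environmental regulation"))
```

In the system described by the abstract, the retrieved results would additionally be organized over the topics ontology map and accompanied by contextual recommendations generated with OpenAI GPT-4; the sketch above covers only the indexing and retrieval step.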

