VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression (2303.08906v2)
Abstract: In content-based video retrieval (CBVR) over large-scale collections, efficiency is as important as accuracy, so video-level feature-based approaches have been actively studied. However, because embedding a lengthy, untrimmed video into a single feature is severely difficult, these approaches have been less accurate than frame-level feature-based approaches. In this paper, we show that appropriately suppressing irrelevant frames offers insight into the current obstacles facing video-level approaches. Furthermore, we propose a Video-to-Video Suppression network (VVS) as a solution. VVS is an end-to-end framework consisting of an easy distractor elimination stage, which identifies the frames to remove, and a suppression weight generation stage, which determines how strongly to suppress the remaining frames. This structure is intended to describe effectively an untrimmed video that contains varying content and meaningless information. Extensive experiments demonstrate its efficacy: our approach is state-of-the-art among video-level approaches, and it offers fast inference while achieving retrieval performance close to that of frame-level approaches. Code is available at https://github.com/sejong-rcv/VVS
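The two-stage idea in the abstract (first remove some frames, then down-weight the rest before forming a single video-level feature) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weighted-mean pooling, and the use of a dot product on unit-length features are all assumptions made for illustration, and the mask and weights are passed in by hand rather than predicted by a network.

```python
import math
from typing import List, Sequence


def suppress_and_aggregate(
    frames: Sequence[Sequence[float]],
    keep_mask: Sequence[bool],
    weights: Sequence[float],
) -> List[float]:
    """Collapse frame-level features into one L2-normalized video-level feature.

    keep_mask stands in for an elimination stage (False = frame removed).
    weights stands in for a suppression stage (values assumed in [0, 1]).
    Both are hypothetical plain inputs here, not network outputs.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    if not (len(frames) == len(keep_mask) == len(weights)):
        raise ValueError("frames, keep_mask and weights must have equal length")

    dim = len(frames[0])
    pooled = [0.0] * dim
    total = 0.0
    for feat, keep, w in zip(frames, keep_mask, weights):
        if not keep:
            continue  # eliminated frames contribute nothing
        for i in range(dim):
            pooled[i] += w * feat[i]
        total += w

    if total == 0.0:
        return pooled  # every frame was removed or fully suppressed

    pooled = [v / total for v in pooled]  # weighted mean over kept frames
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0.0 else pooled


def video_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


# Toy example: the middle frame is treated as a distractor and removed,
# and the last frame is partially suppressed.
database_video = suppress_and_aggregate(
    frames=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    keep_mask=[True, False, True],
    weights=[1.0, 0.9, 0.5],
)
query_video = suppress_and_aggregate(
    frames=[[1.0, 0.0], [1.0, 0.0]],
    keep_mask=[True, True],
    weights=[1.0, 1.0],
)
score = video_similarity(query_video, database_video)
```

Because each video is reduced to one vector, comparing two videos costs a single dot product regardless of their length, which is the efficiency argument the abstract makes for video-level approaches.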