
TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection (2401.02309v2)

Published 4 Jan 2024 in cs.CV and cs.MM

Abstract: Video moment retrieval (MR) and highlight detection (HD) based on natural language queries are two highly related tasks that aim to obtain relevant moments within videos and highlight scores for each video clip. Recently, several methods have been devoted to building DETR-based networks that solve MR and HD jointly. These methods simply add two separate task heads after multi-modal feature extraction and feature interaction, achieving good performance. Nevertheless, they underutilize the reciprocal relationship between the two tasks. In this paper, we propose a task-reciprocal transformer based on DETR (TR-DETR) that focuses on exploring the inherent reciprocity between MR and HD. Specifically, a local-global multi-modal alignment module is first built to align features from diverse modalities into a shared latent space. Subsequently, a visual feature refinement module is designed to eliminate query-irrelevant information from visual features for modal interaction. Finally, a task cooperation module is constructed to refine the retrieval pipeline and the highlight score prediction process by utilizing the reciprocity between MR and HD. Comprehensive experiments on the QVHighlights, Charades-STA and TVSum datasets demonstrate that TR-DETR outperforms existing state-of-the-art methods. Code is available at \url{https://github.com/mingyao1120/TR-DETR}.
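The three-stage pipeline described in the abstract (alignment into a shared space, query-guided visual refinement, and task cooperation between MR and HD) can be illustrated with a minimal numpy sketch. This is a hypothetical toy, not the paper's implementation: the projection matrices, the sigmoid relevance gate, and the fixed-width "retrieved moment" are simplifying assumptions introduced here purely to show the direction of information flow between the two tasks.

```python
import numpy as np

def align(video_feats, text_feats, W_v, W_t):
    """Local-global multi-modal alignment, simplified: project both
    modalities into a shared latent space with learned matrices."""
    return video_feats @ W_v, text_feats @ W_t

def refine_visual(video_lat, text_lat):
    """Visual feature refinement, simplified: suppress query-irrelevant
    clips by gating each clip with its similarity to a global query vector."""
    query = text_lat.mean(axis=0)              # sentence-level query feature
    sim = video_lat @ query                    # per-clip relevance to the query
    gate = 1.0 / (1.0 + np.exp(-sim))          # sigmoid gate in (0, 1)
    return video_lat * gate[:, None], gate

def task_cooperation(refined, gate):
    """Task cooperation, simplified in both directions:
    HD -> MR: highlight gates already re-weight clips before moment scoring;
    MR -> HD: the retrieved moment's span boosts highlight scores inside it."""
    clip_scores = refined @ refined.mean(axis=0)   # toy moment-relevance score
    center = int(np.argmax(clip_scores))           # crude retrieved-moment center
    span = range(max(0, center - 1), min(len(gate), center + 2))
    highlight = gate.copy()
    for i in span:
        highlight[i] = min(1.0, highlight[i] + 0.2)  # moment reinforces highlights
    return (center,), highlight
```

Running the sketch on random features shows the reciprocity: clips gated down by the query-relevance step score lower for moment retrieval, while clips inside the retrieved span receive raised highlight scores.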

Authors (4)
  1. Hao Sun
  2. Mingyao Zhou
  3. Wenjing Chen
  4. Wei Xie
Citations (21)
