
Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection (2404.09263v1)

Published 14 Apr 2024 in cs.CV and cs.AI

Abstract: Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, but only recently have they been studied jointly. Although existing studies have recently made impressive advances, they predominantly follow a data-driven, bottom-up paradigm. Such a paradigm overlooks task-specific and inter-task effects, resulting in poor model performance. In this paper, we propose TaskWeave, a novel task-driven top-down framework for joint moment retrieval and highlight detection. The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks, we propose an inter-task feedback mechanism, which transforms the results of one task into guiding masks that assist the other task. Different from existing methods, we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on the QVHighlights, TVSum, and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Code is available at https://github.com/EdenGabriel/TaskWeave.
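
The abstract only outlines the inter-task feedback mechanism and the task-dependent joint loss, so the following is a minimal PyTorch-style sketch of one plausible reading: highlight-detection scores are turned into a soft guiding mask that modulates the features fed to the moment-retrieval head, and the two task losses are combined with task-dependent weights. All names here (`InterTaskFeedback`, `hd_head`, `lambda_mr`, and so on) are illustrative assumptions, not the authors' actual implementation; consult the linked repository for the real code.

```python
# Sketch (assumption): inter-task feedback where highlight scores gate the
# clip features used for moment retrieval. Not the authors' implementation.
import torch
import torch.nn as nn

class InterTaskFeedback(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.hd_head = nn.Linear(dim, 1)   # per-clip highlight (saliency) score
        self.mr_head = nn.Linear(dim, 2)   # per-clip (center, width) regression
        self.proj = nn.Linear(dim, dim)    # projection applied to masked features

    def forward(self, clip_feats: torch.Tensor):
        # clip_feats: (batch, num_clips, dim) fused video-text features
        hd_scores = self.hd_head(clip_feats).squeeze(-1)       # (B, T)
        # One task's results become a soft guiding mask in [0, 1] ...
        mask = torch.sigmoid(hd_scores).unsqueeze(-1)          # (B, T, 1)
        # ... which assists the other task via a residual modulation.
        guided = clip_feats + self.proj(clip_feats) * mask     # (B, T, D)
        mr_out = self.mr_head(guided)                          # (B, T, 2)
        return hd_scores, mr_out

def joint_loss(loss_mr, loss_hd, lambda_mr=1.0, lambda_hd=1.0):
    # Task-dependent weighting of the two losses; the weights and their
    # selection strategy are assumptions for illustration only.
    return lambda_mr * loss_mr + lambda_hd * loss_hd
```

The feedback direction could equally run the other way (moment-retrieval results masking the highlight branch); the paper's framework is described as flexible on this point.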

Authors (4)
  1. Jin Yang (73 papers)
  2. Ping Wei (26 papers)
  3. Huan Li (102 papers)
  4. Ziyang Ren (3 papers)
Citations (4)
