Exploring Iterative Refinement with Diffusion Models for Video Grounding (2310.17189v2)

Published 26 Oct 2023 in cs.CV

Abstract: Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span in a single-shot manner, resulting in the absence of a systematic prediction refinement process. In this paper, we propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task, where the target span is generated from Gaussian noise inputs and iteratively refined in the reverse diffusion process. During training, DiffusionVG progressively adds noise to the target span with a fixed forward diffusion process and learns to recover the target span in the reverse diffusion process. During inference, DiffusionVG can generate the target span from Gaussian noise inputs by the learned reverse diffusion process, conditioned on the video-sentence representations. Without bells and whistles, our DiffusionVG demonstrates superior performance compared to existing well-crafted models on the mainstream Charades-STA, ActivityNet Captions and TACoS benchmarks.
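
The abstract describes a standard denoising-diffusion recipe applied to a 1D temporal span: a fixed forward process noises the ground-truth span during training, and a learned reverse process iteratively refines a span sampled from Gaussian noise at inference, conditioned on fused video-sentence features. The sketch below illustrates that idea only; it is not the authors' code. The SpanDenoiser module, the (center, width) parameterization, the linear noise schedule, the L1 span-recovery loss, and the DDIM-style deterministic sampler are all illustrative assumptions.

```python
# Minimal sketch of diffusion-based span refinement for video grounding.
# Not the paper's implementation; architecture, schedule, and loss are assumed.

import torch
import torch.nn as nn

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # fixed linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class SpanDenoiser(nn.Module):
    """Predicts a clean (center, width) span from its noisy version,
    conditioned on fused video-sentence features (hypothetical module)."""
    def __init__(self, cond_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, noisy_span, t, cond):
        # noisy_span: (B, 2) normalized (center, width); t: (B,); cond: (B, cond_dim)
        t_emb = (t.float() / T).unsqueeze(-1)          # crude timestep embedding
        return self.net(torch.cat([noisy_span, t_emb, cond], dim=-1))

def q_sample(span0, t, noise):
    """Fixed forward process: noise the clean span at step t."""
    a = alphas_cumprod[t].unsqueeze(-1)
    return a.sqrt() * span0 + (1.0 - a).sqrt() * noise

def training_step(model, span0, cond):
    """One training step: noise the ground-truth span, learn to recover it."""
    t = torch.randint(0, T, (span0.size(0),))
    noise = torch.randn_like(span0)
    span_t = q_sample(span0, t, noise)
    pred_span0 = model(span_t, t, cond)
    return nn.functional.l1_loss(pred_span0, span0)   # span-recovery loss (assumed)

@torch.no_grad()
def sample(model, cond, steps=50):
    """Inference: start from Gaussian noise and iteratively refine the span
    with deterministic DDIM-style reverse steps (assumed sampler)."""
    span = torch.randn(cond.size(0), 2)
    ts = torch.linspace(T - 1, 0, steps).long()
    for i, t in enumerate(ts):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[ts[i + 1]] if i + 1 < steps else torch.tensor(1.0)
        pred_span0 = model(span, t.repeat(cond.size(0)), cond)
        # Recover the implied noise from the predicted clean span, then step back.
        eps = (span - a_t.sqrt() * pred_span0) / (1.0 - a_t).sqrt()
        span = a_prev.sqrt() * pred_span0 + (1.0 - a_prev).sqrt() * eps
    return pred_span0.clamp(0.0, 1.0)                 # normalized (center, width)
```

Because sampling starts from pure noise, the same trained model can trade accuracy for speed by changing the number of reverse steps, which is the "iterative refinement" the title refers to.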

Authors (5)
  1. Xiao Liang (132 papers)
  2. Tao Shi (73 papers)
  3. Yaoyuan Liang (5 papers)
  4. Te Tao (1 paper)
  5. Shao-Lun Huang (48 papers)
Citations (1)