Multi-Sentence Grounding for Long-term Instructional Video (2312.14055v2)

Published 21 Dec 2023 in cs.CV

Abstract: In this paper, we aim to establish an automatic, scalable pipeline for denoising large-scale instructional datasets, and to construct a high-quality video-text dataset with multiple descriptive-step supervision, named HowToStep. We make the following contributions: (i) improving the quality of sentences in the dataset by upgrading the ASR system to reduce speech-recognition errors, and by prompting an LLM to transform noisy ASR transcripts into descriptive steps; (ii) proposing a Transformer-based architecture that treats all texts as queries, iteratively attending to the visual features, to temporally align the generated steps with their corresponding video segments. To measure the quality of the curated dataset, we train models on it for the task of multi-sentence grounding: given a long-form video and multiple associated sentences, determine all of their corresponding timestamps in the video simultaneously. The resulting model shows superior performance on a series of multi-sentence grounding tasks, surpassing existing state-of-the-art methods by a significant margin on three public benchmarks: 9.0% on HT-Step, 5.1% on HTM-Align, and 1.9% on CrossTask. All code, models, and the resulting dataset have been publicly released.
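The core alignment idea in the abstract, step sentences acting as queries that attend over a video's visual features to pick out their timestamps, can be illustrated with a minimal single-pass sketch. This is not the paper's actual iterative Transformer architecture; it is a simplified NumPy cross-attention over hypothetical precomputed step and frame embeddings, where each step's attention peak is read off as its predicted segment.

```python
import numpy as np

def ground_steps(step_embs, frame_embs):
    """Align each descriptive step to its best-matching video frame.

    step_embs:  (S, D) array, one embedding per step sentence (the queries)
    frame_embs: (T, D) array, one embedding per video frame/segment (the keys)
    Returns the (S, T) attention over time and each step's argmax frame index.
    """
    # Scaled dot-product similarity: text queries attend to visual features.
    scores = step_embs @ frame_embs.T / np.sqrt(step_embs.shape[1])
    # Softmax over the temporal axis gives a distribution per step.
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)
    # The attention peak serves as the predicted timestamp for each step.
    return attn, attn.argmax(axis=1)
```

In the full model, this attention would be stacked and iterated so that steps refine their temporal estimates over several decoding rounds, rather than committing to a single dot-product pass.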
