
Step Differences in Instructional Video (2404.16222v2)

Published 24 Apr 2024 in cs.CV

Abstract: Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned LLM to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff


Summary

  • The paper introduces a novel video-conditioned language model (VCLM) for automatically comparing steps between pairs of instructional videos.
  • The authors propose an innovative method for generating a large dataset and introduce a benchmark for evaluating video comparison tasks.
  • Experiments show the VCLM achieves state-of-the-art results on difference captioning, recognition, and ranking, enabling potential AR/VR applications.

Analyzing the Step Differences in Instructional Video Paper

The paper "Step Differences in Instructional Video" by Tushar Nagarajan and Lorenzo Torresani addresses a specific challenge in AR/VR applications: the automatic comparison of user-generated content against reference instructional videos to provide personalized assistance. Central to this work is the ability of AR/VR systems to detect and describe differences between pairs of instructional videos, a task vital for applications like progress tracking and mistake detection. This paper is situated within the context of leveraging large datasets and AI to enhance instructional video understanding.

Core Contributions

The authors present a novel approach that utilizes a video-conditioned LLM (VCLM) to compare instructional videos. This involves several key steps:

  1. Dataset Generation: The paper introduces a method for automatically generating a large-scale training dataset from the HowTo100M collection. By leveraging existing step annotations and accompanying narrations, the framework pairs video segments annotated with action descriptions and object detections, and then uses an LLM (LLaMA) to create question-answer pairs about the differences between the paired segments (see the sketch after this list).
  2. Modeling Approach: The proposed VCLM is trained to recognize and articulate differences between paired video segments. This model uniquely conditions its reasoning on visual data from two videos, allowing it to answer questions that require joint reasoning across both.
  3. Benchmark Introduction: The authors introduce a benchmark dataset for evaluating models on video comparison tasks. This dataset includes 6292 video pairs manually annotated with difference captions across various categories such as tools and techniques, which facilitates a robust assessment of model performance in detecting and categorizing video differences.
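
To make the data-generation recipe concrete, here is a minimal Python sketch of the pairing-and-prompting idea: segments from different videos that share a step annotation are paired, and their narrations and detected objects are formatted into a prompt for an instruction-tuned LLM. All names (Segment, pair_segments, build_prompt, some_llm) and the prompt wording are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of paired-data generation for difference QA, assuming
# step annotations, ASR narrations, and object detections are already available.

from dataclasses import dataclass

@dataclass
class Segment:
    video_id: str
    step_label: str          # step annotation, e.g. "whisk the eggs"
    narration: str           # ASR narration for the segment
    objects: list[str]       # detected objects, e.g. ["whisk", "bowl"]

def pair_segments(segments: list[Segment]) -> list[tuple[Segment, Segment]]:
    """Pair segments from different videos that cover the same step."""
    by_step: dict[str, list[Segment]] = {}
    for seg in segments:
        by_step.setdefault(seg.step_label, []).append(seg)
    pairs = []
    for segs in by_step.values():
        for a in segs:
            for b in segs:
                if a.video_id < b.video_id:   # avoid self-pairs and duplicates
                    pairs.append((a, b))
    return pairs

def build_prompt(a: Segment, b: Segment) -> str:
    """Format the paired annotations into a prompt asking for a difference QA pair."""
    return (
        f"Both clips show the step: {a.step_label}.\n"
        f"Clip A narration: {a.narration}\nClip A objects: {', '.join(a.objects)}\n"
        f"Clip B narration: {b.narration}\nClip B objects: {', '.join(b.objects)}\n"
        "Write a question-answer pair about how the two clips differ "
        "(e.g. tools, ingredients, technique)."
    )

# for a, b in pair_segments(all_segments):
#     qa_text = some_llm.generate(build_prompt(a, b))  # e.g. a LLaMA-family model
```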

Experimental Insights

The experiments demonstrate that the proposed model achieves state-of-the-art performance on the newly introduced tasks of Difference Captioning (DiffCap), Difference Recognition (DiffMCQ), and Difference Ranking (DiffRank). The VCLM framework is shown to excel particularly in complex scenarios requiring nuanced understanding, such as identifying subtle variations in tools or techniques.

  1. Difference Captioning: The model generates descriptions of differences that outperform existing baselines, validating the use of weak supervision from automatically generated data.
  2. Difference Recognition and Ranking: By jointly reasoning over both videos, the model distinguishes and ranks videos based on the severity of their differences, a capability crucial for personalized assistance applications (a sketch of one way to score such a ranking follows below).
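
As a rough illustration of how a ranking-style evaluation can be scored, the snippet below computes pairwise ranking accuracy between annotator severity ranks and model-predicted scores. The metric definition here is an assumption for illustration only; the paper's exact DiffRank protocol and metrics may differ.

```python
# Illustrative scoring for a ranking-style evaluation: count how many candidate
# pairs are ordered the same way by the ground-truth ranks and the model scores.

from itertools import combinations

def pairwise_ranking_accuracy(gt_ranks: list[int], pred_scores: list[float]) -> float:
    """Fraction of candidate pairs whose predicted order matches the ground truth."""
    correct, total = 0, 0
    for i, j in combinations(range(len(gt_ranks)), 2):
        if gt_ranks[i] == gt_ranks[j]:
            continue  # skip ties
        total += 1
        gt_i_better = gt_ranks[i] < gt_ranks[j]        # lower rank = closer to reference
        pred_i_better = pred_scores[i] > pred_scores[j]
        correct += int(gt_i_better == pred_i_better)
    return correct / total if total else 0.0

# Example: three candidate videos ranked 1, 2, 3 by annotators.
print(pairwise_ranking_accuracy([1, 2, 3], [0.9, 0.5, 0.7]))  # -> 0.666...
```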

Implications and Future Directions

The approach highlights the potential of using AI to bridge the gap in personalized instructional content by analyzing procedural details within videos. Going forward, the integration of such video-conditioned models into AR/VR ecosystems could significantly enhance user interaction by providing real-time feedback and advice.

Moreover, the paper opens avenues for future work in other domains of AI video analysis. For instance, integrating this differential analysis capability with retrieval systems could enhance video search by allowing users to query complex, high-level video differences, thereby advancing domains like content-based video retrieval and automated content curation.

Overall, this research provides a crucial step towards sophisticated, interactive video analysis systems that could eventually lead to significant advancements in how users consume and engage with instructional content. As AI models continue to evolve, their application in contexts requiring fine-grained visual understanding and reasoning promises widespread implications for educational technology and beyond.
