Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations (2303.17839v1)

Published 31 Mar 2023 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.MM

Abstract: The abundance of instructional videos and their narrations on the Internet offers an exciting avenue for understanding procedural activities. In this work, we propose to learn a video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations, without using human annotations. Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering. We empirically demonstrate that learning temporal ordering not only enables new capabilities for procedure reasoning, but also reinforces the recognition of individual steps. Our model significantly advances the state-of-the-art results on step classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting (+7.4% on COIN). Moreover, our model attains promising results in zero-shot inference for step classification and forecasting, as well as in predicting diverse and plausible steps for incomplete procedures. Our code is available at https://github.com/facebookresearch/ProcedureVRL.

Authors (6)
  1. Yiwu Zhong (16 papers)
  2. Licheng Yu (47 papers)
  3. Yang Bai (205 papers)
  4. Shangwen Li (5 papers)
  5. Xueting Yan (4 papers)
  6. Yin Li (150 papers)
Citations (22)
