CI w/o TN: Context Injection without Task Name for Procedure Planning (2402.15579v1)

Published 23 Feb 2024 in cs.CV and cs.CL

Abstract: This paper explores the challenge of procedure planning in instructional videos, which involves creating goal-directed plans from visual start and goal observations. Previous research has tackled this problem with progressively weaker training supervision, moving from heavy intermediate visual observations or language instructions to task-class supervision. With the advent of LLMs, these models can produce a detailed plan given only the task name. In this study, we propose a much weaker setting without the task name as supervision, which existing LLMs cannot currently solve because they require well-crafted prompts with sufficient information. Specifically, we hypothesize that the earlier intermediate supervisions can serve as context information, and we use captions of the visual start and goal observations as a much cheaper form of supervision. This approach greatly reduces labeling cost, since the captions can be easily obtained from large pre-trained vision-language foundation models. Technically, we apply BLIP to generate captions as supervision and train the context feature with a contrastive learning loss; the context feature is then fed into the generator to aid plan generation. Our experiments on two datasets of varying scale demonstrate that our model achieves comparable performance on multiple metrics, validating our hypothesis.
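The abstract describes the core training recipe: captions of the start and goal observations (produced by BLIP) supervise a learned context feature via a contrastive loss, and that feature is then passed to the plan generator. Below is a minimal PyTorch sketch of that idea; it is not the authors' code, and the encoder architecture, feature dimensions, temperature, and InfoNCE-style formulation are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): train a "context feature"
# by contrasting it against caption embeddings of the start/goal observations.
# All module names, dimensions, and the loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextEncoder(nn.Module):
    """Maps concatenated start/goal visual features to a context feature."""

    def __init__(self, visual_dim: int = 512, context_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * visual_dim, 512),
            nn.ReLU(),
            nn.Linear(512, context_dim),
        )

    def forward(self, start_feat: torch.Tensor, goal_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([start_feat, goal_feat], dim=-1))


def contrastive_context_loss(context: torch.Tensor, caption_emb: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each context feature should match its own caption
    embedding (e.g., an encoded BLIP caption) rather than those of other
    samples in the batch."""
    context = F.normalize(context, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    logits = context @ caption_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(context.size(0), device=context.device)
    # Symmetric cross-entropy: context -> caption and caption -> context.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, visual_dim, text_dim = 8, 512, 256
    encoder = ContextEncoder(visual_dim=visual_dim, context_dim=text_dim)
    start = torch.randn(B, visual_dim)   # visual feature of the start observation
    goal = torch.randn(B, visual_dim)    # visual feature of the goal observation
    captions = torch.randn(B, text_dim)  # stand-in for encoded BLIP captions
    loss = contrastive_context_loss(encoder(start, goal), captions)
    print(loss.item())
```

In the paper's pipeline, the caption embeddings would come from captions generated by BLIP on the start and goal frames, and the trained context feature would then be fed into the plan generator; the random tensors above are placeholders for those inputs.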

References (36)
  1. Yazan Abu Farha and Juergen Gall. 2019. Uncertainty-aware anticipation of activities.
  2. STAP: Sequencing task-agnostic policies. arXiv preprint arXiv:2210.12250.
  3. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691.
  4. Procedure planning in instructional videos via contextual modeling and model-based policy learning.
  5. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 33:1877–1901.
  6. Procedure planning in instructional videos.
  7. Who let the dogs out? Modeling dog behavior from visual data.
  8. Automatic goal generation for reinforcement learning agents.
  9. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819.
  10. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. arXiv preprint arXiv:2201.07207.
  11. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608.
  12. Leslie Pack Kaelbling. 1993. Hierarchical learning in stochastic domains: Preliminary results.
  13. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
  14. Code as policies: Language model programs for embodied control. arXiv preprint arXiv:2209.07753.
  15. It is not the journey but the destination: Endpoint conditioned trajectory prediction.
  16. Learning and verification of task structure in instructional videos. arXiv preprint arXiv:2303.13519.
  17. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  18. Language models are unsupervised multitask learners.
  19. Top-down visual attention from analysis by synthesis. arXiv preprint arXiv:2303.13043.
  20. ProgPrompt: Generating situated robot task plans using large language models. arXiv preprint arXiv:2209.11302.
  21. Universal planning networks: Learning generalizable representations for visuomotor control.
  22. PlaTe: Visually-grounded planning with transformers in procedural tasks. arXiv preprint arXiv:2109.04869v1.
  23. ViperGPT: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128.
  24. COIN: A large-scale dataset for comprehensive instructional video analysis.
  25. ZeroCap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17918–17928.
  26. Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269.
  27. PDPP: Projected diffusion for procedure planning in instructional videos. arXiv preprint arXiv:2303.14676.
  28. From association to generation: Text-only captioning by unsupervised cross-modal mapping. arXiv preprint arXiv:2304.13273.
  29. Cap4Video: What can auxiliary captions do for text-video retrieval? arXiv preprint arXiv:2301.00184.
  30. VisualHow: Multimodal problem solving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15627–15637.
  31. Video probabilistic diffusion models in projected latent space. arXiv preprint arXiv:2302.07685.
  32. Parsel: A unified natural language framework for algorithmic reasoning. arXiv preprint arXiv:2212.10561.
  33. P3IV: Probabilistic procedure planning from instructional videos with weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2938–2948.
  34. Learning procedure-aware video representation from instructional videos and their narrations. arXiv preprint arXiv:2303.17839.
  35. Procedure-aware pretraining for instructional video understanding. arXiv preprint arXiv:2303.18230.
  36. Cross-task weakly supervised learning from instructional videos.
Authors (1)
  1. Xinjie Li (12 papers)
