GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos (2312.07322v2)
Abstract: We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories, respectively, outperforming prior work by a large margin.
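The interface the abstract describes is an image of the initial object state plus a text prompt, producing an image of the action or resulting transformation. As a minimal sketch of that conditioning setup (GenHowTo's own released model is not assumed here), the snippet below uses the public InstructPix2Pix pipeline from Hugging Face diffusers, one of the prior methods the paper compares against, as a stand-in; the file names and prompt are illustrative placeholders.

```python
# Sketch of image+text-conditioned generation for object state transformations.
# Assumption: we stand in the GenHowTo model with the public InstructPix2Pix
# checkpoint; file names and the prompt below are hypothetical examples.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

# Initial state: a frame showing the scene before the action.
image = Image.open("initial_state.jpg").convert("RGB").resize((512, 512))

# Text prompt describing the targeted transformation.
prompt = "cut the tomato into slices"

# guidance_scale steers adherence to the text prompt;
# image_guidance_scale controls how strongly the input image is preserved.
result = pipe(
    prompt,
    image=image,
    num_inference_steps=50,
    guidance_scale=7.5,
    image_guidance_scale=1.5,
).images[0]
result.save("transformed_state.jpg")
```

Raising `image_guidance_scale` keeps the output closer to the input scene (the environment-preservation property the abstract emphasizes), at the cost of weaker transformations; the paper's quantitative comparison is against methods of exactly this kind.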