Pix2Gif: Motion-Guided Diffusion for GIF Generation
Abstract: We present Pix2Gif, a motion-guided diffusion model for image-to-GIF (video) generation. We tackle this problem differently by formulating the task as an image translation problem steered by text and motion magnitude prompts, as shown in the teaser figure. To ensure that the model adheres to motion guidance, we propose a new motion-guided warping module that spatially transforms the features of the source image conditioned on the two types of prompts. Furthermore, we introduce a perceptual loss that keeps the transformed feature map within the same space as the target image, preserving content consistency and coherence. To prepare for model training, we meticulously curated data by extracting coherent image frames from the TGIF video-caption dataset, which provides rich information about the temporal changes of subjects. After pretraining, we apply our model in a zero-shot manner to a number of video datasets. Extensive qualitative and quantitative experiments demonstrate the effectiveness of our model: it captures not only the semantic prompt from the text but also the spatial prompt from the motion guidance. We train all our models using a single node of 16 V100 GPUs. Code, dataset and models are made public at: https://hiteshk03.github.io/Pix2Gif/.
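To make "spatially transform the features of the source image" concrete, the following is a minimal sketch of backward warping a feature map with a displacement field, using bilinear interpolation in NumPy. It is an illustration under stated assumptions, not the Pix2Gif module: the function name `warp_features`, the `(C, H, W)` layout, and the hand-supplied `flow` array are all hypothetical. How the model derives its transformation from the text and motion magnitude prompts is not specified in the abstract.

```python
import numpy as np

def warp_features(feat, flow):
    """Backward-warp a (C, H, W) feature map with a (2, H, W) displacement field.

    Assumed convention (not from the paper): flow[0] is the x displacement and
    flow[1] the y displacement, in pixels. Output location (y, x) samples the
    input at (y + flow[1], x + flow[0]) with bilinear interpolation. Sample
    positions falling outside the map are clamped to the border.
    """
    c, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")

    # Continuous sampling coordinates, clamped to the valid range.
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)

    # Integer corners surrounding each sampling coordinate.
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets act as the bilinear weights.
    wx = sx - x0
    wy = sy - y0

    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Usage: a larger displacement moves the content further.
feat = np.arange(12, dtype=float).reshape(1, 3, 4)
shift_one = np.zeros((2, 3, 4))
shift_one[0] = 1.0  # sample one pixel to the right everywhere
warped = warp_features(feat, shift_one)
```

In this sketch, scaling `flow` by a scalar is one simple way to picture how a motion magnitude could control the amount of displacement. That reading is an assumption made for illustration only.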