Fine-grained Controllable Video Generation via Object Appearance and Context (2312.02919v1)

Published 5 Dec 2023 in cs.CV

Abstract: Text-to-video generation has shown promising results. However, because these models take only natural language as input, users often struggle to provide the detailed information needed to precisely control the output. In this work, we propose fine-grained controllable video generation (FACTOR) to achieve detailed control. Specifically, FACTOR aims to control objects' appearances and context, including their location and category, in conjunction with the text prompt. To achieve detailed control, we propose a unified framework that jointly injects control signals into an existing text-to-video model. Our model consists of a joint encoder and adaptive cross-attention layers. By optimizing the encoder and the inserted layers, we adapt the model to generate videos aligned with both the text prompt and the fine-grained control. Compared to existing methods that rely on dense control signals such as edge maps, our approach offers a more intuitive and user-friendly interface for object-level fine-grained control. Our method achieves controllability of object appearances without per-subject finetuning, reducing the optimization effort required from users. Extensive experiments on standard benchmark datasets and user-provided inputs validate that our model obtains a 70% improvement in controllability metrics over competitive baselines.
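
The following is a minimal sketch of the control-injection idea the abstract describes: a joint encoder maps per-object control signals (appearance reference, category, and location) into a shared token sequence, and an inserted adaptive cross-attention layer lets the frozen text-to-video backbone's features attend to those tokens. All module names, dimensions, and the zero-initialized gate are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch (PyTorch) of FACTOR-style object-level control injection.
# Names and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class JointControlEncoder(nn.Module):
    """Encodes per-object control signals into a shared token sequence."""

    def __init__(self, appearance_dim=768, num_categories=1000, d_model=768):
        super().__init__()
        self.appearance_proj = nn.Linear(appearance_dim, d_model)
        self.category_embed = nn.Embedding(num_categories, d_model)
        self.box_proj = nn.Linear(4, d_model)  # (x, y, w, h), normalized

    def forward(self, appearance, category, boxes):
        # appearance: (B, N, appearance_dim) reference-image features
        # category:   (B, N) integer class ids
        # boxes:      (B, N, 4) object locations per frame or keyframe
        return (
            self.appearance_proj(appearance)
            + self.category_embed(category)
            + self.box_proj(boxes)
        )  # (B, N, d_model) control tokens


class AdaptiveCrossAttention(nn.Module):
    """Inserted layer: backbone features attend to the control tokens.

    The zero-initialized gate (an assumption here, common in adapter-style
    methods) makes the layer an identity at the start of training, so only
    the new parameters need optimizing while the backbone stays frozen.
    """

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # identity at init

    def forward(self, video_feats, control_tokens):
        # video_feats: (B, T*H*W, d_model) flattened spatio-temporal features
        out, _ = self.attn(self.norm(video_feats), control_tokens, control_tokens)
        return video_feats + self.gate * out
```

Under this reading, only the joint encoder and the inserted cross-attention layers are trained, which matches the abstract's claim that object appearance can be controlled without per-subject finetuning: a new subject only requires encoding its reference features, not optimizing the model.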

Authors (7)
  1. Hsin-Ping Huang (10 papers)
  2. Yu-Chuan Su (22 papers)
  3. Deqing Sun (68 papers)
  4. Lu Jiang (90 papers)
  5. Xuhui Jia (22 papers)
  6. Yukun Zhu (33 papers)
  7. Ming-Hsuan Yang (377 papers)
Citations (7)