
Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation (2306.03988v2)

Published 6 Jun 2023 in cs.CV and cs.AI

Abstract: We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate unseen realistic object-to-object interactions. Although our model has never been given the explicit segmentation and motion of each object in the scene during training, it is able to implicitly separate their dynamics and extents. Key components in our method are the randomized conditioning scheme, the encoding of the input motion control, and the randomized and sparse sampling that enables generalization to out-of-distribution but realistic correlations. Our model, which we call YODA, therefore has the ability to move objects without physically touching them. Through extensive qualitative and quantitative evaluations on several datasets, we show that YODA is on par with or better than prior state-of-the-art video generation work in terms of both controllability and video quality.
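
The abstract highlights two mechanisms worth unpacking: encoding a sparse motion input as a conditioning signal, and randomly dropping that conditioning during training so the model also learns unconstrained dynamics. Below is a minimal PyTorch sketch of both ideas under stated assumptions; the function names, the (x, y, dx, dy) drag format, and the zero-map dropout are illustrative choices, not the paper's actual implementation.

```python
import torch

def encode_sparse_motion(drags, height, width):
    """Rasterize a few user drags into a dense conditioning map.

    `drags` is a list of (x, y, dx, dy) tuples: a pixel location and the
    displacement requested there. This format is an assumption made for
    illustration; YODA's actual motion encoding may differ.
    """
    motion = torch.zeros(2, height, width)  # per-pixel (dx, dy)
    mask = torch.zeros(1, height, width)    # 1 where a drag was given
    for x, y, dx, dy in drags:
        motion[0, y, x] = dx
        motion[1, y, x] = dy
        mask[0, y, x] = 1.0
    return torch.cat([motion, mask], dim=0)  # (3, H, W) conditioning map

def randomized_conditioning(cond, p_drop=0.1):
    """Randomly zero out the control signal during training.

    A common scheme for learning both conditional and unconditional
    dynamics; whether YODA's randomized conditioning takes exactly this
    form is an assumption.
    """
    if torch.rand(()) < p_drop:
        return torch.zeros_like(cond)
    return cond

# Example: one drag of 5 px right, 2 px down at pixel (32, 48) on a 64x64 frame
cond = encode_sparse_motion([(32, 48, 5.0, 2.0)], 64, 64)
cond = randomized_conditioning(cond)
print(cond.shape)  # torch.Size([3, 64, 64])
```

The separate mask channel lets the generator distinguish "zero motion requested here" from "no constraint here", which matters when the control is sparse: most pixels carry no user input at all, and the model must fill in plausible dynamics for them.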
